Copyright 2021 Hewlett Packard Enterprise Development LP
This guide contains procedures for upgrading systems running CSM 0.9.3 to CSM 0.9.4. It is intended for system installers, system administrators, and network administrators. It assumes some familiarity with standard Linux and associated tooling.
See CHANGELOG.md in the root of a CSM release distribution for a summary of changes in each CSM release. This patch includes the following changes:
Adds registry.local and packages.local to the /etc/hosts files on the worker nodes.
Changes a Kubernetes service's externalTrafficPolicy from Cluster to Local.
For convenience, these procedures make use of environment variables. This section sets the expected environment variables to appropriate values.
Start a typescript to capture the commands and output from this procedure.
ncn-m001# script -af csm-update.$(date +%Y-%m-%d).txt
ncn-m001# export PS1='\u@\H \D{%Y-%m-%d} \t \w # '
Set CSM_SYSTEM_VERSION to 0.9.3:
ncn-m001# CSM_SYSTEM_VERSION="0.9.3"
NOTE: Installed CSM versions may be listed from the product catalog using:
ncn-m001# kubectl -n services get cm cray-product-catalog -o jsonpath='{.data.csm}' | yq r -j - | jq -r 'keys[]' | sed '/-/!{s/$/_/}' | sort -V | sed 's/_$//'
Set CSM_DISTDIR to the directory of the extracted release distribution for CSM 0.9.4:
NOTE: Use the --no-same-owner and --no-same-permissions options to tar when extracting a CSM release distribution as root to ensure the extracted files are owned by root and have permissions based on the current umask value.
If using a release distribution:
ncn-m001# tar --no-same-owner --no-same-permissions -zxvf csm-0.9.4.tar.gz
ncn-m001# CSM_DISTDIR="$(pwd)/csm-0.9.4"
Otherwise, if using a hotfix distribution:
ncn-m001# CSM_HOTFIX="csm-0.9.4-hotfix-0.0.1"
ncn-m001# tar --no-same-owner --no-same-permissions -zxvf ${CSM_HOTFIX}.tar.gz
ncn-m001# CSM_DISTDIR="$(pwd)/${CSM_HOTFIX}"
ncn-m001# echo $CSM_DISTDIR
Set CSM_RELEASE_VERSION to the version reported by ${CSM_DISTDIR}/lib/version.sh:
ncn-m001# CSM_RELEASE_VERSION="$(${CSM_DISTDIR}/lib/version.sh --version)"
ncn-m001# echo $CSM_RELEASE_VERSION
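For a release distribution, the echoed value should match the patch version being installed, e.g.:
0.9.4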
Download and install/upgrade the latest workaround and documentation RPMs. If this machine does not have direct internet access, these RPMs will need to be downloaded externally and then copied to the system for installation.
ncn-m001# rpm -Uvh https://storage.googleapis.com/csm-release-public/shasta-1.4/docs-csm/docs-csm-latest.noarch.rpm
ncn-m001# rpm -Uvh https://storage.googleapis.com/csm-release-public/shasta-1.4/csm-install-workarounds/csm-install-workarounds-latest.noarch.rpm
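If external download is needed, one possible approach is sketched below; the external host prompt and destination path are examples only:
external# curl -LO https://storage.googleapis.com/csm-release-public/shasta-1.4/docs-csm/docs-csm-latest.noarch.rpm
external# curl -LO https://storage.googleapis.com/csm-release-public/shasta-1.4/csm-install-workarounds/csm-install-workarounds-latest.noarch.rpm
external# scp docs-csm-latest.noarch.rpm csm-install-workarounds-latest.noarch.rpm root@<system>:/root/
ncn-m001# rpm -Uvh /root/docs-csm-latest.noarch.rpm /root/csm-install-workarounds-latest.noarch.rpm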
After completing the previous step, apply the workaround in the following directory, even if it has been previously applied on the system:
/opt/cray/csm/workarounds/livecd-post-reboot/CASMINST-2689
See the README.md file in that directory for instructions on how to apply the workaround; it requires running a script.
Set CSM_SCRIPTDIR to the scripts directory included in the docs-csm RPM for the CSM 0.9.4 upgrade:
ncn-m001# CSM_SCRIPTDIR=/usr/share/doc/metal/upgrade/0.9/csm-0.9.4/scripts
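A quick sanity check that the scripts directory exists (the file listing will vary by release):
ncn-m001# ls "${CSM_SCRIPTDIR}"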
It is important to first verify a healthy starting state. To do this, run the CSM validation checks. If any problems are found, correct them and verify the appropriate validation checks before proceeding.
Run the update-host-records.sh script to update /etc/hosts on the NCN worker nodes:
ncn-m001# "${CSM_SCRIPTDIR}/update-host-records.sh"
Check for a manually created unbound-psp and delete it if found; Helm will manage the PSP during the upgrade.
ncn-m001# ${CSM_SCRIPTDIR}/check-unbound-psp.sh
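For a manual check, the PSP can also be queried directly; this assumes the manually created policy is named unbound-psp, and kubectl will report NotFound if it does not exist:
ncn-m001# kubectl get psp unbound-psp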
Run the remove-service-repos.sh script to remove repositories that are external to the system.
ncn-m001# ${CSM_SCRIPTDIR}/remove-service-repos.sh
Run lib/setup-nexus.sh to configure Nexus and upload new CSM RPM repositories, container images, and Helm charts:
ncn-m001# cd "$CSM_DISTDIR"
ncn-m001# ./lib/setup-nexus.sh
On success, setup-nexus.sh will output OK on stderr and exit with status code 0, e.g.:
ncn-m001# ./lib/setup-nexus.sh
...
+ Nexus setup complete
setup-nexus.sh: OK
ncn-m001# echo $?
0
In the event of an error, consult the known issues from the install documentation to resolve potential problems and then try running setup-nexus.sh again. Note that subsequent runs of setup-nexus.sh may report FAIL when uploading duplicate assets. This is OK as long as setup-nexus.sh outputs setup-nexus.sh: OK and exits with status code 0.
Run the vcs-backup.sh script to back up all VCS content to a temporary location.
ncn-m001# "${CSM_SCRIPTDIR}/vcs-backup.sh"
Confirm the local tar file vcs.tar was created. It contains the Git repository data and will be needed in the restore step. Once upgrade.sh is run, the Git data will not be recoverable if this step failed.
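One way to confirm the backup is to list the archive and sample its contents (output will vary by system):
ncn-m001# ls -lh vcs.tar
ncn-m001# tar -tf vcs.tar | head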
If vcs.tar was successfully created, run vcs-prep.sh. This will remove the existing PVC in preparation for the upgrade.
ncn-m001# "${CSM_SCRIPTDIR}/vcs-prep.sh"
It is also recommended to save the VCS password to a safe location prior to making changes to VCS. The current password can be retrieved with:
ncn-m001# kubectl get secret -n services vcs-user-credentials --template={{.data.vcs_password}} | base64 --decode; echo
If you manage customizations.yaml in an external Git repository (as recommended), then clone a local working tree, e.g.:
ncn-m001# git clone <URL> site-init
ncn-m001# cd site-init
Otherwise, extract customizations.yaml from the site-init secret:
ncn-m001# cd /tmp
ncn-m001# kubectl -n loftsman get secret site-init -o jsonpath='{.data.customizations\.yaml}' | base64 -d - > customizations.yaml
Remove the Gitea PVC configuration from customizations.yaml:
ncn-m001# yq d -i customizations.yaml 'spec.kubernetes.services.gitea.cray-service.persistentVolumeClaims'
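To confirm the key was removed, read it back with the same v3 yq syntax used above; the command should produce no output:
ncn-m001# yq r customizations.yaml 'spec.kubernetes.services.gitea.cray-service.persistentVolumeClaims'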
Update the site-init secret:
ncn-m001# kubectl delete secret -n loftsman site-init
ncn-m001# kubectl create secret -n loftsman generic site-init --from-file=customizations.yaml
Commit changes to customizations.yaml if using an external Git repository, e.g.:
ncn-m001# git add customizations.yaml
ncn-m001# git commit -m 'Remove Gitea PVC configuration from customizations.yaml'
ncn-m001# git push
Run upgrade.sh to deploy upgraded CSM applications and services:
ncn-m001# cd "$CSM_DISTDIR"
ncn-m001# ./upgrade.sh
Note: If you have not already installed the workload manager product, including Slurm and MUNGE, then the cray-crus pod is expected to be in the Init state. After running upgrade.sh, you may observe there are now two copies of the cray-crus pod in the Init state. This situation is benign and should resolve itself once the workload manager product is installed.
NOTE: This fix only applies to new sessions and will not correct sessions that are already stuck.
Delete all CFS sessions that are stuck (those that have never started):
ncn-m001# cray cfs sessions list --format json | jq -r '.[] | select(.status.session.startTime==null) | .name' | while read name ; do cray cfs sessions delete $name; done
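To confirm that no stuck sessions remain, re-run the same filter; it should print nothing:
ncn-m001# cray cfs sessions list --format json | jq -r '.[] | select(.status.session.startTime==null) | .name'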
NOTE: For Gigabyte or Intel NCNs, skip this section.
Deploy the set-bmc-ntp-dns.sh script (and its helper script make_api_call.py) to each NCN except ncn-m001:
ncn-m001# for h in $( grep ncn /etc/hosts | grep nmn | grep -v m001 | awk '{print $2}' ); do
ssh $h "mkdir -p /opt/cray/ncn"
scp "${CSM_SCRIPTDIR}/make_api_call.py" "${CSM_SCRIPTDIR}/set-bmc-ntp-dns.sh" root@$h:/opt/cray/ncn/
ssh $h "chmod 755 /opt/cray/ncn/set-bmc-ntp-dns.sh"
done
Run the /opt/cray/ncn/set-bmc-ntp-dns.sh script on each NCN except ncn-m001. Pass -h to see some examples, and use the information below to run the script.
The following process can restore NTP and DNS server values after a firmware update to HPE NCNs. If you update the System ROM of an NCN, you will lose NTP and DNS server values. Correctly setting these also allows FAS to function properly.
ncn# M001_HMN_IP=$(cat /etc/hosts | grep m001.hmn | awk '{print $1}')
ncn# echo $M001_HMN_IP
10.254.1.4
ncn# BMC=ncn-<NCN name>-mgmt # e.g. ncn-w003-mgmt
ncn# export USERNAME=root
ncn# export IPMI_PASSWORD=changeme
Display the current settings:
ncn# /opt/cray/ncn/set-bmc-ntp-dns.sh ilo -H $BMC -s
Set the NTP servers to time-hmn and ncn-m001:
ncn# /opt/cray/ncn/set-bmc-ntp-dns.sh ilo -H $BMC -S -N "time-hmn,$M001_HMN_IP" -n
Set the DNS servers to Unbound (10.94.100.225) and ncn-m001:
ncn# /opt/cray/ncn/set-bmc-ntp-dns.sh ilo -H $BMC -D "10.94.100.225,$M001_HMN_IP" -d
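To confirm the new values took effect, display the settings again using the same show option from the first command above:
ncn# /opt/cray/ncn/set-bmc-ntp-dns.sh ilo -H $BMC -s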
NOTE: These scripts should be run from a Kubernetes NCN (manager or worker). Also note that it can take several minutes for the target-down alerts to clear after the scripts have been executed.
Run the fix-kube-proxy-target-down-alert.sh script to fix the kube-proxy alert.
ncn-m001# "${CSM_SCRIPTDIR}/fix-kube-proxy-target-down-alert.sh"
Run the fix-kubelet-target-down-alert.sh script to fix the kubelet alert.
ncn-m001# "${CSM_SCRIPTDIR}/fix-kubelet-target-down-alert.sh"
Verify that the zypper repository in Nexus that contains the golang-github-prometheus-node_exporter RPM is enabled. Typically this is the SUSE-SLE-Module-Basesystem-15-SP1-x86_64-Updates repository. If it is not enabled, enable it (or whichever Nexus repository contains the RPM) on all storage nodes. The easiest way to find the repository that contains this RPM is to log in to the Nexus UI at https://nexus.SYSTEM-NAME.cray.com, click the search icon in the navigation pane on the left, and enter golang-github-prometheus-node_exporter as the keyword. Click the search result with the latest version of the RPM; the repository name to use is listed at the top of that screen.
ncn-m001# for h in $( cat /etc/hosts | grep ncn-s | grep nmn | awk '{print $2}' ); do
ssh $h "zypper ar https://packages.local/repository/SUSE-SLE-Module-Basesystem-15-SP1-x86_64-Updates SUSE-SLE-Module-Basesystem-15-SP1-x86_64-Updates"
done
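A spot check that the repository is now present and enabled on a storage node (ncn-s001 is an example node name):
ncn-m001# ssh ncn-s001 'zypper lr -u | grep SUSE-SLE-Module-Basesystem-15-SP1-x86_64-Updates'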
Copy the install-node-exporter-storage.sh script out to the storage nodes.
ncn-m001# for h in $( cat /etc/hosts | grep ncn-s | grep nmn | awk '{print $2}' ); do
scp "${CSM_SCRIPTDIR}/install-node-exporter-storage.sh" root@$h:/tmp
done
Run the install-node-exporter-storage.sh script on each of the storage nodes to enable node-exporter.
NOTE: This script should be run on each storage node.
ncn-s# /tmp/install-node-exporter-storage.sh
NOTE: While running install-node-exporter-storage.sh, you may see an error similar to the following:
Error building the cache:
[SUSE-SLE-Module-Basesystem-15-SP1-x86_64-Updates|https://packages.local/repository/SUSE-SLE-Module-Basesystem-15-SP1-x86_64-Updates] Valid metadata not found at specified URL
History:
 - [SUSE-SLE-Module-Basesystem-15-SP1-x86_64-Updates|https://packages.local/repository/SUSE-SLE-Module-Basesystem-15-SP1-x86_64-Updates] Repository type can't be determined.
Warning: Skipping repository 'SUSE-SLE-Module-Basesystem-15-SP1-x86_64-Updates' because of the above error.
This error can be safely ignored.
The following error may occur for air-gapped systems that do not have connectivity to the internet:
Refreshing service 'Public_Cloud_Module_15_SP2_x86_64'.
Problem retrieving the repository index file for service 'Public_Cloud_Module_15_SP2_x86_64':
Download (curl) error for 'https://scc.suse.com/access/services/1973/repo/repoindex.xml?cookies=0&credentials=Public_Cloud_Module_15_SP2_x86_64':
Error code: Connection failed
Error message: Failed to connect to scc.suse.com port 443: Connection timed out
If this error is encountered, move the files out of the following directory on each storage node and re-run the install-node-exporter-storage.sh script:
/etc/zypp/services.d
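After the script completes successfully, a minimal check that node-exporter is responding, assuming the default node_exporter port of 9100:
ncn-s# curl -s http://localhost:9100/metrics | head -5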
Run the vcs-restore.sh script to restore all VCS content. This should be run from the same directory that vcs-backup.sh was run from so that the tar file can be located. If successful, this script will list the data files that have been restored.
ncn-m001# "${CSM_SCRIPTDIR}/vcs-restore.sh"
Re-run the csm-config-import job pod if it exists and is in the Error state.
Find the csm-config-import job pod:
ncn-m001# kubectl get pods -n services | grep csm-config-import
If the pod exists, confirm whether it is in the Error state. If it is, delete it:
ncn-m001# CSM_CONFIG_POD=$(kubectl get pods --no-headers -o custom-columns=":metadata.name" -n services | grep csm-config-import)
ncn-m001# echo $CSM_CONFIG_POD
ncn-m001# kubectl delete pod -n services $CSM_CONFIG_POD
Prevent the TPM kernel module from being loaded by the GRUB bootloader:
ncn-m001# "${CSM_SCRIPTDIR}/tpm-fix-install.sh"
IMPORTANT: Wait at least 15 minutes after upgrade.sh completes to let the various Kubernetes resources get initialized and started.
Run the following validation checks to ensure that everything is still working properly after the upgrade:
Other health checks may be run as desired.
CAUTION: The following HMS functional tests may fail because of locked components in HSM:
test_bss_bootscript_ncn-functional_remote-functional.tavern.yaml
test_smd_components_ncn-functional_remote-functional.tavern.yaml
Traceback (most recent call last):
  File "/usr/lib/python3.8/site-packages/tavern/schemas/files.py", line 106, in verify_generic
    verifier.validate()
  File "/usr/lib/python3.8/site-packages/pykwalify/core.py", line 166, in validate
    raise SchemaError(u"Schema validation failed:\n - {error_msg}.".format(
pykwalify.errors.SchemaError: <SchemaError: error code 2: Schema validation failed:
 - Key 'Locked' was not defined. Path: '/Components/0'.
 - Key 'Locked' was not defined. Path: '/Components/5'.
 - Key 'Locked' was not defined. Path: '/Components/6'.
 - Key 'Locked' was not defined. Path: '/Components/7'.
 - Key 'Locked' was not defined. Path: '/Components/8'.
 - Key 'Locked' was not defined. Path: '/Components/9'.
 - Key 'Locked' was not defined. Path: '/Components/10'.
 - Key 'Locked' was not defined. Path: '/Components/11'.
 - Key 'Locked' was not defined. Path: '/Components/12'.: Path: '/'>
Failures of these tests because of locked components as shown above can be safely ignored.
NOTE: If you plan to do any further CSM health validation, you should follow the validation procedures found in the CSM v1.0 documentation. Some of the information in the CSM v0.9 validation documentation is no longer accurate in CSM v1.0.
Verify the CSM version has been updated in the product catalog by confirming that the output of the following command includes version 0.9.4:
ncn-m001# kubectl get cm cray-product-catalog -n services -o jsonpath='{.data.csm}' | yq r -j - | jq -r 'to_entries[] | .key'
0.9.4
0.9.3
Confirm the import_date reflects the timestamp of the upgrade:
ncn-m001# kubectl get cm cray-product-catalog -n services -o jsonpath='{.data.csm}' | yq r - '"0.9.4".configuration.import_date'
Remember to exit your typescript.
ncn-m001# exit
It is recommended to save the typescript file for later reference.