Reminder: If any problems are encountered and the procedure or command output does not provide relevant guidance, see Relevant troubleshooting links for upgrade-related issues.
(`ncn-m001#`) If a typescript session is already running in the shell, then first stop it with the `exit` command.
(`ncn-m001#`) Start a typescript.

```bash
script -af /root/csm_upgrade.$(date +%Y%m%d_%H%M%S).stage_3_ncn-m001.txt
export PS1='\u@\H \D{%Y-%m-%d} \t \w # '
```
If additional shells are opened during this procedure, then record those with typescripts as well. When resuming a procedure after a break, always be sure that a typescript is running before proceeding.
**NOTE:** If Kubernetes encryption has been enabled via the Kubernetes Encryption Documentation, then back up the `/etc/cray/kubernetes/encryption` directory on the master node before upgrading, and restore the directory after the node has been upgraded.
(`ncn-m001#`) Run `ncn-upgrade-master-nodes.sh` for `ncn-m002`. Follow the output of the script carefully; the script will pause for manual interaction.

```bash
/usr/share/doc/csm/upgrade/scripts/upgrade/ncn-upgrade-master-nodes.sh ncn-m002
```
**NOTE:** The `root` user password for the node may need to be reset after it is rebooted. Additionally, the `/etc/cray/kubernetes/encryption` directory should be restored if it was backed up. Once it is restored, the `kube-apiserver` on the rebuilt node should be restarted. See Kubernetes `kube-apiserver` Failing for details on how to restart the `kube-apiserver`.
Repeat the previous step for each remaining master node (excluding `ncn-m001`), one at a time.
**NOTE:** If `ncn-upgrade-master-nodes.sh` fails, address the problems based on the failed test output, then run `ncn-upgrade-master-nodes.sh` again. There are several common failures, such as `goss` tests for clock skew. Some of these failures can be resolved by waiting several minutes before running `ncn-upgrade-master-nodes.sh` again. The script does not repeat steps that completed successfully; it runs only the failed and subsequent steps, so it can be executed numerous times without issue.
Before starting Stage 3.2 - Worker node image upgrade, access the Argo UI to view the progress of this stage. Note that the progress for the current stage will not show up in Argo before the worker node image upgrade script has been started.
For more information, see Using the Argo UI and Using Argo Workflows.
**NOTE:** One of the Argo steps (`wait-for-cfs`) will prevent the upgrade of a worker node from proceeding if the CFS component status for that worker is in an `Error` state, and this must be fixed in order for the upgrade to continue. The following commands can be used to reset the component state in CFS (replace `<XNAME>` below with the XNAME of the worker node):

```bash
cray cfs components update --error-count 0 <XNAME>
cray cfs components update --state '[]' <XNAME>
```
**NOTE:** When upgrading worker nodes that are running DVS, do not reboot compute nodes at the same time. This avoids restarting DVS clients and servers simultaneously.
There are two options available for upgrading worker nodes.
(`ncn-m001#`) Run `ncn-upgrade-worker-storage-nodes.sh` for `ncn-w001`. Follow the output of the script carefully; the script will pause for manual interaction.

```bash
/usr/share/doc/csm/upgrade/scripts/upgrade/ncn-upgrade-worker-storage-nodes.sh ncn-w001
```
**NOTE:** The `root` user password for the node may need to be reset after it is rebooted.
Repeat the previous steps for each other worker node, one at a time.
Multiple workers can be upgraded simultaneously by passing them as a comma-separated list into the upgrade script.
In some cases, it is not possible to upgrade all workers in one request. It is the system administrator's responsibility to make sure that the following conditions are met:

- If the system has more than five workers, then they cannot all be upgraded with a single request. In this case, split the upgrade into multiple requests, with each request specifying no more than five workers.
- No single upgrade request should include all of the worker nodes that have DVS running on them.
(`ncn-m001#`) An example of a single request to upgrade multiple worker nodes simultaneously:

```bash
/usr/share/doc/csm/upgrade/scripts/upgrade/ncn-upgrade-worker-storage-nodes.sh ncn-w002,ncn-w003,ncn-w004
```
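The batching constraints above can be sketched in shell. This is an illustrative snippet only (the node names are examples, not taken from any particular system): it splits a worker list into comma-separated groups of at most five, suitable for passing to the upgrade script one group at a time.

```shell
#!/bin/bash
# Illustrative sketch: split a list of worker NCNs into comma-separated
# batches of at most five nodes each (node names are examples only).
workers=(ncn-w001 ncn-w002 ncn-w003 ncn-w004 ncn-w005 ncn-w006 ncn-w007)
batch_size=5

for ((i = 0; i < ${#workers[@]}; i += batch_size)); do
    # Join the next slice of up to five names with commas.
    (IFS=,; echo "${workers[*]:i:batch_size}")
done
# Prints one comma-separated batch per line: a batch of five, then the rest.
```

Each printed line is the kind of argument the example command above expects; remember to also check that no batch contains all of the DVS worker nodes.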
ncn-m001 upgrade

By this point, all NCNs have been upgraded except for `ncn-m001`. In the upgrade process so far, `ncn-m001` has been the "stable node", that is, the node from which the other nodes were upgraded. At this point, the upgrade procedure pivots to use `ncn-m002` as the new "stable node", in order to allow the upgrade of `ncn-m001`.
For any typescripts that were started earlier on `ncn-m001`, stop them with the `exit` command.
(`ncn-m001#`) Create an archive of the artifacts.

```bash
BACKUP_TARFILE="csm_upgrade.pre_m001_reboot_artifacts.$(date +%Y%m%d_%H%M%S).tgz"
ls -d \
    /root/apply_csm_configuration.* \
    /root/csm_upgrade.* \
    /root/output.log 2>/dev/null |
sed 's_^/__' |
xargs tar -C / -czvf "/root/${BACKUP_TARFILE}"
```
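The `sed 's_^/__'` stage strips the leading slash from each path so that, combined with `tar -C /`, the archive stores paths relative to `/`. A quick illustration with one of the paths from the pipeline:

```shell
# Strip the leading slash so tar -C / stores the path relative to /
echo '/root/output.log' | sed 's_^/__'
# -> root/output.log
```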
(`ncn-m001#`) Upload the archive to S3 in the cluster.

```bash
cray artifacts create config-data "${BACKUP_TARFILE}" "/root/${BACKUP_TARFILE}"
```
Log out of `ncn-m001`.
Log in to ncn-m002 from outside the cluster.
**NOTE:** Very rarely, a password hash for the `root` user that works properly on a SLES SP2 NCN is not recognized on a SLES SP3 NCN. If password login fails, then log in to `ncn-m002` from `ncn-m001` and use the `passwd` command to reset the password. Then log in using the CMN IP address as directed below. Once `ncn-m001` has been upgraded, log in from `ncn-m002` and use the `passwd` command to reset the password. The other NCNs will have their passwords updated when NCN personalization is run in a subsequent step.
`ssh` to the `bond0.cmn0`/CMN IP address of `ncn-m002`.
(`ncn-m002#`) Start a typescript.

```bash
script -af /root/csm_upgrade.$(date +%Y%m%d_%H%M%S).stage_3_ncn-m002.txt
export PS1='\u@\H \D{%Y-%m-%d} \t \w # '
```
Authenticate with the Cray CLI on `ncn-m002`.
See Configure the Cray Command Line Interface for details on how to do this.
(`ncn-m002#`) Set upgrade variables.

```bash
source /etc/cray/upgrade/csm/myenv
```
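Sourcing this file defines variables (such as `CSM_RELEASE`) that later commands depend on. As a quick guard, the shell's `:?` expansion can fail loudly if the variable is missing; the value below is set purely for illustration:

```shell
# Illustrative guard only: CSM_RELEASE is given an example value here.
# On a real system it comes from sourcing /etc/cray/upgrade/csm/myenv.
CSM_RELEASE=1.4.0
: "${CSM_RELEASE:?CSM_RELEASE is unset - source /etc/cray/upgrade/csm/myenv first}"
echo "CSM_RELEASE=${CSM_RELEASE}"
```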
(`ncn-m002#`) Copy artifacts from `ncn-m001`. A later stage of the upgrade expects the `docs-csm` and `libcsm` RPMs to be located at `/root/` on `ncn-m002`; that is why this command copies them there.

Install `csi` and `docs-csm`.

```bash
scp ncn-m001:/root/csm_upgrade.pre_m001_reboot_artifacts.*.tgz /root
zypper --plus-repo="/etc/cray/upgrade/csm/csm-${CSM_RELEASE}/tarball/csm-${CSM_RELEASE}/rpm/cray/csm/sle-$(awk -F= '/VERSION=/{gsub(/["-]/, "") ; print tolower($NF)}' /etc/os-release)" --no-gpg-checks install -y cray-site-init
scp ncn-m001:/root/*.noarch.rpm /root/
rpm -Uvh --force /root/docs-csm-latest.noarch.rpm
```
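The `awk` expression embedded in the `zypper` command derives the repository suffix from `/etc/os-release`: it takes the `VERSION=` line, strips quotes and hyphens, and lowercases the result. For example, with the `VERSION` line a SLES15 SP4 node would have:

```shell
# Simulate the VERSION line from /etc/os-release on a SLES15 SP4 node
echo 'VERSION="15-SP4"' |
    awk -F= '/VERSION=/{gsub(/["-]/, "") ; print tolower($NF)}'
# -> 15sp4, so the repository path ends in sle-15sp4
```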
Install `libcsm`.

**NOTE:** `libcsm` depends on the Python versions included in the SLES service packs, so if `ncn-m002` is running a newer SLES distribution than `ncn-m001`, then a new `libcsm` must be downloaded. This will often be the case when jumping to a new CSM minor version (e.g. CSM 1.3 to CSM 1.4). For example, if `ncn-m001` is running SLES15SP3 and `ncn-m002` is running SLES15SP4, then the SLES15SP4 `libcsm` is needed. Follow the Check for latest documentation guide again, but from `ncn-m002`.

```bash
rpm -Uvh --force /root/libcsm-latest.noarch.rpm
```
If this step was executed as a result of the management-nodes-rollout with CSM upgrade instructions, return to that procedure and continue with the next step. Otherwise, if performing an upgrade of only CSM, proceed to the next step.
**NOTE:** If Kubernetes encryption has been enabled via the Kubernetes Encryption Documentation, then back up the `/etc/cray/kubernetes/encryption` directory on the master node before upgrading, and restore the directory after the node has been upgraded.

(`ncn-m002#`) Upgrade `ncn-m001`.

```bash
/usr/share/doc/csm/upgrade/scripts/upgrade/ncn-upgrade-master-nodes.sh ncn-m001
```

**NOTE:** The `root` user password for the node may need to be reset after it is rebooted. Additionally, the `/etc/cray/kubernetes/encryption` directory should be restored if it was backed up. Once it is restored, the `kube-apiserver` on the rebuilt node should be restarted. See Kubernetes `kube-apiserver` Failing for details on how to restart the `kube-apiserver`.
weave and multus

Run the following command to complete the upgrade of the `weave` and `multus` manifest versions:

```bash
/srv/cray/scripts/common/apply-networking-manifests.sh
```
coredns anti-affinity

Run the following script to apply anti-affinity to `coredns` pods:

```bash
/usr/share/doc/csm/upgrade/scripts/k8s/apply-coredns-pod-affinity.sh
```
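For context, Kubernetes expresses pod anti-affinity of this kind in the deployment spec roughly as follows. This is a generic sketch keyed on the standard `k8s-app: kube-dns` label, not the literal stanza the script applies:

```yaml
# Generic sketch only; the script defines the actual values applied.
# Prefers spreading coredns pods across different nodes (hostname topology).
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              k8s-app: kube-dns
          topologyKey: kubernetes.io/hostname
```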
Run the following script to upgrade the Kubernetes control plane:

```bash
/usr/share/doc/csm/upgrade/scripts/k8s/upgrade_control_plane.sh
```

**NOTE:** `kubelet` has been upgraded already; ignore the warning to upgrade it. Additionally, if Kubernetes audit logging is enabled, local configuration changes will be lost, as the audit logging configuration will be reset to the defaults defined in Audit Logs.
etcd-operator

Uninstall the `cray-etcd-operator` Helm chart:

```bash
helm uninstall -n operators cray-etcd-operator
```
If this step was executed as part of the IUF Deploy Product steps, then return to the IUF Upgrade CSM and Additional Products with IUF procedure and complete the remaining steps under Deploy Product. Otherwise, proceed to the following topic.
For any typescripts that were started during this stage on `ncn-m002`, stop them with the `exit` command.
All Kubernetes nodes have been rebooted into the new image.
REMINDER: If the password for `ncn-m002` was reset during Stage 3.3, then also reset the password on `ncn-m001` at this time.
This stage is complete. Proceed to Validate CSM health during an upgrade.