Reminder: If any problems are encountered and the procedure or command output does not provide relevant guidance, see Relevant troubleshooting links for upgrade-related issues.
(ncn-m001#) If a typescript session is already running in the shell, then first stop it with the exit command.
(ncn-m001#) Start a typescript.
script -af /root/csm_upgrade.$(date +%Y%m%d_%H%M%S).stage_3_ncn-m001.txt
export PS1='\u@\H \D{%Y-%m-%d} \t \w # '
If additional shells are opened during this procedure, then record those with typescripts as well. When resuming a procedure after a break, always be sure that a typescript is running before proceeding.
NOTE
If Kubernetes encryption has been enabled via the Kubernetes Encryption Documentation, then back up the /etc/cray/kubernetes/encryption directory on the master node before upgrading and restore the directory after the node has been upgraded.
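The backup and restore described in this note could be sketched as follows. The helper names backup_dir and restore_dir and the archive location are assumptions made for illustration; they are not part of the CSM tooling. Because the node is rebuilt during the upgrade, the archive should presumably be kept somewhere other than the node being upgraded.

```shell
# Sketch only: archive a directory so that it can be restored after a rebuild.
# backup_dir and restore_dir are hypothetical helper names.

# backup_dir SRC_DIR ARCHIVE: create a gzip tarball of SRC_DIR.
backup_dir() {
    local src="$1" archive="$2"
    [ -d "$src" ] || { echo "no such directory: $src" >&2; return 1; }
    # Strip the leading slash so that tar stores a path relative to /.
    tar -C / -czf "$archive" "${src#/}"
}

# restore_dir ARCHIVE [ROOT]: unpack the tarball under ROOT (default: /).
restore_dir() {
    local archive="$1" root="${2:-/}"
    tar -C "$root" -xzf "$archive"
}
```

For example, backup_dir /etc/cray/kubernetes/encryption /root/encryption.tgz would run on the master node before the upgrade, and restore_dir /root/encryption.tgz after the archive has been copied back to the rebuilt node (the archive path here is hypothetical).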
(ncn-m001#) Run ncn-upgrade-master-nodes.sh for ncn-m002.
Follow the output of the script carefully. The script will pause for manual interaction.
/usr/share/doc/csm/upgrade/scripts/upgrade/ncn-upgrade-master-nodes.sh ncn-m002
NOTE
The root user password for the node may need to be reset after it is rebooted. Additionally, the /etc/cray/kubernetes/encryption directory should be restored if it was backed up. Once it is restored, the kube-apiserver on the rebuilt node should be restarted. See Kubernetes kube-apiserver Failing for details on how to restart the kube-apiserver.
Repeat the previous step for each other master node excluding ncn-m001, one at a time.
Before starting Stage 3.2 - Worker node image upgrade, access the Argo UI to view the progress of this stage. Note that the progress for the current stage will not show up in Argo before the worker node image upgrade script has been started.
For more information, see Using the Argo UI and Using Argo Workflows.
NOTE
One of the Argo steps (wait-for-cfs) will prevent the upgrade of a worker node from proceeding if the CFS component status for that worker is in an Error state, and this must be fixed in order for the upgrade to continue. The following steps can be used to reset the component state in CFS (replace XNAME below with the XNAME for the worker node):
cray cfs components update --error-count 0 <XNAME>
cray cfs components update --state '[]' <XNAME>
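If several workers are in an Error state, the two commands above could be wrapped in a small loop. This is a sketch only: the function name reset_cfs_components and the DRY_RUN variable are hypothetical, and the xnames in the usage note are placeholders. Setting DRY_RUN prints the commands instead of running them, which allows the list to be reviewed first.

```shell
# Sketch only: apply the two CFS reset commands to each xname given.
# reset_cfs_components and DRY_RUN are hypothetical names, not CSM tooling.
reset_cfs_components() {
    local xname
    for xname in "$@"; do
        # When DRY_RUN is set and non-empty, each command is echoed, not run.
        ${DRY_RUN:+echo} cray cfs components update --error-count 0 "$xname"
        ${DRY_RUN:+echo} cray cfs components update --state '[]' "$xname"
    done
}
```

For example, DRY_RUN=1 reset_cfs_components x3000c0s7b0n0 x3000c0s9b0n0 prints the four commands that would be run for those two placeholder xnames.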
NOTE
When upgrading worker nodes which are running DVS, it is not recommended to simultaneously reboot compute nodes. This is to avoid restarting DVS clients and servers at the same time.
There are two options available for upgrading worker nodes.
(ncn-m001#) Run ncn-upgrade-worker-storage-nodes.sh for ncn-w001.
Follow the output of the script carefully. The script will pause for manual interaction.
/usr/share/doc/csm/upgrade/scripts/upgrade/ncn-upgrade-worker-storage-nodes.sh ncn-w001
NOTE
The root user password for the node may need to be reset after it is rebooted.
Repeat the previous steps for each other worker node, one at a time.
Multiple workers can be upgraded simultaneously by passing them as a comma-separated list into the upgrade script.
In some cases, it is not possible to upgrade all workers in one request. It is the system administrator’s responsibility to make sure that the following conditions are met:
If the system has more than five workers, then they cannot all be upgraded with a single request.
In this case, the upgrade should be split into multiple requests, with each request specifying no more than five workers.
No single upgrade request should include all of the worker nodes that have DVS running on them.
(ncn-m001#) An example of a single request to upgrade multiple worker nodes simultaneously:
/usr/share/doc/csm/upgrade/scripts/upgrade/ncn-upgrade-worker-storage-nodes.sh ncn-w002,ncn-w003,ncn-w004
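The five-worker limit described above can be handled by splitting the worker list into batches before calling the upgrade script. The following is a minimal sketch; the function name batch_workers is hypothetical. It only enforces the batch size and does not check the DVS condition, so the resulting batches still need to be reviewed to make sure that no single request contains every worker running DVS.

```shell
# Sketch only: print comma-separated batches of at most SIZE node names.
# batch_workers is a hypothetical helper, not part of the CSM tooling.
# Usage: batch_workers SIZE NODE [NODE...]
batch_workers() {
    local size="$1"
    shift
    local batch="" count=0 node
    for node in "$@"; do
        # Append the node, adding a comma only when the batch is non-empty.
        batch="${batch:+${batch},}${node}"
        count=$((count + 1))
        if [ "$count" -eq "$size" ]; then
            echo "$batch"
            batch=""
            count=0
        fi
    done
    # Print the final, partially filled batch if there is one.
    [ -n "$batch" ] && echo "$batch"
    return 0
}
```

Each output line can then be passed as the single argument to ncn-upgrade-worker-storage-nodes.sh, one request at a time.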
ncn-m001 upgrade
By this point, all NCNs have been upgraded, except for ncn-m001. In the upgrade process so far, ncn-m001 has been the “stable node” (that is, the node from which the other nodes were upgraded). At this point, the upgrade procedure pivots to use ncn-m002 as the new “stable node”, in order to allow the upgrade of ncn-m001.
For any typescripts that were started earlier on ncn-m001, stop them with the exit command.
(ncn-m001#) Create an archive of the artifacts.
BACKUP_TARFILE="csm_upgrade.pre_m001_reboot_artifacts.$(date +%Y%m%d_%H%M%S).tgz"
ls -d \
/root/apply_csm_configuration.* \
/root/csm_upgrade.* \
/root/output.log 2>/dev/null |
sed 's_^/__' |
xargs tar -C / -czvf "/root/${BACKUP_TARFILE}"
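Because the file list above is built from globs whose errors are discarded, it can be worth confirming that the archive exists and contains the expected files before relying on it. A minimal sketch, using a hypothetical helper name count_archive_entries:

```shell
# Sketch only: print the number of entries in a gzip-compressed tarball.
# count_archive_entries is a hypothetical helper, not part of the CSM tooling.
count_archive_entries() {
    # tar -t lists one entry per line; tr strips the padding some wc builds add.
    tar -tzf "$1" | wc -l | tr -d ' '
}
```

For example, count_archive_entries "/root/${BACKUP_TARFILE}" should print a non-zero number, and tar -tzf on the same file lists the individual entries.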
(ncn-m001#) Upload the archive to S3 in the cluster.
cray artifacts create config-data "${BACKUP_TARFILE}" "/root/${BACKUP_TARFILE}"
Log out of ncn-m001.
Log in to ncn-m002 from outside the cluster.
NOTE
Very rarely, a password hash for the root user that works properly on a SLES SP2 NCN is not recognized on a SLES SP3 NCN. If password login fails, then log in to ncn-m002 from ncn-m001 and use the passwd command to reset the password. Then log in using the CMN IP address as directed below. Once ncn-m001 has been upgraded, log in from ncn-m002 and use the passwd command to reset the password. The other NCNs will have their passwords updated when NCN personalization is run in a subsequent step.
ssh to the bond0.cmn0/CMN IP address of ncn-m002.
(ncn-m002#) Start a typescript.
script -af /root/csm_upgrade.$(date +%Y%m%d_%H%M%S).stage_3_ncn-m002.txt
export PS1='\u@\H \D{%Y-%m-%d} \t \w # '
Authenticate with the Cray CLI on ncn-m002.
See Configure the Cray Command Line Interface for details on how to do this.
(ncn-m002#) Set upgrade variables.
source /etc/cray/upgrade/csm/myenv
(ncn-m002#) Copy artifacts from ncn-m001.
A later stage of the upgrade expects the docs-csm and libcsm RPMs to be located at /root/ on ncn-m002; that is why this command copies them there.
Install csi and docs-csm.
scp ncn-m001:/root/csm_upgrade.pre_m001_reboot_artifacts.*.tgz /root
zypper --plus-repo="/etc/cray/upgrade/csm/csm-${CSM_RELEASE}/tarball/csm-${CSM_RELEASE}/rpm/cray/csm/sle-$(awk -F= '/VERSION=/{gsub(/["-]/, "") ; print tolower($NF)}' /etc/os-release)" --no-gpg-checks install -y cray-site-init
scp ncn-m001:/root/*.noarch.rpm /root/
rpm -Uvh --force /root/docs-csm-latest.noarch.rpm
Install libcsm.
NOTE
Because libcsm depends on the Python versions included in the SLES service packs, a new libcsm must be downloaded if ncn-m002 is running a newer SLES distribution. This will often be the case when jumping to a new CSM minor version (for example, CSM 1.3 to CSM 1.4). For example, if ncn-m001 is running SLES15SP3 and ncn-m002 is running SLES15SP4, then the SLES15SP4 libcsm is needed. Follow the Check for latest documentation guide again, but from ncn-m002.
rpm -Uvh --force /root/libcsm-latest.noarch.rpm
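The zypper command above derives the last component of the repository path from /etc/os-release, and the libcsm note depends on knowing which SLES service pack each node is running. The following sketch isolates that awk expression so that its result can be inspected on its own. The function name sles_repo_dir is hypothetical; the awk program is the one used in the zypper command.

```shell
# Sketch only: print the sle-<version> directory name derived from an
# os-release file. sles_repo_dir is a hypothetical helper name.
# Usage: sles_repo_dir [OS_RELEASE_FILE]   (default: /etc/os-release)
sles_repo_dir() {
    local os_release="${1:-/etc/os-release}"
    local version
    # Match the VERSION= line (VERSION_ID= does not match), remove double
    # quotes and hyphens, and lowercase the value, so that a line such as
    # VERSION="15-SP4" becomes 15sp4.
    version="$(awk -F= '/VERSION=/{gsub(/["-]/, ""); print tolower($NF)}' "$os_release")"
    printf 'sle-%s\n' "$version"
}
```

Running sles_repo_dir on both ncn-m001 and ncn-m002 and comparing the results is one way to see whether the two nodes are on different service packs, which is the situation in which the note above calls for a newer libcsm.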
If this step was executed as a result of the management-nodes-rollout with CSM upgrade instructions, return to that procedure and continue with the next step.
Otherwise, if performing an upgrade of only CSM, proceed to the next step.
NOTE
If Kubernetes encryption has been enabled via the Kubernetes Encryption Documentation, then back up the /etc/cray/kubernetes/encryption directory on the master node before upgrading and restore the directory after the node has been upgraded.
Upgrade ncn-m001.
/usr/share/doc/csm/upgrade/scripts/upgrade/ncn-upgrade-master-nodes.sh ncn-m001
NOTE
The root user password for the node may need to be reset after it is rebooted. Additionally, the /etc/cray/kubernetes/encryption directory should be restored if it was backed up. Once it is restored, the kube-apiserver on the rebuilt node should be restarted. See Kubernetes kube-apiserver Failing for details on how to restart the kube-apiserver.
weave and multus
Run the following command to complete the upgrade of the weave and multus manifest versions:
/srv/cray/scripts/common/apply-networking-manifests.sh
coredns anti-affinity
Run the following script to apply anti-affinity to coredns pods:
/usr/share/doc/csm/upgrade/scripts/k8s/apply-coredns-pod-affinity.sh
/usr/share/doc/csm/upgrade/scripts/k8s/upgrade_control_plane.sh
NOTE
kubelet has already been upgraded, so ignore the warning to upgrade it.
etcd-operator
helm uninstall -n operators cray-etcd-operator
If this step was executed as part of the IUF Deploy Product steps, then return to the IUF Upgrade CSM and Additional Products with IUF and complete the remaining steps under Deploy Product. Otherwise, proceed to the following topic.
For any typescripts that were started during this stage on ncn-m002, stop them with the exit command.
All Kubernetes nodes have been rebooted into the new image.
REMINDER: If the password for ncn-m002 was reset during Stage 3.3, then also reset the password on ncn-m001 at this time.
This stage is completed. Proceed to Validate CSM health during an upgrade.