Reminder: If any problems are encountered and the procedure or command output does not provide relevant guidance, see Relevant troubleshooting links for upgrade-related issues.
(ncn-m001#) If a typescript session is already running in the shell, then first stop it with the exit command.
(ncn-m001#) Start a typescript.
script -af /root/csm_upgrade.$(date +%Y%m%d_%H%M%S).stage_2_ncn-m001.txt
export PS1='\u@\H \D{%Y-%m-%d} \t \w # '
If additional shells are opened during this procedure, then record those with typescripts as well. When resuming a procedure after a break, always be sure that a typescript is running before proceeding.
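Because a resumed session should always be recorded, it can help to check for an active typescript before continuing. The following is a minimal sketch and not part of the CSM tooling: the function name in_typescript is hypothetical, and it assumes that the script command is the direct parent process of the recorded shell, which may not hold if further shells have been nested inside it.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of CSM): succeed if the given process
# (default: the parent of this shell) is named "script".
# Assumption: the recorded shell is a direct child of script(1).
in_typescript() {
    local pid=${1:-$PPID}
    [ "$(ps -o comm= -p "$pid" 2>/dev/null)" = "script" ]
}
```

For example, running in_typescript || echo "No typescript appears to be running" before resuming prints a warning when the check fails.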
(ncn-m001#) Run ncn-upgrade-master-nodes.sh for ncn-m002.
Follow output of the script carefully. The script will pause for manual interaction.
/usr/share/doc/csm/upgrade/scripts/upgrade/ncn-upgrade-master-nodes.sh ncn-m002
NOTE: The root user password for the node may need to be reset after it is rebooted.
Repeat the previous step for each other master node excluding ncn-m001, one at a time.
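As a planning aid, the sequence of master node upgrades can be listed before any of them is started. The sketch below only prints the commands, because the upgrade script pauses for manual interaction and must be run one node at a time. The function name master_upgrade_commands is hypothetical, and the node list passed to it is an assumption that must match the actual system.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print (do not run) one upgrade command per master
# node, skipping ncn-m001, which is upgraded later from ncn-m002.
master_upgrade_commands() {
    local script=/usr/share/doc/csm/upgrade/scripts/upgrade/ncn-upgrade-master-nodes.sh
    local node
    for node in "$@"; do
        [ "$node" = "ncn-m001" ] && continue
        printf '%s %s\n' "$script" "$node"
    done
}
```

Calling master_upgrade_commands ncn-m001 ncn-m002 ncn-m003 (where ncn-m003 is an assumed node name) prints one command line each for ncn-m002 and ncn-m003.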
Before starting Stage 2.2 - Worker node image upgrade, access the Argo UI to view the progress of this stage. Note that the progress for the current stage will not show up in Argo before the worker node image upgrade script has been started.
For more information, see Using the Argo UI and Using Argo Workflows.
NOTE: One of the Argo steps (wait-for-cfs) will prevent the upgrade of a worker node from proceeding if the CFS component status for that worker is in an Error state, and this must be fixed in order for the upgrade to continue. The following steps can be used to reset the component state in CFS (replace XNAME below with the XNAME for the worker node):
cray cfs components update --error-count 0 <XNAME>
cray cfs components update --state '[]' <XNAME>
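The two reset commands above can be wrapped in a small function so that both are always applied to the same component. This is a sketch under stated assumptions: the function name reset_cfs_component and the DRY_RUN variable are hypothetical, and the xname pattern is an assumed node xname format (for example x3000c0s7b0n0) that should be adjusted if the system uses a different form.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper around the two CFS reset commands shown above.
# With DRY_RUN set to a non-empty value, the commands are printed
# instead of being executed.
reset_cfs_component() {
    local xname=$1
    local run=""
    [ -n "${DRY_RUN:-}" ] && run="echo"
    # Assumed node xname format; not taken from the procedure itself.
    if ! [[ $xname =~ ^x[0-9]+c[0-9]+s[0-9]+b[0-9]+n[0-9]+$ ]]; then
        echo "unexpected xname format: ${xname}" >&2
        return 1
    fi
    $run cray cfs components update --error-count 0 "$xname" &&
        $run cray cfs components update --state '[]' "$xname"
}
```

Running DRY_RUN=1 reset_cfs_component followed by an xname is a way to review the commands before executing them.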
NOTE
When upgrading worker nodes which are running DVS, it is not recommended to simultaneously reboot compute nodes. This is to avoid restarting DVS clients and servers at the same time.
There are two options available for upgrading worker nodes.
(ncn-m001#) Run ncn-upgrade-worker-storage-nodes.sh for ncn-w001.
Follow output of the script carefully. The script will pause for manual interaction.
/usr/share/doc/csm/upgrade/scripts/upgrade/ncn-upgrade-worker-storage-nodes.sh ncn-w001
NOTE: The root user password for the node may need to be reset after it is rebooted.
Repeat the previous steps for each other worker node, one at a time.
Multiple workers can be upgraded simultaneously by passing them as a comma-separated list into the upgrade script.
In some cases, it is not possible to upgrade all workers in one request. It is the system administrator's responsibility to ensure that the following conditions are met:
If the system has more than five workers, then they cannot all be upgraded with a single request.
In this case, the upgrade should be split into multiple requests, with each request specifying no more than five workers.
No single upgrade request should include all of the worker nodes that have DVS running on them.
(ncn-m001#) An example of a single request to upgrade multiple worker nodes simultaneously:
/usr/share/doc/csm/upgrade/scripts/upgrade/ncn-upgrade-worker-storage-nodes.sh ncn-w002,ncn-w003,ncn-w004
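The five-worker limit per request can be applied mechanically by splitting the worker list into batches. The sketch below uses a hypothetical function batch_workers that prints one comma-separated batch per line. It does not check the DVS condition, so the administrator still has to confirm that no single batch contains every worker that runs DVS.

```shell
#!/usr/bin/env bash
# Hypothetical helper: split worker names into comma-separated batches.
# Usage: batch_workers MAX_PER_BATCH WORKER...
# Prints one batch per line; it does not run the upgrade script.
batch_workers() {
    local max=$1
    shift
    local batch="" count=0 w
    for w in "$@"; do
        if [ "$count" -eq "$max" ]; then
            printf '%s\n' "$batch"
            batch=""
            count=0
        fi
        batch="${batch:+${batch},}${w}"
        count=$((count + 1))
    done
    if [ -n "$batch" ]; then
        printf '%s\n' "$batch"
    fi
    return 0
}
```

Each printed line can then be passed as the single argument to ncn-upgrade-worker-storage-nodes.sh, one request at a time.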
ncn-m001 upgrade
By this point, all NCNs have been upgraded, except for ncn-m001. In the upgrade process so far, ncn-m001 has been the “stable node”, that is, the node from which the other nodes were upgraded. At this point, the upgrade procedure pivots to use ncn-m002 as the new “stable node”, in order to allow the upgrade of ncn-m001.
Remap rbd device from ncn-m001 to ncn-m002
(ncn-m001#) Remap the CSM release rbd device to ncn-m002.
This device was created in Stage 0.1 - Prepare assets.
source /opt/cray/csm/scripts/csm_rbd_tool/bin/activate
python /usr/share/doc/csm/scripts/csm_rbd_tool.py --rbd_action move --target_host ncn-m002
deactivate
IMPORTANT: This mounts the rbd device at /etc/cray/upgrade/csm on ncn-m002.
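Since later steps read files from /etc/cray/upgrade/csm on ncn-m002, it may be worth confirming there that the mount is present before relying on it. The following sketch is not part of the CSM tooling: the function name is_mounted is hypothetical, and it assumes a Linux-style mount table in which the second field is the mount point.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if DIR appears as a mount point in the
# mount table (default /proc/mounts; a different file may be given as
# the second argument, which is mainly useful for testing).
is_mounted() {
    local dir=$1
    local mounts=${2:-/proc/mounts}
    awk -v d="$dir" '$2 == d { found = 1 } END { exit !found }' "$mounts"
}
```

On ncn-m002, is_mounted /etc/cray/upgrade/csm returns success only if the directory is listed as a mount point.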
For any typescripts that were started earlier on ncn-m001, stop them with the exit command.
(ncn-m001#) Create an archive of the artifacts.
BACKUP_TARFILE="csm_upgrade.pre_m001_reboot_artifacts.$(date +%Y%m%d_%H%M%S).tgz"
ls -d \
/root/apply_csm_configuration.* \
/root/csm_upgrade.* \
/root/output.log 2>/dev/null |
sed 's_^/__' |
xargs tar -C / -czvf "/root/${BACKUP_TARFILE}"
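If none of the listed files exist, the pipeline above may not produce a usable archive, so a quick check before uploading can be helpful. This is a minimal sketch with a hypothetical function name, archive_has_entries, which only confirms that the file is a readable gzip-compressed tar archive with at least one entry.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if TARFILE exists, is non-empty, and
# lists at least one entry when read as a gzip-compressed tar archive.
archive_has_entries() {
    local tarfile=$1
    local n
    [ -s "$tarfile" ] || return 1
    n=$(tar -tzf "$tarfile" 2>/dev/null | wc -l) || return 1
    [ "$n" -gt 0 ]
}
```

For example, archive_has_entries "/root/${BACKUP_TARFILE}" || echo "Archive is missing or empty" reports a problem before the upload step is attempted.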
(ncn-m001#) Upload the archive to S3 in the cluster.
cray artifacts create config-data "${BACKUP_TARFILE}" "/root/${BACKUP_TARFILE}"
Log out of ncn-m001.
Log in to ncn-m002 from outside the cluster.
NOTE: Very rarely, a password hash for the root user that works properly on a SLES SP2 NCN is not recognized on a SLES SP3 NCN. If password login fails, then log in to ncn-m002 from ncn-m001 and use the passwd command to reset the password. Then log in using the CMN IP address as directed below. Once ncn-m001 has been upgraded, log in from ncn-m002 and use the passwd command to reset the password. The other NCNs will have their passwords updated when NCN personalization is run in a subsequent step.
ssh to the bond0.cmn0/CMN IP address of ncn-m002.
(ncn-m002#) Start a typescript.
script -af /root/csm_upgrade.$(date +%Y%m%d_%H%M%S).stage_2_ncn-m002.txt
export PS1='\u@\H \D{%Y-%m-%d} \t \w # '
Authenticate with the Cray CLI on ncn-m002.
See Configure the Cray Command Line Interface for details on how to do this.
(ncn-m002#) Set upgrade variables.
source /etc/cray/upgrade/csm/myenv
echo "${CSM_REL_NAME}"
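The commands that follow depend on variables loaded from myenv, so an explicit check can catch an incomplete environment early. The sketch below uses a hypothetical function require_vars. Which variables myenv actually defines is an assumption here; CSM_REL_NAME and CSM_RELEASE are simply the names used by commands in this procedure.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report every named variable that is unset or
# empty, and return non-zero if any were found.
require_vars() {
    local name missing=0
    for name in "$@"; do
        if [ -z "${!name:-}" ]; then
            echo "variable ${name} is unset or empty" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

For example, require_vars CSM_REL_NAME CSM_RELEASE can be run immediately after sourcing myenv.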
(ncn-m002#) Copy artifacts from ncn-m001 and install them.
scp ncn-m001:/root/csm_upgrade.pre_m001_reboot_artifacts.*.tgz /root
zypper --plus-repo="/etc/cray/upgrade/csm/csm-${CSM_RELEASE}/tarball/csm-${CSM_RELEASE}/rpm/cray/csm/sle-$(awk -F= '/VERSION=/{gsub(/["-]/, "") ; print tolower($NF)}' /etc/os-release)" --no-gpg-checks install -y cray-site-init
scp ncn-m001:/root/*.noarch.rpm /root/
rpm -Uvh --force /root/docs-csm-latest.noarch.rpm
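The awk expression embedded in the zypper command derives the repository directory suffix from the VERSION line of /etc/os-release: it strips double quotes and hyphens and lowercases the result. The sketch below wraps the same expression in a hypothetical function, sle_version_suffix, which accepts an alternative file path so that the value can be inspected before zypper is run.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the version suffix that the zypper command
# above appends to "sle-", using the same awk expression.
# The optional argument overrides the default /etc/os-release path.
sle_version_suffix() {
    local osrelease=${1:-/etc/os-release}
    awk -F= '/VERSION=/{gsub(/["-]/, "") ; print tolower($NF)}' "$osrelease"
}
```

For a file whose VERSION line reads VERSION="15-SP3", this prints 15sp3, so the repository path would end in sle-15sp3.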
Upgrade ncn-m001.
/usr/share/doc/csm/upgrade/scripts/upgrade/ncn-upgrade-master-nodes.sh ncn-m001
Upgrade weave and multus
Run the following command to complete the upgrade of the weave and multus manifest versions:
/srv/cray/scripts/common/apply-networking-manifests.sh
coredns anti-affinity
Run the following script to apply anti-affinity to coredns pods:
/usr/share/doc/csm/upgrade/scripts/k8s/apply-coredns-pod-affinity.sh
Complete the Kubernetes upgrade. This script will restart several pods on each master node to their new Docker containers.
/usr/share/doc/csm/upgrade/scripts/k8s/upgrade_control_plane.sh
NOTE: kubelet has already been upgraded; ignore the warning to upgrade it.
For any typescripts that were started during this stage on ncn-m002, stop them with the exit command.
All Kubernetes nodes have been rebooted into the new image.
REMINDER: If the password for ncn-m002 was reset during Stage 2.3, then also reset the password on ncn-m001 at this time.
This stage is completed. Continue to Stage 3.