Stage 3 - Kubernetes Upgrade

Reminder: If any problems are encountered and the procedure or command output does not provide relevant guidance, see Relevant troubleshooting links for upgrade-related issues.

Start typescript on ncn-m001

  1. (ncn-m001#) If a typescript session is already running in the shell, then first stop it with the exit command.

  2. (ncn-m001#) Start a typescript.

    script -af /root/csm_upgrade.$(date +%Y%m%d_%H%M%S).stage_3_ncn-m001.txt
    export PS1='\u@\H \D{%Y-%m-%d} \t \w # '
    

If additional shells are opened during this procedure, then record those with typescripts as well. When resuming a procedure after a break, always be sure that a typescript is running before proceeding.

Stage 3.1 - Master node image upgrade

  1. (ncn-m001#) Run ncn-upgrade-master-nodes.sh for ncn-m002.

    Follow output of the script carefully. The script will pause for manual interaction.

    /usr/share/doc/csm/upgrade/scripts/upgrade/ncn-upgrade-master-nodes.sh ncn-m002
    

    NOTE The root user password for the node may need to be reset after it is rebooted.

  2. Repeat the previous step for each other master node excluding ncn-m001, one at a time.

Argo workflows

Before starting Stage 3.2 - Worker node image upgrade, access the Argo UI to view the progress of this stage. Note that the progress for the current stage will not show up in Argo before the worker node image upgrade script has been started.

For more information, see Using the Argo UI and Using Argo Workflows.

NOTE One of the Argo steps (wait-for-cfs) will prevent the upgrade of a worker node from proceeding if the CFS component status for that worker is in an Error state, and this must be fixed in order for the upgrade to continue. The following steps can be used to reset the component state in CFS (replace XNAME below with the XNAME for the worker node:

cray cfs components update --error-count 0 <XNAME>
cray cfs components update --state '[]' <XNAME>

Stage 3.2 - Worker node image upgrade

NOTE When upgrading worker nodes which are running DVS, it is not recommended to simultaneously reboot compute nodes. This is to avoid restarting DVS clients and servers at the same time.

There are two options available for upgrading worker nodes.

Option 1 - Serial upgrade

  1. (ncn-m001#) Run ncn-upgrade-worker-storage-nodes.sh for ncn-w001.

    Follow output of the script carefully. The script will pause for manual interaction.

    /usr/share/doc/csm/upgrade/scripts/upgrade/ncn-upgrade-worker-storage-nodes.sh ncn-w001
    

    NOTE The root user password for the node may need to be reset after it is rebooted.

  2. Repeat the previous steps for each other worker node, one at a time.

Option 2 - Parallel upgrade (Tech preview)

Multiple workers can be upgraded simultaneously by passing them as a comma-separated list into the upgrade script.

Restrictions

In some cases, it is not possible to upgrade all workers in one request. It is system administrator’s responsibility to make sure that the following conditions are met:

  • If the system has more than five workers, then they cannot all be upgraded with a single request.

    In this case, the upgrade should be split into multiple requests, with each request specifying no more than five workers.

  • No single upgrade request should include all of the worker nodes that have DVS running on them.

Example

(ncn-m001#) An example of a single request to upgrade multiple worker nodes simultaneously:

/usr/share/doc/csm/upgrade/scripts/upgrade/ncn-upgrade-worker-storage-nodes.sh ncn-w002,ncn-w003,ncn-w004

Stage 3.3 - ncn-m001 upgrade

By this point, all NCNs have been upgraded, except for ncn-m001. In the upgrade process so far, ncn-m001 has been the “stable node” – that is, the node from which the other nodes were upgraded. At this point, the upgrade procedure pivots to use ncn-m002 as the new “stable node”, in order to allow the upgrade of ncn-m001.

Stop typescript on ncn-m001

For any typescripts that were started earlier on ncn-m001, stop them with the exit command.

Backup artifacts on ncn-m001

  1. (ncn-m001#) Create an archive of the artifacts.

    BACKUP_TARFILE="csm_upgrade.pre_m001_reboot_artifacts.$(date +%Y%m%d_%H%M%S).tgz"
    ls -d \
        /root/apply_csm_configuration.* \
        /root/csm_upgrade.* \
        /root/output.log 2>/dev/null |
    sed 's_^/__' |
    xargs tar -C / -czvf "/root/${BACKUP_TARFILE}"
    
  2. (ncn-m001#) Upload the archive to S3 in the cluster.

    cray artifacts create config-data "${BACKUP_TARFILE}" "/root/${BACKUP_TARFILE}"
    

Move to ncn-m002

  1. Log out of ncn-m001.

  2. Log in to ncn-m002 from outside the cluster.

    NOTE Very rarely, a password hash for the root user that works properly on a SLES SP2 NCN is not recognized on a SLES SP3 NCN. If password login fails, then log in to ncn-m002 from ncn-m001 and use the passwd command to reset the password. Then log in using the CMN IP address as directed below. Once ncn-m001 has been upgraded, log in from ncn-m002 and use the passwd command to reset the password. The other NCNs will have their passwords updated when NCN personalization is run in a subsequent step.

    ssh to the bond0.cmn0/CMN IP address of ncn-m002.

Start typescript on ncn-m002

  1. (ncn-m002#) Start a typescript.

    script -af /root/csm_upgrade.$(date +%Y%m%d_%H%M%S).stage_3_ncn-m002.txt
    export PS1='\u@\H \D{%Y-%m-%d} \t \w # '
    

Prepare ncn-m002

  1. Authenticate with the Cray CLI on ncn-m002.

    See Configure the Cray Command Line Interface for details on how to do this.

  2. (ncn-m002#) Set upgrade variables.

    source /etc/cray/upgrade/csm/myenv
    
  3. (ncn-m002#) Copy artifacts from ncn-m001.

    A later stage of the upgrade expects the docs-csm and libcsm RPMs to be located at /root/ on ncn-m002; that is why this command copies them there.

    • Install csi and docs-csm.

      scp ncn-m001:/root/csm_upgrade.pre_m001_reboot_artifacts.*.tgz /root
      zypper --plus-repo="/etc/cray/upgrade/csm/csm-${CSM_RELEASE}/tarball/csm-${CSM_RELEASE}/rpm/cray/csm/sle-$(awk -F= '/VERSION=/{gsub(/["-]/, "") ; print tolower($NF)}' /etc/os-release)" --no-gpg-checks install -y cray-site-init
      scp ncn-m001:/root/*.noarch.rpm /root/
      rpm -Uvh --force /root/docs-csm-latest.noarch.rpm
      
    • Install libcsm.

      NOTE Since libcsm depends on versions of Python relative to what is included in the SLES service packs, then in the event that ncn-m002 is running a newer SLES distro a new libcsm must be downloaded. This will often be the case when jumping to a new CSM minor version (e.g. CSM 1.3 to CSM 1.4). e.g. if ncn-m001 is running SLES15SP3, and ncn-m002 is running SLES15SP4 then the SLES15SP4 libcsm is needed. Follow the Check for latest documentation guide again, but from ncn-m002.

      rpm -Uvh --force /root/libcsm-latest.noarch.rpm
      

    If this step was executed as a result of the management-nodes-rollout with CSM upgrade instructions, return to that procedure and continue with the next step. Otherwise, if performing an upgrade of only CSM, proceed to the next step.

Upgrade ncn-m001

  1. Upgrade ncn-m001.

    /usr/share/doc/csm/upgrade/scripts/upgrade/ncn-upgrade-master-nodes.sh ncn-m001
    

Stage 3.4 - Upgrade weave and multus

Run the following command to complete the upgrade of the weave and multus manifest versions:

/srv/cray/scripts/common/apply-networking-manifests.sh

Stage 3.5 - coredns anti-affinity

Run the following script to apply anti-affinity to coredns pods:

/usr/share/doc/csm/upgrade/scripts/k8s/apply-coredns-pod-affinity.sh

Stage 3.6 - Complete Kubernetes upgrade

  • Complete the Kubernetes upgrade. This script will restart several pods on each master node to their new Docker containers.
/usr/share/doc/csm/upgrade/scripts/k8s/upgrade_control_plane.sh

NOTE: kubelet has been upgraded already, ignore the warning to upgrade it.

  • Uninstall the deprecated etcd-operator.
helm uninstall -n operators cray-etcd-operator

If this step was executed as part of the IUF Deploy Product steps, then return to the IUF Upgrade CSM and Additional Products with IUF and complete the remaining steps under Deploy Product. Otherwise, proceed to the following topic.

Stop typescript on ncn-m002

For any typescripts that were started during this stage on ncn-m002, stop them with the exit command.

Stage completed

All Kubernetes nodes have been rebooted into the new image.

REMINDER: If password for ncn-m002 was reset during Stage 3.3, then also reset the password on ncn-m001 at this time.

This stage is completed. Proceed to Validate CSM health during an upgrade