Management rollout

This section updates the software running on management NCNs.

1. Perform Slingshot switch firmware updates

Instructions to perform Slingshot switch firmware updates are provided in the “Upgrade Slingshot Switch Firmware in a CSM environment” section of the HPE Slingshot Operations Guide.

Once this step has completed:

  • Slingshot switch firmware has been updated

2. Update management host firmware (FAS)

Refer to Update Non-Compute Node (NCN) BIOS and BMC Firmware for details on how to upgrade the firmware on management nodes.

Once this step has completed:

  • Host firmware has been updated on management nodes

3. Execute the IUF management-nodes-rollout stage

This section describes how to update software on management nodes. It describes how to test a new image and CFS configuration on a single node first to ensure they work as expected before rolling the changes out to the other management nodes. This initial test node is referred to as the “canary node”. Modify the procedure as necessary to accommodate site preferences for rebuilding management nodes. The images and CFS configurations used are created by the prepare-images and update-cfs-config stages respectively; see the prepare-images Artifacts created documentation for details on how to query the images and CFS configurations and see the update-cfs-config documentation for details about how the CFS configuration is updated.

NOTE Additional arguments are available to control the behavior of the management-nodes-rollout stage, for example --limit-management-rollout and -cmrp. See the management-nodes-rollout stage documentation for details and adjust the examples below if necessary.
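
For example, a run that limits the rollout to worker nodes and raises the rollout concurrency might look like the following (illustrative only; the -cmrp value shown is an assumption, so consult the stage documentation for valid values before using it):

  iuf -a "${ACTIVITY_NAME}" run -r management-nodes-rollout --limit-management-rollout Management_Worker -cmrp 50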

IMPORTANT There is a different procedure for management-nodes-rollout depending on whether or not CSM is being upgraded. The two procedures differ in the handling of NCN storage nodes and NCN master nodes. If CSM is not being upgraded, then NCN storage nodes and NCN master nodes will not be upgraded with new images and will be updated by the CFS configuration created in update-cfs-config only. If CSM is being upgraded, the NCN storage nodes and NCN master nodes will be upgraded with new images and the new CFS configuration. Both procedures use the same steps for rebuilding/upgrading NCN worker nodes. Select one of the following procedures based on whether or not CSM is being upgraded:

3.1 management-nodes-rollout with CSM upgrade

All management nodes will be upgraded to a new image because CSM itself is being upgraded. All management nodes, excluding ncn-m001, will be upgraded with IUF. ncn-m001 will be upgraded with manual commands. This section describes how to test a new image and CFS configuration on a single canary node first before rolling it out to the other management nodes of the same management type. Follow the steps below to upgrade all management nodes.

  1. The “Install and Upgrade Framework” section of each individual product’s installation document may contain special actions that need to be performed outside of IUF for a stage. The “IUF Stage Documentation Per Product” section of the HPE Cray EX System Software Stack Installation and Upgrade Guide for CSM (S-8052) provides a table that summarizes which product documents contain information or actions for the management-nodes-rollout stage. Refer to that table and any corresponding product documents before continuing to the next step.

  2. Perform the NCN storage node upgrades. This upgrades a single storage node first to test the storage node image and then upgrades the remaining storage nodes.

    NOTE The management-nodes-rollout stage creates additional separate Argo workflows when rebuilding NCN storage nodes. The Argo workflow names will include the string ncn-lifecycle-rebuild. If monitoring progress with the Argo UI, remember to include these workflows.
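
    If monitoring from the command line instead, the workflows can be listed with a command like the one below (a minimal sketch, assuming the Argo workflows run in the argo namespace as in other CSM procedures):

      kubectl -n argo get workflows | grep ncn-lifecycle-rebuild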

    1. (ncn-m001#) Execute the management-nodes-rollout stage with a single NCN storage node.

      STORAGE_CANARY=ncn-s001
      
      iuf -a "${ACTIVITY_NAME}" run -r management-nodes-rollout --limit-management-rollout ${STORAGE_CANARY}
      
    2. (ncn-m#) Verify that the storage canary node booted successfully with the desired CFS configuration.

      XNAME=$(ssh $STORAGE_CANARY 'cat /etc/cray/xname')
      echo "${XNAME}"
      cray cfs components describe "${XNAME}"
      

      The desired value for configuration_status is configured. If it is pending, then wait for the status to change to configured.
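
      If preferred, the status can be polled until it reaches configured. The sketch below reuses the XNAME variable set above and the configurationStatus field name used by the verification loops elsewhere in this procedure; adjust the field name if the CFS API on the system returns snake_case keys:

      while [ "$(cray cfs components describe "${XNAME}" --format json | jq -r '.configurationStatus')" != "configured" ]; do
        echo "Waiting for ${XNAME} to reach configured..."
        sleep 30
      done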

    3. (ncn-m001#) Upgrade the remaining NCN storage nodes once the canary node has upgraded successfully. This upgrades the NCN storage nodes serially. Get the list of remaining storage nodes from the cluster and verify that it is correct; the storage canary node should not be in the list since it has already been upgraded. The list can be entered manually if it is not desired to upgrade all of the remaining storage nodes.

      STORAGE_NODES="$(ceph orch host ls | grep ncn-s | grep -v "$STORAGE_CANARY" | awk '{print $1}' | xargs echo)"
      echo "$STORAGE_NODES"
      
      iuf -a "${ACTIVITY_NAME}" run -r management-nodes-rollout --limit-management-rollout ${STORAGE_NODES}
      
    4. (ncn-m001#) Verify that all storage nodes configured successfully.

      for ncn in $(cray hsm state components list --subrole Storage --type Node \
         --format json | jq -r .Components[].ID | grep b0n | sort); do cray cfs components describe \
         $ncn --format json | jq -r ' .id+" "+.desiredConfig+" status="+.configurationStatus'; done
      
  3. Perform the NCN master node upgrade on ncn-m002 and ncn-m003.

    NOTE If Kubernetes encryption has been enabled via the Kubernetes Encryption Documentation, then backup the /etc/cray/kubernetes/encryption directory on the master node before upgrading and restore the directory after the node has been upgraded.
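
    A minimal sketch of the backup and restore is shown below; it uses ncn-m002 as the example node and /root/encryption-backup-ncn-m002 on ncn-m001 as an assumed backup location, so adapt it to site conventions:

      # Before upgrading the master node, copy the directory to ncn-m001
      mkdir -p /root/encryption-backup-ncn-m002
      scp -rp ncn-m002:/etc/cray/kubernetes/encryption /root/encryption-backup-ncn-m002/

      # After the node has been upgraded, copy the contents back into place
      ssh ncn-m002 'mkdir -p /etc/cray/kubernetes/encryption'
      scp -rp /root/encryption-backup-ncn-m002/encryption/* ncn-m002:/etc/cray/kubernetes/encryption/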

    1. Invoke iuf run with -r to execute the management-nodes-rollout stage on ncn-m002. This will rebuild ncn-m002 with the new CFS configuration and image built in previous steps of the workflow.

      (ncn-m001#) Execute the management-nodes-rollout stage with ncn-m002.

      iuf -a "${ACTIVITY_NAME}" run -r management-nodes-rollout --limit-management-rollout ncn-m002
      

      NOTE The /etc/cray/kubernetes/encryption directory should be restored if it was backed up. Once it is restored, the kube-apiserver on the rebuilt node should be restarted. See Kubernetes kube-apiserver Failing for details on how to restart the kube-apiserver.

    2. (ncn-m#) Verify that ncn-m002 booted successfully with the desired image and CFS configuration.

      XNAME=$(ssh ncn-m002 'cat /etc/cray/xname')
      echo "${XNAME}"
      cray cfs components describe "${XNAME}"
      
    3. Invoke iuf run with -r to execute the management-nodes-rollout stage on ncn-m003. This will rebuild ncn-m003 with the new CFS configuration and image built in previous steps of the workflow.

      (ncn-m001#) Execute the management-nodes-rollout stage with ncn-m003.

      iuf -a "${ACTIVITY_NAME}" run -r management-nodes-rollout --limit-management-rollout ncn-m003
      

      NOTE The /etc/cray/kubernetes/encryption directory should be restored if it was backed up. Once it is restored, the kube-apiserver on the rebuilt node should be restarted.

    4. (ncn-m#) Verify that ncn-m003 booted successfully with the desired image and CFS configuration.

      XNAME=$(ssh ncn-m003 'cat /etc/cray/xname')
      echo "${XNAME}"
      cray cfs components describe "${XNAME}"
      
  4. Perform the NCN worker node upgrade. To upgrade worker nodes, follow the procedure in section 3.3 NCN worker nodes and then return to this procedure to complete the next step.

  5. Upgrade ncn-m001.

    1. Follow the steps documented in Stage 3.3 - ncn-m001 upgrade. Stop before performing the step that upgrades ncn-m001 and return to this document.

    2. Get the image ID and CFS configuration created for NCN master nodes during the prepare-images and update-cfs-config stages. Follow the instructions in the prepare-images Artifacts created documentation to get the values for final_image_id and configuration for images with a configuration_group_name value matching Management_Master. These values will be needed for upgrading ncn-m001 in the following steps.
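
    As an illustrative sketch only: if the prepare-images artifact entries have been saved to a local file (a hypothetical artifacts.json containing a JSON array of the entries described in that documentation), a jq filter along these lines could pull out the two values:

      # Hypothetical: artifacts.json and its structure are assumptions; use the
      # documented query from the prepare-images Artifacts created documentation.
      jq -r '.[] | select(.configuration_group_name == "Management_Master")
             | "final_image_id=\(.final_image_id) configuration=\(.configuration)"' artifacts.json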

    3. Set the CFS configuration on ncn-m001.

      1. (ncn-m#) Set CFS_CONFIG_NAME to be the value for configuration found for Management_Master nodes in the second step.

        CFS_CONFIG_NAME=<appropriate configuration value>
        
      2. (ncn-m#) Get the xname of ncn-m001.

        XNAME=$(ssh ncn-m001 'cat /etc/cray/xname')
        echo "${XNAME}"
        
      3. (ncn-m#) Set the CFS configuration on ncn-m001.

        /usr/share/doc/csm/scripts/operations/configuration/apply_csm_configuration.sh \
        --no-config-change --config-name "${CFS_CONFIG_NAME}" --xnames "${XNAME}" --no-enable --no-clear-err
        

        The expected output is:

        All components updated successfully.
        
    4. Set the image in BSS for ncn-m001 by following the Set NCN boot image for ncn-m001 section of the Management nodes rollout stage documentation. Set the IMS_RESULTANT_IMAGE_ID variable to the final_image_id for Management_Master found in the second step.

    5. (ncn-m002#) Upgrade ncn-m001. This must be executed on ncn-m002.

      NOTE If Kubernetes encryption has been enabled via the Kubernetes Encryption Documentation, then backup the /etc/cray/kubernetes/encryption directory on the master node before upgrading and restore the directory after the node has been upgraded.

      /usr/share/doc/csm/upgrade/scripts/upgrade/ncn-upgrade-master-nodes.sh ncn-m001
      

      NOTE The /etc/cray/kubernetes/encryption directory should be restored if it was backed up. Once it is restored, the kube-apiserver on the rebuilt node should be restarted. See Kubernetes kube-apiserver Failing for details on how to restart the kube-apiserver.

  6. Follow the steps documented in Stage 3.4 - Upgrade weave and multus.

  7. Follow the steps documented in Stage 3.5 - coredns anti-affinity.

Once this step has completed:

  • All management NCNs have been upgraded to the image and CFS configuration created in the previous steps of this workflow
  • Per-stage product hooks have executed for the management-nodes-rollout stage

Continue to the next section 4. Restart goss-servers on all NCNs.

3.2 management-nodes-rollout without CSM upgrade

This is the procedure to roll out management nodes if CSM is not being upgraded. NCN worker node images contain kernel module content from non-CSM products and need to be rebuilt as part of the workflow. Unlike NCN worker nodes, NCN master nodes and storage nodes do not contain kernel module content from non-CSM products. However, user-space non-CSM product content is still provided on NCN master nodes and storage nodes, and thus the prepare-images and update-cfs-config stages create a new image and CFS configuration for them. The CFS configuration layers ensure the non-CSM product content is applied correctly for both image customization and node personalization scenarios. As a result, the administrator can update NCN master and storage nodes using CFS configuration only. Follow the steps below to complete the management-nodes-rollout stage.

  1. The “Install and Upgrade Framework” section of each individual product’s installation document may contain special actions that need to be performed outside of IUF for a stage. The “IUF Stage Documentation Per Product” section of the HPE Cray EX System Software Stack Installation and Upgrade Guide for CSM (S-8052) provides a table that summarizes which product documents contain information or actions for the management-nodes-rollout stage. Refer to that table and any corresponding product documents before continuing to the next step.

  2. Rebuild the NCN worker nodes. Follow the procedure in section 3.3 NCN worker nodes and then return to this procedure to complete the next step.

  3. Configure NCN master nodes.

    1. (ncn-m#) Create a comma-separated list of the xnames for all NCN master nodes and verify they are correct.

      MASTER_XNAMES=$(cray hsm state components list --role Management --subrole Master --type Node --format json | jq -r '.Components | map(.ID) | join(",")')
      echo "Master node xnames: $MASTER_XNAMES"
      
    2. Get the CFS configuration created for management nodes during the prepare-images and update-cfs-config stages. Follow the instructions in the prepare-images Artifacts created documentation to get the value for configuration for the image with a configuration_group_name value matching Management_Master.

    3. (ncn-m#) Set CFS_CONFIG_NAME to the value for configuration found in the previous step.

      CFS_CONFIG_NAME=<appropriate configuration value>
      
    4. (ncn-m#) Apply the CFS configuration to NCN master nodes.

      /usr/share/doc/csm/scripts/operations/configuration/apply_csm_configuration.sh \
      --no-config-change --config-name "${CFS_CONFIG_NAME}" --xnames $MASTER_XNAMES --clear-state
      

      Sample output for configuring multiple management nodes is:

      Taking snapshot of existing management-23.11.0 configuration to /root/apply_csm_configuration.20240305_173700.vKxhqC backup-management-23.11.0.json
      Setting desired configuration, clearing state, clearing error count, enabling components in CFS
      desiredConfig = "management-23.11.0"
      enabled = true
      errorCount = 0
      id = "x3700c0s16b0n0"
      state = []
      
      [tags]
      
      desiredConfig = "management-23.11.0"
      enabled = true
      errorCount = 0
      id = "x3701c0s16b0n0"
      state = []
      
      [tags]
      
      desiredConfig = "management-23.11.0"
      enabled = true
      errorCount = 0
      id = "x3702c0s16b0n0"
      state = []
      
      [tags]
      
      Waiting for configuration to complete. 3 components remaining.
      Configuration complete. 3 component(s) completed successfully.  0 component(s) failed.
      
  4. Configure NCN storage nodes.

    1. (ncn-m#) Create a comma-separated list of the xnames for all NCN storage nodes and verify they are correct.

      STORAGE_XNAMES=$(cray hsm state components list --role Management --subrole Storage --type Node --format json | jq -r '.Components | map(.ID) | join(",")')
      echo "Storage node xnames: $STORAGE_XNAMES"
      
    2. Get the CFS configuration created for management storage nodes during the prepare-images and update-cfs-config stages. Follow the instructions in the prepare-images Artifacts created documentation to get the value for configuration for the image with a configuration_group_name value matching Management_Storage.

    3. (ncn-m#) Set CFS_CONFIG_NAME to the value for configuration found in the previous step.

      CFS_CONFIG_NAME=<appropriate configuration value>
      
    4. (ncn-m#) Apply the CFS configuration to NCN storage nodes.

      /usr/share/doc/csm/scripts/operations/configuration/apply_csm_configuration.sh \
      --no-config-change --config-name "${CFS_CONFIG_NAME}" --xnames $STORAGE_XNAMES --clear-state
      

      Sample output for configuring multiple management nodes is:

      Taking snapshot of existing minimal-management-23.11.0 configuration to /root/apply_csm_configuration.20240305_173700.vKxhqC backup-minimal-management-23.11.0.json
      Setting desired configuration, clearing state, clearing error count, enabling components in CFS
      desiredConfig = "minimal-management-23.11.0"
      enabled = true
      errorCount = 0
      id = "x3700c0s16b0n0"
      state = []
      
      [tags]
      
      desiredConfig = "minimal-management-23.11.0"
      enabled = true
      errorCount = 0
      id = "x3701c0s16b0n0"
      state = []
      
      [tags]
      
      desiredConfig = "minimal-management-23.11.0"
      enabled = true
      errorCount = 0
      id = "x3702c0s16b0n0"
      state = []
      
      [tags]
      
      Waiting for configuration to complete. 3 components remaining.
      Configuration complete. 3 component(s) completed successfully.  0 component(s) failed.
      

Once this step has completed:

  • Management NCN worker nodes have been rebuilt with the image and CFS configuration created in previous steps of this workflow
  • Management NCN storage and NCN master nodes have been updated with the CFS configuration created in the previous steps of this workflow
  • Per-stage product hooks have executed for the management-nodes-rollout stage

Continue to the next section 4. Restart goss-servers on all NCNs.

3.3 NCN worker nodes

NCN worker node images contain kernel module content from non-CSM products and need to be rebuilt as part of the workflow. This section describes how to test a new image and CFS configuration on a single canary node first (ncn-w001 in the examples below) before rolling them out to the other NCN worker nodes. Modify the procedure as necessary to accommodate site preferences for rebuilding NCN worker nodes.

The images and CFS configurations used are created by the prepare-images and update-cfs-config stages respectively; see the prepare-images Artifacts created documentation for details on how to query the images and CFS configurations and see the update-cfs-config documentation for details about how the CFS configuration is updated.

NOTE The management-nodes-rollout stage creates additional separate Argo workflows when rebuilding NCN worker nodes. The Argo workflow names will include the string ncn-lifecycle-rebuild. If monitoring progress with the Argo UI, remember to include these workflows.

NOTE If upgrading from CSM 1.4 to CSM 1.5 with a COS release prior to 2.5.146 currently installed, a workaround is needed to roll out the management nodes. See the later subsection 3.3.1 DVS workaround upgrading from COS prior to 2.5.146. If the installed COS version is 2.5.146 or later, this is not needed.
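
A quick way to check which COS versions are installed is to inspect the product catalog; the sketch below assumes the cray-product-catalog ConfigMap in the services namespace and simply lists the version keys under the cos entry:

  kubectl get cm cray-product-catalog -n services -o jsonpath='{.data.cos}' | grep -E '^[0-9]'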

  1. The “Install and Upgrade Framework” section of each individual product’s installation document may contain special actions that need to be performed outside of IUF for a stage. The “IUF Stage Documentation Per Product” section of the HPE Cray EX System Software Stack Installation and Upgrade Guide for CSM (S-8052) provides a table that summarizes which product documents contain information or actions for the management-nodes-rollout stage. Refer to that table and any corresponding product documents before continuing to the next step.

  2. (ncn-m001#) Execute the management-nodes-rollout stage with a single NCN worker node. This will rebuild the canary node with the new CFS configuration and image built in previous steps of the workflow. The worker canary node can be any worker node and does not have to be ncn-w001.

    WORKER_CANARY=ncn-w001
    
    iuf -a "${ACTIVITY_NAME}" run -r management-nodes-rollout --limit-management-rollout ${WORKER_CANARY}
    
  3. (ncn-m#) Verify that the canary node booted successfully with the desired image and CFS configuration.

    XNAME=$(ssh $WORKER_CANARY 'cat /etc/cray/xname')
    echo "${XNAME}"
    cray cfs components describe "${XNAME}"
    
  4. (ncn-m001#) Use kubectl to apply the iuf-prevent-rollout=true label to the canary node to prevent it from unnecessarily rebuilding again.

    kubectl label nodes "${WORKER_CANARY}" --overwrite iuf-prevent-rollout=true
    
  5. (ncn-m001#) Verify the IUF node labels are present on the desired node.

    kubectl get nodes --show-labels | grep iuf-prevent-rollout
    
  6. (ncn-m001#) Execute the management-nodes-rollout stage on all remaining worker nodes.

    NOTE For this step, the argument to --limit-management-rollout can be Management_Worker or a list of worker node names separated by spaces. If Management_Worker is supplied, all worker nodes that are not labeled with iuf-prevent-rollout=true will be rebuilt/upgraded. If a list of worker node names is supplied, then those worker nodes will be rebuilt/upgraded.

    Choose one of the following two options; they accomplish the same result and differ only in the argument passed to --limit-management-rollout.

    1. (ncn-m001#) Execute management-nodes-rollout on all Management_Worker nodes.

      iuf -a "${ACTIVITY_NAME}" run -r management-nodes-rollout --limit-management-rollout Management_Worker
      
    2. (ncn-m001#) Execute management-nodes-rollout on a group of worker nodes. The list of worker nodes can be manually edited if it is undesirable to rebuild/upgrade all of the workers with one execution.

      WORKER_NODES=$(kubectl get node | grep -P 'ncn-w\d+' | grep -v $WORKER_CANARY |  awk '{print $1}' | xargs)
      echo $WORKER_NODES
      
      iuf -a "${ACTIVITY_NAME}" run -r management-nodes-rollout --limit-management-rollout $WORKER_NODES
      
  7. (ncn-m001#) Use kubectl to remove the iuf-prevent-rollout=true label from the canary node.

    kubectl label nodes "${WORKER_CANARY}" --overwrite iuf-prevent-rollout-
    
  8. (ncn-m001#) Verify that all worker nodes configured successfully.

    for ncn in $(cray hsm state components list --subrole Worker --type Node \
      --format json | jq -r .Components[].ID | grep b0n | sort); do cray cfs components describe \
      $ncn --format json | jq -r ' .id+" "+.desiredConfig+" status="+.configurationStatus'; done
    

Once this step has completed:

  • Management NCN worker nodes have been rebuilt with the image and CFS configuration created in previous steps of this workflow
  • Per-stage product hooks have executed for the management-nodes-rollout stage

Return to the procedure that was being followed for management-nodes-rollout to complete the next step, either Management-nodes-rollout with CSM upgrade or Management-nodes-rollout without CSM upgrade.

3.3.1 DVS workaround upgrading from COS prior to 2.5.146

If a COS release prior to 2.5.146 is installed when upgrading to CSM 1.5, the management rollout in this step may hang. The workaround is to copy the new version of the DVS prechecks_for_worker_reboots script to all NCN worker nodes as /opt/cray/shasta/cos/bin/prechecks_for_worker_reboots.

This is to be run on ncn-m001 during step 3.3 NCN worker nodes.

The new version of the script may be found in the cray-dvs-csm rpm in the USS CSM tar file in the upgrade’s media directory. The workaround is to extract the script from the rpm to a temporary directory and then copy it to the worker nodes.

It should be copied to the canary node when that node is being rebuilt, and to the remaining worker nodes after the canary node boot has succeeded.

  1. (ncn-m001#) Set an environment variable to the media directory, if not already set.

    echo $MEDIA_DIR
    
    MEDIA_DIR=/etc/cray/upgrade/csm/media/<directory>
    
  2. (ncn-m001#) Optionally, create and change to a temporary directory in which to extract the new version of the script.

    mkdir /tmp/upgrade-prechecks_WAR
    cd /tmp/upgrade-prechecks_WAR
    
  3. (ncn-m001#) Extract the cray-dvs-csm rpm that’s included in the USS image:

    rpm2cpio < $MEDIA_DIR/uss-*-csm-1.5/rpms/uss-*-csm-1.5/x86_64/cray-dvs-csm-*.x86_64.rpm | cpio -i --make-directories --no-absolute-filenames
    
  4. (ncn-m001#) Install the new version of the script onto all of the worker nodes. This is one way to do that:

    SSH_OPTIONS='-o StrictHostKeyChecking=no -o ConnectTimeout=15 -o LogLevel=ERROR -o UserKnownHostsFile=/dev/null'
    for name in $(kubectl get node | grep -P 'ncn-w\d+' | awk '{print $1}'); do
        scp -p $SSH_OPTIONS opt/cray/shasta/cne/bin/prechecks_for_worker_reboots $name:/opt/cray/shasta/cos/bin/prechecks_for_worker_reboots
    done
    
  5. (ncn-m001#) Optionally, remove the temporary directory.

    cd ..
    rm -rf upgrade-prechecks_WAR
    

After completing this workaround, return to 3.3 NCN worker nodes to roll out worker nodes.

4. Restart goss-servers on all NCNs

The goss-servers service needs to be restarted on all NCNs. This ensures the correct tests are run on each NCN. This is necessary due to a timing issue that is fixed in CSM 1.6.1.

(ncn-m001#) Restart goss-servers.

ncn_nodes=$(grep -oP "(ncn-s\w+|ncn-m\w+|ncn-w\w+)" /etc/hosts | sort -u | tr -t '\n' ',')
ncn_nodes=${ncn_nodes%,}
pdsh -S -b -w $ncn_nodes 'systemctl restart goss-servers'
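
(ncn-m001#) Optionally, verify that the service is active on all NCNs; this reuses the ncn_nodes list built above:

pdsh -S -b -w $ncn_nodes 'systemctl is-active goss-servers'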

Continue to the next section 5. Update ceph node-exporter config for SNMP counters.

5. Update ceph node-exporter config for SNMP counters

OPTIONAL: This step may be skipped.

This uses the netstat collector from node-exporter and enables monitoring of all the SNMP counters in /proc/net/snmp on NCN nodes.

See Update ceph node-exporter configuration to update the ceph node-exporter configuration to monitor SNMP counters.
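
After the configuration has been updated, one way to confirm that netstat metrics are being exposed is to query node-exporter on an NCN; the sketch below assumes node-exporter's default port 9100, so adjust it if the Ceph-managed node-exporter listens elsewhere:

ssh ncn-s001 "curl -s http://localhost:9100/metrics | grep -c '^node_netstat_'"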

Continue to the next section 6. Update management host Slingshot NIC firmware.

6. Update management host Slingshot NIC firmware

If new Slingshot NIC firmware was provided, refer to the “200Gbps NIC Firmware Management” section of the HPE Slingshot Operations Guide for details on how to update NIC firmware on management nodes.

After updating management host Slingshot NIC firmware, all nodes where the firmware was updated must be power cycled. Follow the reboot NCNs procedure for all nodes where the firmware was updated.

Once this step has completed:

  • Management host Slingshot NIC firmware has been updated on nodes where new firmware was provided

7. Next steps