Management rollout

This section updates the software running on management NCNs.

1. Perform Slingshot switch firmware updates

Instructions to perform Slingshot switch firmware updates are provided in the “Upgrade Slingshot Switch Firmware on HPE Cray EX” section of the HPE Slingshot Operations Guide.

Once this step has completed:

  • Slingshot switch firmware has been updated

2. Update management host firmware (FAS)

Refer to Update Non-Compute Node (NCN) BIOS and BMC Firmware for details on how to upgrade the firmware on management nodes.

Once this step has completed:

  • Host firmware has been updated on management nodes

3. Execute the IUF management-nodes-rollout stage

This section describes how to update software on management nodes. It describes how to test a new image and CFS configuration on a single node first to ensure they work as expected before rolling the changes out to the other management nodes. This initial test node is referred to as the “canary node”. Modify the procedure as necessary to accommodate site preferences for rebuilding management nodes. The images and CFS configurations used are created by the prepare-images and update-cfs-config stages respectively; see the prepare-images Artifacts created documentation for details on how to query the images and CFS configurations and see the update-cfs-config documentation for details about how the CFS configuration is updated.

NOTE Additional arguments are available to control the behavior of the management-nodes-rollout stage, for example --limit-management-rollout and -cmrp. See the management-nodes-rollout stage documentation for details and adjust the examples below if necessary.
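
For example, a rollout limited to NCN worker nodes with reduced concurrency might be invoked as shown in the sketch below; the value passed to -cmrp is illustrative only and should be chosen to match site requirements.

(ncn-m001#) Example invocation with optional arguments (illustrative values).

iuf -a "${ACTIVITY_NAME}" run -r management-nodes-rollout --limit-management-rollout Management_Worker -cmrp 25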

IMPORTANT There is a different procedure for management-nodes-rollout depending on whether or not CSM is being upgraded. The two procedures differ in the handling of NCN master nodes. If CSM is not being upgraded, then NCN master nodes will not be upgraded with new images and will be updated by the CFS configuration created in update-cfs-config only. If CSM is being upgraded, the NCN master nodes will be upgraded with new images and the new CFS configuration. Both procedures use the same steps for rebuilding/upgrading NCN worker nodes and personalizing NCN storage nodes. Select one of the following procedures based on whether or not CSM is being upgraded:

3.1 management-nodes-rollout with CSM upgrade

NCN master nodes and NCN worker nodes will be upgraded to a new image because CSM itself is being upgraded. NCN master nodes, excluding ncn-m001, and NCN worker nodes will be upgraded with IUF; ncn-m001 will be upgraded with manual commands. NCN storage nodes are not upgraded as part of the CSM 1.3 to CSM 1.4 upgrade, but they will be personalized with a CFS configuration created during IUF. This section describes how to test the new image and CFS configuration on a single canary node first, for both NCN master nodes and NCN worker nodes, before rolling the changes out to the remaining master and worker nodes. Follow the steps below to upgrade NCN master and worker nodes and to personalize NCN storage nodes.

  1. The “Install and Upgrade Framework” section of each individual product’s installation document may contain special actions that need to be performed outside of IUF for a stage. The “IUF Stage Documentation Per Product” section of the HPE Cray EX System Software Stack Installation and Upgrade Guide for CSM (S-8052) provides a table that summarizes which product documents contain information or actions for the management-nodes-rollout stage. Refer to that table and any corresponding product documents before continuing to the next step.

  2. Personalize NCN storage nodes. Follow the procedure in section 3.4 Personalize NCN storage nodes and then return to this procedure to complete the next step.

  3. Perform the NCN master node upgrade on ncn-m002 and ncn-m003.

    1. Use kubectl to label ncn-m003 with iuf-prevent-rollout=true to ensure management-nodes-rollout only rebuilds the single NCN master node ncn-m002.

      (ncn-m001#) Label ncn-m003 to prevent it from rebuilding.

      kubectl label nodes "ncn-m003" --overwrite iuf-prevent-rollout=true
      

      (ncn-m001#) Verify the IUF node label is present on the desired node.

      kubectl get nodes --show-labels | grep iuf-prevent-rollout
      
    2. Invoke iuf run with -r to execute the management-nodes-rollout stage on ncn-m002. This will rebuild ncn-m002 with the new CFS configuration and image built in previous steps of the workflow.

      NOTE If Kubernetes encryption has been enabled via the Kubernetes Encryption Documentation, then back up the /etc/cray/kubernetes/encryption directory on the master node before upgrading and restore the directory after the node has been upgraded.

      (ncn-m001#) Execute the management-nodes-rollout stage with ncn-m002.

      iuf -a "${ACTIVITY_NAME}" run -r management-nodes-rollout --limit-management-rollout Management_Master
      

      NOTE The /etc/cray/kubernetes/encryption directory should be restored if it was backed up. Once it is restored, the kube-apiserver on the rebuilt node should be restarted. See Kubernetes kube-apiserver Failing for details on how to restart the kube-apiserver.
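
      The following is a minimal sketch of one way to preserve that directory across the rebuild, run from ncn-m001; it assumes root SSH access between master nodes, and the /root/encryption-backup path is an example only. Restart kube-apiserver afterward as described in Kubernetes kube-apiserver Failing.

      (ncn-m001#) Example backup and restore of the encryption directory (illustrative paths).

      # Before the upgrade: copy the directory from ncn-m002 to a scratch location on ncn-m001 (example path)
      mkdir -p /root/encryption-backup/ncn-m002
      rsync -a ncn-m002:/etc/cray/kubernetes/encryption/ /root/encryption-backup/ncn-m002/

      # After ncn-m002 has been rebuilt: copy the contents back, then restart kube-apiserver on ncn-m002
      rsync -a /root/encryption-backup/ncn-m002/ ncn-m002:/etc/cray/kubernetes/encryption/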

    3. Verify that ncn-m002 booted successfully with the desired image and CFS configuration.
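
      One way to spot-check this from the command line is sketched below; it assumes the cray CLI is initialized and authenticated, and the field names are those returned by the CFS v2 API. Compare the output against the configuration and final_image_id values recorded from the prepare-images and update-cfs-config stages.

      (ncn-m001#) Spot-check the CFS state and boot artifacts for ncn-m002 (optional sanity check).

      XNAME=$(ssh ncn-m002 'cat /etc/cray/xname')
      # The desired configuration should match the CFS configuration created by update-cfs-config,
      # and configurationStatus should eventually report "configured"
      cray cfs components describe "${XNAME}" --format json | jq '{desiredConfig, configurationStatus, errorCount}'
      # The kernel path in BSS should reference the image created by prepare-images
      cray bss bootparameters list --hosts "${XNAME}" --format json | jq -r '.[].kernel'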

    4. Use kubectl to remove the iuf-prevent-rollout=true label from ncn-m003 and add it to ncn-m002.

      (ncn-m001#) Label ncn-m002 to prevent it from rebuilding and remove the label from ncn-m003.

      kubectl label nodes "ncn-m002" --overwrite iuf-prevent-rollout=true
      kubectl label nodes "ncn-m003" --overwrite iuf-prevent-rollout-
      

      (ncn-m001#) Verify the IUF node label is present on the desired node.

      kubectl get nodes --show-labels | grep iuf-prevent-rollout
      
    5. Invoke iuf run with -r to execute the management-nodes-rollout stage on ncn-m003. This will rebuild ncn-m003 with the new CFS configuration and image built in previous steps of the workflow.

      NOTE If Kubernetes encryption has been enabled via the Kubernetes Encryption Documentation, then back up the /etc/cray/kubernetes/encryption directory on the master node before upgrading and restore the directory after the node has been upgraded.

      (ncn-m001#) Execute the management-nodes-rollout stage with ncn-m003.

      iuf -a "${ACTIVITY_NAME}" run -r management-nodes-rollout --limit-management-rollout Management_Master
      

      NOTE The /etc/cray/kubernetes/encryption directory should be restored if it was backed up. Once it is restored, the kube-apiserver on the rebuilt node should be restarted. See Kubernetes kube-apiserver Failing for details on how to restart the kube-apiserver.

    6. Use kubectl to remove the iuf-prevent-rollout=true label from ncn-m002.

      (ncn-m001#) Remove label from ncn-m002.

      kubectl label nodes "ncn-m002" --overwrite iuf-prevent-rollout-
      

      (ncn-m001#) Verify the IUF node label is no longer set on ncn-m002.

      kubectl get nodes --show-labels | grep iuf-prevent-rollout
      
  4. Perform the NCN worker node upgrade. To upgrade worker nodes, follow the procedure in section 3.3 NCN worker nodes and then return to this procedure to complete the next step.

  5. Upgrade ncn-m001.

    1. Follow the steps documented in Stage 1.3 - ncn-m001 upgrade. Stop before performing the step that upgrades ncn-m001 itself and return to this document.

    2. Set the CFS configuration on ncn-m001.

      1. Get the image ID and CFS configuration created for management nodes during the prepare-images and update-cfs-config stages. Follow the instructions in the prepare-images Artifacts created documentation to get the values for final_image_id and configuration with a configuration_group_name value matching Management_Master. These values will be used in the following steps.

      2. (ncn-m#) Set CFS_CONFIG_NAME to the value for configuration found for Management_Master nodes in the previous step.

        CFS_CONFIG_NAME=<appropriate configuration value>
        
      3. (ncn-m#) Get the xname of ncn-m001.

        XNAME=$(ssh ncn-m001 'cat /etc/cray/xname')
        echo "${XNAME}"
        
      4. (ncn-m#) Set the CFS configuration on ncn-m001.

        /usr/share/doc/csm/scripts/operations/configuration/apply_csm_configuration.sh \
        --no-config-change --config-name "${CFS_CONFIG_NAME}" --xnames "${XNAME}" --no-enable --no-clear-err
        

        The expected output is:

        All components updated successfully.
        
    3. Set the image in BSS for ncn-m001 by following the Set NCN boot image for ncn-m001 and NCN storage nodes section of the Management nodes rollout stage documentation. Set the IMS_RESULTANT_IMAGE_ID variable to the final_image_id for Management_Master found in the previous step.

    4. (ncn-m002#) Upgrade ncn-m001. This must be executed on ncn-m002.

      NOTE If Kubernetes encryption has been enabled via the Kubernetes Encryption Documentation, then back up the /etc/cray/kubernetes/encryption directory on the master node before upgrading and restore the directory after the node has been upgraded.

      /usr/share/doc/csm/upgrade/scripts/upgrade/ncn-upgrade-master-nodes.sh ncn-m001
      

      NOTE The /etc/cray/kubernetes/encryption directory should be restored if it was backed up. Once it is restored, the kube-apiserver on the rebuilt node should be restarted. See Kubernetes kube-apiserver Failing for details on how to restart the kube-apiserver.

  6. Follow the steps documented in Stage 1.4 - Upgrade weave and multus.

  7. Follow the steps documented in Stage 1.5 - coredns anti-affinity.

  8. Follow the steps documented in Stage 1.6 - Complete Kubernetes upgrade.

Once this step has completed:

  • NCN master nodes and NCN worker nodes have been upgraded to the image and CFS configuration created in the previous steps of this workflow. NCN storage nodes have been personalized.
  • Per-stage product hooks have executed for the management-nodes-rollout stage

Continue to the next section 4. Restart goss-servers on all NCNs.

3.2 management-nodes-rollout without CSM upgrade

This is the procedure to roll out management nodes if CSM is not being upgraded. NCN worker node images contain kernel module content from non-CSM products and need to be rebuilt as part of the workflow. Unlike NCN worker nodes, NCN master nodes and storage nodes do not contain kernel module content from non-CSM products. However, user-space non-CSM product content is still provided on NCN master nodes and storage nodes, and thus the prepare-images and update-cfs-config stages create a new image and CFS configuration for NCN master nodes and storage nodes. The CFS configuration layers ensure the non-CSM product content is applied correctly for both image customization and node personalization scenarios. As a result, the administrator can update NCN master and storage nodes using a CFS configuration only. Follow the steps below to complete the management-nodes-rollout stage.

  1. The “Install and Upgrade Framework” section of each individual product’s installation document may contain special actions that need to be performed outside of IUF for a stage. The “IUF Stage Documentation Per Product” section of the HPE Cray EX System Software Stack Installation and Upgrade Guide for CSM (S-8052) provides a table that summarizes which product documents contain information or actions for the management-nodes-rollout stage. Refer to that table and any corresponding product documents before continuing to the next step.

  2. Rebuild the NCN worker nodes. Follow the procedure in section 3.3 NCN worker nodes and then return to this procedure to complete the next step.

  3. Personalize NCN storage nodes. Follow the procedure in section 3.4 Personalize NCN storage nodes and then return to this procedure to complete the next step.

  4. Personalize NCN master nodes.

    1. (ncn-m#) Get a comma-separated list of the xnames for all NCN master nodes and verify they are correct.

      MASTER_XNAMES=$(cray hsm state components list --role Management --subrole Master --type Node --format json | jq -r '.Components | map(.ID) | join(",")')
      echo "Master node xnames: ${MASTER_XNAMES}"
      
    2. Get the CFS configuration created for management nodes during the prepare-images and update-cfs-config stages. Follow the instructions in the prepare-images Artifacts created documentation to get the value for configuration for any image with a configuration_group_name value matching Management_Master, Management_Worker, or Management_Storage (since the configuration is the same for all management nodes).

    3. (ncn-m#) Set CFS_CONFIG_NAME to the value for configuration found in the previous step.

      CFS_CONFIG_NAME=<appropriate configuration value>
      
    4. (ncn-m#) Apply the CFS configuration to NCN master nodes.

      /usr/share/doc/csm/scripts/operations/configuration/apply_csm_configuration.sh \
      --no-config-change --config-name "${CFS_CONFIG_NAME}" --xnames "${MASTER_XNAMES}" --clear-state
      

      The expected output is:

      Configuration complete. 3 component(s) completed successfully.  0 component(s) failed.
      

Once this step has completed:

  • Management NCN worker nodes have been rebuilt with the image and CFS configuration created in previous steps of this workflow
  • Management NCN storage and NCN master nodes have been updated with the CFS configuration created in the previous steps of this workflow.
  • Per-stage product hooks have executed for the management-nodes-rollout stage

Continue to the next section 4. Restart goss-servers on all NCNs.

3.3 NCN worker nodes

NCN worker node images contain kernel module content from non-CSM products and need to be rebuilt as part of the workflow. This section describes how to test a new image and CFS configuration on a single canary node (ncn-w001) first before rolling it out to the other NCN worker nodes. Modify the procedure as necessary to accommodate site preferences for rebuilding NCN worker nodes. Since the default node target for the management-nodes-rollout stage is Management_Worker nodes, the --limit-management-rollout argument is not used in the instructions below.

The images and CFS configurations used are created by the prepare-images and update-cfs-config stages respectively; see the prepare-images Artifacts created documentation for details on how to query the images and CFS configurations and see the update-cfs-config documentation for details about how the CFS configuration is updated.

NOTE The management-nodes-rollout stage creates additional separate Argo workflows when rebuilding NCN worker nodes. The Argo workflow names will include the string ncn-lifecycle-rebuild. If monitoring progress with the Argo UI, remember to include these workflows.
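
These workflows can also be listed from the command line as a convenience; the sketch below assumes the argo namespace used by IUF on CSM systems.

(ncn-m001#) List the NCN rebuild workflows created by the stage.

kubectl -n argo get workflows | grep ncn-lifecycle-rebuild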

  1. The “Install and Upgrade Framework” section of each individual product’s installation document may contain special actions that need to be performed outside of IUF for a stage. The “IUF Stage Documentation Per Product” section of the HPE Cray EX System Software Stack Installation and Upgrade Guide for CSM (S-8052) provides a table that summarizes which product documents contain information or actions for the management-nodes-rollout stage. Refer to that table and any corresponding product documents before continuing to the next step.

  2. Use kubectl to label all NCN worker nodes but one with iuf-prevent-rollout=true to ensure management-nodes-rollout only rebuilds a single NCN worker node. This node is referred to as the canary node in the remainder of this section and the steps are documented with ncn-w001 as the canary node.

    (ncn-m001#) Label an NCN worker node to prevent it from rebuilding. Replace the example value of ${HOSTNAME} with the appropriate value. Repeat this step for all NCN worker nodes except for the canary node.

    HOSTNAME=ncn-w002
    kubectl label nodes "${HOSTNAME}" --overwrite iuf-prevent-rollout=true
    

    (ncn-m001#) Verify the IUF node labels are present on the desired nodes.

    kubectl get nodes --show-labels | grep iuf-prevent-rollout
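
    If the system has many worker nodes, the labeling can be scripted rather than repeated by hand. The loop below is a convenience sketch; it assumes worker hostnames follow the ncn-wNNN pattern and that ncn-w001 is the canary node.

    (ncn-m001#) Example: label all NCN worker nodes except the canary node.

    CANARY=ncn-w001
    for HOSTNAME in $(kubectl get nodes -o name | grep -oP 'ncn-w\d+' | grep -v "^${CANARY}$"); do
      kubectl label nodes "${HOSTNAME}" --overwrite iuf-prevent-rollout=true
    done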
    
  3. Invoke iuf run with -r to execute the management-nodes-rollout stage on the canary node. This will rebuild the canary node with the new CFS configuration and image built in previous steps of the workflow.

    (ncn-m001#) Execute the management-nodes-rollout stage with a single NCN worker node.

    iuf -a "${ACTIVITY_NAME}" run -r management-nodes-rollout
    
  4. Verify the canary node booted successfully with the desired image and CFS configuration.

  5. Use kubectl to remove the iuf-prevent-rollout=true label from all NCN worker nodes and apply it to the canary node to prevent it from unnecessarily rebuilding again.

    (ncn-m001#) Remove the label from an NCN worker node to allow it to rebuild. Replace the example value of ${HOSTNAME} with the appropriate value. Repeat this step for all NCN worker nodes except for the canary node.

    HOSTNAME=ncn-w002
    kubectl label nodes "${HOSTNAME}" --overwrite iuf-prevent-rollout-
    

    (ncn-m001#) Label the canary node to prevent it from rebuilding. Replace the example value of ${HOSTNAME} with the hostname of the canary node.

    HOSTNAME=ncn-w001
    kubectl label nodes "${HOSTNAME}" --overwrite iuf-prevent-rollout=true
    
  6. Invoke iuf run with -r to execute the management-nodes-rollout stage on all remaining NCN worker nodes. This will rebuild the nodes with the new CFS configuration and image built in previous steps of the workflow.

    (ncn-m001#) Execute the management-nodes-rollout stage on all remaining NCN worker nodes.

    iuf -a "${ACTIVITY_NAME}" run -r management-nodes-rollout
    
  7. Use kubectl to remove the iuf-prevent-rollout=true label from the canary node. Replace the example value of ${HOSTNAME} with the hostname of the canary node.

    HOSTNAME=ncn-w001
    kubectl label nodes "${HOSTNAME}" --overwrite iuf-prevent-rollout-
    

Once this step has completed:

  • Management NCN worker nodes have been rebuilt with the image and CFS configuration created in previous steps of this workflow
  • Per-stage product hooks have executed for the management-nodes-rollout stage

Return to the procedure that was being followed for management-nodes-rollout to complete the next step, either Management-nodes-rollout with CSM upgrade or Management-nodes-rollout without CSM upgrade.

3.4 Personalize NCN storage nodes

NOTE A customized image is created for NCN storage nodes during the prepare-images stage. For the upgrade from CSM 1.3 to CSM 1.4, that image is the same image that is already running on NCN storage nodes, so there is no need to upgrade into it. However, if it is desired to roll out the NCN storage nodes with the customized image, this can be done by following upgrade NCN storage nodes into the customized image. This is not the recommended procedure; it is recommended to personalize the NCN storage nodes by following the steps below.

  1. Personalize NCN storage nodes.

    1. (ncn-m#) Get a comma-separated list of the xnames for all NCN storage nodes and verify they are correct.

      STORAGE_XNAMES=$(cray hsm state components list --role Management --subrole Storage --type Node --format json | jq -r '.Components | map(.ID) | join(",")')
      echo "Storage node xnames: ${STORAGE_XNAMES}"
      
    2. Get the CFS configuration created for management nodes during the update-cfs-config stage. Follow the instructions in the prepare-images Artifacts created documentation to get the value for configuration for images with a configuration_group_name value matching Management_Storage. This value will be needed in the following step.

    3. (ncn-m#) Set CFS_CONFIG_NAME to the value for configuration found in the previous step.

      CFS_CONFIG_NAME=<appropriate configuration value>
      
    4. (ncn-m#) Apply the CFS configuration to NCN storage nodes.

      /usr/share/doc/csm/scripts/operations/configuration/apply_csm_configuration.sh \
      --no-config-change --config-name "${CFS_CONFIG_NAME}" --xnames "${STORAGE_XNAMES}" --clear-state
      

      The expected output is:

      Configuration complete. 6 component(s) completed successfully.  0 component(s) failed.
      

Once this step has completed:

  • NCN storage nodes have been updated with the CFS configuration created during update-cfs-config.

Return to the procedure that was being followed for management-nodes-rollout to complete the next step, either Management-nodes-rollout with CSM upgrade or Management-nodes-rollout without CSM upgrade.

4. Restart goss-servers on all NCNs

The goss-servers service needs to be restarted on all NCNs. This ensures the correct tests are run on each NCN. This is necessary due to a timing issue that is fixed in CSM 1.6.1.

(ncn-m001#) Restart goss-servers.

ncn_nodes=$(grep -oP "(ncn-s\w+|ncn-m\w+|ncn-w\w+)" /etc/hosts | sort -u | tr -t '\n' ',')
ncn_nodes=${ncn_nodes%,}
pdsh -S -b -w $ncn_nodes 'systemctl restart goss-servers'
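
As an optional sanity check, the service state can be verified across the same set of nodes; this reuses the ncn_nodes variable set above.

(ncn-m001#) Verify goss-servers is active on all NCNs.

pdsh -S -b -w $ncn_nodes 'systemctl is-active goss-servers'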

Continue to the next section 5. Update management host Slingshot NIC firmware.

5. Update management host Slingshot NIC firmware

If new Slingshot NIC firmware was provided, refer to the “200Gbps NIC Firmware Management” section of the HPE Slingshot Operations Guide for details on how to update NIC firmware on management nodes.

After updating management host Slingshot NIC firmware, all nodes where the firmware was updated must be power cycled. Follow the reboot NCNs procedure for all nodes where the firmware was updated.

Once this step has completed:

  • Slingshot NIC firmware has been updated on management nodes where new firmware was provided
  • All nodes where the firmware was updated have been power cycled

6. Next steps