This section updates the software running on management NCNs.
1. Perform Slingshot switch firmware updates

Instructions to perform Slingshot switch firmware updates are provided in the “Upgrade Slingshot Switch Firmware on HPE Cray EX” section of the HPE Slingshot Operations Guide.
Once this step has completed, continue to the next section 2. Update management node firmware.

2. Update management node firmware

Refer to Update Non-Compute Node (NCN) BIOS and BMC Firmware for details on how to upgrade the firmware on management nodes.

Once this step has completed, continue to the next section 3. Execute the IUF management-nodes-rollout stage.
3. Execute the IUF management-nodes-rollout stage

This section describes how to update software on management nodes. It describes how to test a new image and CFS configuration on a single node first to ensure they work as expected before rolling the changes out to the other management nodes. This initial test node is referred to as the “canary node”. Modify the procedure as necessary to accommodate site preferences for rebuilding management nodes. The images and CFS configurations used are created by the prepare-images and update-cfs-config stages, respectively; see the prepare-images Artifacts created documentation for details on how to query the images and CFS configurations, and see the update-cfs-config documentation for details about how the CFS configuration is updated.
NOTE Additional arguments are available to control the behavior of the management-nodes-rollout stage, for example --limit-management-rollout and -cmrp. See the
management-nodes-rollout stage documentation for details and adjust the examples below if necessary.
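For example, a run that targets only NCN worker nodes and limits how many of them rebuild concurrently might look like the following sketch; the percentage value is illustrative, so check the stage documentation for the exact semantics of -cmrp before using it:

# Roll out only Management_Worker nodes, rebuilding at most 20% of them at a time.
iuf -a "${ACTIVITY_NAME}" run -r management-nodes-rollout --limit-management-rollout Management_Worker -cmrp 20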
IMPORTANT There is a different procedure for management-nodes-rollout depending on whether or not CSM is being upgraded. The two procedures differ in the handling of NCN master nodes. If CSM is not being upgraded, then NCN master nodes will not be upgraded with new images; they will be updated only by the CFS configuration created in update-cfs-config. If CSM is being upgraded, then NCN master nodes will be upgraded with new images and the new CFS configuration. Both procedures use the same steps for rebuilding/upgrading NCN worker nodes and personalizing NCN storage nodes. Select one of the following procedures based on whether or not CSM is being upgraded:
3.1 management-nodes-rollout with CSM upgrade

NCN master nodes and NCN worker nodes will be upgraded to a new image because CSM itself is being upgraded. NCN master nodes, excluding ncn-m001, and NCN worker nodes will be upgraded with IUF. ncn-m001 will be upgraded with manual commands.
NCN storage nodes are not upgraded as part of the CSM 1.3 to CSM 1.4 upgrade, but they will be personalized with a CFS configuration created during IUF.
This section describes how to test a new image and CFS configuration on a single canary node for NCN master nodes and NCN worker nodes first before rolling it out to the other NCN master nodes and NCN worker nodes.
Follow the steps below to upgrade NCN master and worker nodes and to personalize NCN storage nodes.
The “Install and Upgrade Framework” section of each individual product’s installation document may contain special actions that need to be performed outside of IUF for a stage. The “IUF Stage Documentation Per Product”
section of the HPE Cray EX System Software Stack Installation and Upgrade Guide for CSM (S-8052) provides a table that summarizes which product documents contain information or actions for the management-nodes-rollout stage.
Refer to that table and any corresponding product documents before continuing to the next step.
Personalize NCN storage nodes. Follow the procedure in section 3.4 Personalize NCN storage nodes and then return to this procedure to complete the next step.
Perform the NCN master node upgrade on ncn-m002 and ncn-m003.
Use kubectl to label ncn-m003 with iuf-prevent-rollout=true to ensure management-nodes-rollout only rebuilds the single NCN master node ncn-m002.
(ncn-m001#) Label ncn-m003 to prevent it from rebuilding.
kubectl label nodes "ncn-m003" --overwrite iuf-prevent-rollout=true
(ncn-m001#) Verify the IUF node label is present on the desired node.
kubectl get nodes --show-labels | grep iuf-prevent-rollout
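Alternatively, a label selector lists only the labeled nodes directly, which can be easier to read than filtering the full label output:

kubectl get nodes -l iuf-prevent-rollout=true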
Invoke iuf run with -r to execute the management-nodes-rollout stage on ncn-m002. This will rebuild ncn-m002 with the
new CFS configuration and image built in previous steps of the workflow.
NOTE If Kubernetes encryption has been enabled via the Kubernetes Encryption Documentation, then back up the /etc/cray/kubernetes/encryption directory on the master node before upgrading and restore the directory after the node has been upgraded.
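A minimal backup-and-restore sketch, assuming passwordless SSH from ncn-m001 and that /root on ncn-m001 is an acceptable staging location; adjust paths to site practice:

# Before the upgrade: stream a tar archive of the encryption directory off the node.
ssh ncn-m002 'tar -czf - -C / etc/cray/kubernetes/encryption' > /root/ncn-m002-encryption.tar.gz
# After the node has been rebuilt: restore the directory from the archive.
ssh ncn-m002 'tar -xzf - -C /' < /root/ncn-m002-encryption.tar.gz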
(ncn-m001#) Execute the management-nodes-rollout stage with ncn-m002.
iuf -a "${ACTIVITY_NAME}" run -r management-nodes-rollout --limit-management-rollout Management_Master
NOTE The /etc/cray/kubernetes/encryption directory should be restored if it was backed up. Once it is restored, the kube-apiserver on the rebuilt node should be restarted. See Kubernetes kube-apiserver Failing for details on how to restart the kube-apiserver.
Verify that ncn-m002 booted successfully with the desired image and CFS configuration.
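One way to spot-check the result is sketched below; the CFS field names desiredConfig and configurationStatus are assumptions based on the CFS component schema, so adjust them to match the CFS version in use:

# Confirm the node's CFS state converged on the expected configuration.
XNAME=$(ssh ncn-m002 'cat /etc/cray/xname')
cray cfs components describe "${XNAME}" --format json | jq -r '.desiredConfig, .configurationStatus'
# Inspect the kernel command line to confirm the booted image matches the expected artifact.
ssh ncn-m002 'cat /proc/cmdline'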
Use kubectl to remove the iuf-prevent-rollout=true label from ncn-m003 and add it to ncn-m002.
(ncn-m001#) Remove the label from ncn-m003 and add it to ncn-m002 to prevent ncn-m002 from rebuilding.
kubectl label nodes "ncn-m002" --overwrite iuf-prevent-rollout=true
kubectl label nodes "ncn-m003" --overwrite iuf-prevent-rollout-
(ncn-m001#) Verify the IUF node label is present on the desired node.
kubectl get nodes --show-labels | grep iuf-prevent-rollout
Invoke iuf run with -r to execute the management-nodes-rollout stage on ncn-m003. This will rebuild ncn-m003 with the new CFS configuration and image built in
previous steps of the workflow.
NOTE If Kubernetes encryption has been enabled via the Kubernetes Encryption Documentation, then back up the /etc/cray/kubernetes/encryption directory on the master node before upgrading and restore the directory after the node has been upgraded.
(ncn-m001#) Execute the management-nodes-rollout stage with ncn-m003.
iuf -a "${ACTIVITY_NAME}" run -r management-nodes-rollout --limit-management-rollout Management_Master
NOTE The /etc/cray/kubernetes/encryption directory should be restored if it was backed up. Once it is restored, the kube-apiserver on the rebuilt node should be restarted.
Use kubectl to remove the iuf-prevent-rollout=true label from ncn-m002.
(ncn-m001#) Remove label from ncn-m002.
kubectl label nodes "ncn-m002" --overwrite iuf-prevent-rollout-
(ncn-m001#) Verify the IUF node label is no longer set on ncn-m002.
kubectl get nodes --show-labels | grep iuf-prevent-rollout
Perform the NCN worker node upgrade. To upgrade worker nodes, follow the procedure in section 3.3 NCN worker nodes and then return to this procedure to complete the next step.
Upgrade ncn-m001.
Follow the steps documented in Stage 1.3 - ncn-m001 upgrade.
Stop before performing the specific upgrade ncn-m001 step and return to this document.
Set the CFS configuration on ncn-m001.
Get the image ID and CFS configuration created for management nodes during the prepare-images and update-cfs-config stages. Follow the instructions in the
prepare-images Artifacts created documentation to get the values for final_image_id and configuration with a
configuration_group_name value matching Management_Master. These values will be used in the following steps.
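If preferred, the raw stage output can also be inspected directly from the IUF activity state. The sketch below assumes the activity is stored as a ConfigMap named after the activity in the argo namespace with an iuf_activity data key, and that prepare-images output lives under operation_outputs; verify the exact paths against the prepare-images Artifacts created documentation:

# Dump the prepare-images stage parameters recorded for this activity (paths are assumptions).
kubectl -n argo get configmap "${ACTIVITY_NAME}" -o jsonpath='{.data.iuf_activity}' \
  | jq '.operation_outputs.stage_params["prepare-images"]'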
(ncn-m#) Set CFS_CONFIG_NAME to the value for configuration found for Management_Master nodes in the previous step.
CFS_CONFIG_NAME=<appropriate configuration value>
(ncn-m#) Get the xname of ncn-m001.
XNAME=$(ssh ncn-m001 'cat /etc/cray/xname')
echo "${XNAME}"
(ncn-m#) Set the CFS configuration on ncn-m001.
/usr/share/doc/csm/scripts/operations/configuration/apply_csm_configuration.sh \
--no-config-change --config-name "${CFS_CONFIG_NAME}" --xnames "${XNAME}" --no-enable --no-clear-err
The expected output is:
All components updated successfully.
Set the image in BSS for ncn-m001 by following the Set NCN boot image for ncn-m001 and NCN storage nodes
section of the Management nodes rollout stage documentation.
Set the IMS_RESULTANT_IMAGE_ID variable to the final_image_id for Management_Master found in the previous step.
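(ncn-m#) Following the placeholder convention used for CFS_CONFIG_NAME above:

IMS_RESULTANT_IMAGE_ID=<appropriate final_image_id value>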
(ncn-m002#) Upgrade ncn-m001. This must be executed on ncn-m002.
NOTE If Kubernetes encryption has been enabled via the Kubernetes Encryption Documentation, then back up the /etc/cray/kubernetes/encryption directory on the master node before upgrading and restore the directory after the node has been upgraded.
/usr/share/doc/csm/upgrade/scripts/upgrade/ncn-upgrade-master-nodes.sh ncn-m001
NOTE The /etc/cray/kubernetes/encryption directory should be restored if it was backed up. Once it is restored, the kube-apiserver on the rebuilt node should be restarted. See Kubernetes kube-apiserver Failing for details on how to restart the kube-apiserver.
Follow the steps documented in Stage 1.4 - Upgrade weave and multus.
Follow the steps documented in Stage 1.5 - coredns anti-affinity.
Follow the steps documented in Stage 1.6 - Complete Kubernetes upgrade.
Once this step has completed, the management-nodes-rollout stage is complete. Continue to the next section 4. Restart goss-servers on all NCNs.
3.2 management-nodes-rollout without CSM upgrade

This is the procedure to roll out management nodes if CSM is not being upgraded. NCN worker node images contain kernel module content from non-CSM products and need to be rebuilt as part of the workflow. Unlike NCN worker nodes, NCN master nodes and storage nodes do not contain kernel module content from non-CSM products. However, user-space non-CSM product content is still provided on NCN master nodes and storage nodes, and thus the prepare-images and update-cfs-config stages create a new image and CFS configuration for NCN master nodes and storage nodes. The CFS configuration layers ensure the non-CSM product content is applied correctly for both image customization and node personalization scenarios. As a result, the administrator can update NCN master and storage nodes using CFS configuration only.
Follow the steps below to complete the management-nodes-rollout stage.
The “Install and Upgrade Framework” section of each individual product’s installation document may contain special actions that need to be performed outside of IUF for a stage. The “IUF Stage Documentation Per Product”
section of the HPE Cray EX System Software Stack Installation and Upgrade Guide for CSM (S-8052) provides a table that summarizes which product documents contain information or actions for the management-nodes-rollout stage.
Refer to that table and any corresponding product documents before continuing to the next step.
Rebuild the NCN worker nodes. Follow the procedure in section 3.3 NCN worker nodes and then return to this procedure to complete the next step.
Personalize NCN storage nodes. Follow the procedure in section 3.4 Personalize NCN storage nodes and then return to this procedure to complete the next step.
Personalize NCN master nodes.
(ncn-m#) Get a comma-separated list of the xnames for all NCN master nodes and verify they are correct.
MASTER_XNAMES=$(cray hsm state components list --role Management --subrole Master --type Node --format json | jq -r '.Components | map(.ID) | join(",")')
echo "Master node xnames: ${MASTER_XNAMES}"
Get the CFS configuration created for management nodes during the prepare-images and update-cfs-config stages. Follow the instructions in the prepare-images Artifacts created documentation to get the value for configuration for any image with a configuration_group_name value matching Management_Master, Management_Worker, or Management_Storage (since the configuration is the same for all management nodes).
(ncn-m#) Set CFS_CONFIG_NAME to the value for configuration found in the previous step.
CFS_CONFIG_NAME=<appropriate configuration value>
(ncn-m#) Apply the CFS configuration to NCN master nodes.
/usr/share/doc/csm/scripts/operations/configuration/apply_csm_configuration.sh \
--no-config-change --config-name "${CFS_CONFIG_NAME}" --xnames "${MASTER_XNAMES}" --clear-state
The expected output is:
Configuration complete. 3 component(s) completed successfully. 0 component(s) failed.
Once this step has completed, the management-nodes-rollout stage is complete. Continue to the next section 4. Restart goss-servers on all NCNs.

3.3 NCN worker nodes
NCN worker node images contain kernel module content from non-CSM products and need to be rebuilt as part of the workflow. This section describes how to test a new image and CFS configuration on a single canary node (ncn-w001) first before rolling it out to the other NCN worker nodes. Modify the procedure as necessary to accommodate site preferences for rebuilding NCN worker nodes. Since the default node target for the management-nodes-rollout stage is Management_Worker nodes, the --limit-management-rollout argument is not used in the instructions below.
The images and CFS configurations used are created by the prepare-images and update-cfs-config stages respectively; see the prepare-images Artifacts created documentation
for details on how to query the images and CFS configurations and see the update-cfs-config documentation for details about how the CFS configuration is updated.
NOTE The management-nodes-rollout stage creates additional separate Argo workflows when rebuilding NCN worker nodes. The Argo workflow names will include the string ncn-lifecycle-rebuild. If monitoring progress with the Argo UI,
remember to include these workflows.
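For example, assuming kubectl access to the argo namespace, the extra workflows can be listed alongside the main IUF workflows with:

kubectl -n argo get workflows | grep ncn-lifecycle-rebuild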
The “Install and Upgrade Framework” section of each individual product’s installation document may contain special actions that need to be performed outside of IUF for a stage. The “IUF Stage Documentation Per Product”
section of the HPE Cray EX System Software Stack Installation and Upgrade Guide for CSM (S-8052) provides a table that summarizes which product documents contain information or actions for the management-nodes-rollout stage.
Refer to that table and any corresponding product documents before continuing to the next step.
Use kubectl to label all NCN worker nodes but one with iuf-prevent-rollout=true to ensure management-nodes-rollout only rebuilds a single NCN worker node. This node is referred to as the canary node in the remainder of
this section and the steps are documented with ncn-w001 as the canary node.
(ncn-m001#) Label an NCN to prevent it from rebuilding. Replace the example value of ${HOSTNAME} with the appropriate value. Repeat this step for all NCN worker nodes except for the canary node, or use the loop sketched below.
HOSTNAME=ncn-w002
kubectl label nodes "${HOSTNAME}" --overwrite iuf-prevent-rollout=true
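A loop sketch that labels every worker except the canary node in one pass; the hostnames are illustrative and should be replaced with the site's actual worker node list:

# Label all non-canary worker nodes so only the canary node rebuilds.
for HOSTNAME in ncn-w002 ncn-w003 ncn-w004; do
    kubectl label nodes "${HOSTNAME}" --overwrite iuf-prevent-rollout=true
done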
(ncn-m001#) Verify the IUF node labels are present on the desired nodes.
kubectl get nodes --show-labels | grep iuf-prevent-rollout
Invoke iuf run with -r to execute the management-nodes-rollout stage on the canary node. This will rebuild the canary node with the new CFS configuration and image built in
previous steps of the workflow.
(ncn-m001#) Execute the management-nodes-rollout stage with a single NCN worker node.
iuf -a "${ACTIVITY_NAME}" run -r management-nodes-rollout
Verify the canary node booted successfully with the desired image and CFS configuration.
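A quick check is sketched below: confirm the node reports Ready in Kubernetes and that its CFS state converged. The configurationStatus field name is an assumption based on the CFS component schema; adjust to the CFS version in use:

# The canary node should report Ready after the rebuild.
kubectl get node ncn-w001
# Its CFS component should report a configured status.
cray cfs components describe "$(ssh ncn-w001 'cat /etc/cray/xname')" --format json | jq -r '.configurationStatus'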
Use kubectl to remove the iuf-prevent-rollout=true label from all NCN worker nodes and apply it to the canary node to prevent it from unnecessarily rebuilding again.
(ncn-m001#) Remove the label from an NCN to allow it to rebuild. Replace the example value of ${HOSTNAME} with the appropriate value. Repeat this step for all NCN worker nodes except for the canary node.
HOSTNAME=ncn-w002
kubectl label nodes "${HOSTNAME}" --overwrite iuf-prevent-rollout-
(ncn-m001#) Label the canary node to prevent it from rebuilding. Replace the example value of ${HOSTNAME} with the hostname of the canary node.
HOSTNAME=ncn-w001
kubectl label nodes "${HOSTNAME}" --overwrite iuf-prevent-rollout=true
Invoke iuf run with -r to execute the management-nodes-rollout stage on all remaining NCN worker nodes. This will rebuild the nodes with the new CFS configuration and
image built in previous steps of the workflow.
(ncn-m001#) Execute the management-nodes-rollout stage on all remaining NCN worker nodes.
iuf -a "${ACTIVITY_NAME}" run -r management-nodes-rollout
Use kubectl to remove the iuf-prevent-rollout=true label from the canary node. Replace the example value of ${HOSTNAME} with the hostname of the canary node.
HOSTNAME=ncn-w001
kubectl label nodes "${HOSTNAME}" --overwrite iuf-prevent-rollout-
Once this step has completed, the NCN worker node rollout is complete. Return to the procedure that was being followed for management-nodes-rollout to complete the next step, either 3.1 management-nodes-rollout with CSM upgrade or 3.2 management-nodes-rollout without CSM upgrade.

3.4 Personalize NCN storage nodes
NOTE A customized image is created for NCN storage nodes during the prepare-images stage. For the upgrade from CSM 1.3 to CSM 1.4, that image is the same image that is already running on NCN storage nodes, so there is no need to upgrade into that image. However, if it is desired to roll out the NCN storage nodes with the customized image, this can be done by following Upgrade NCN storage nodes into the customized image. This is not the recommended procedure; it is recommended to personalize the NCN storage nodes by following the steps below.
Personalize NCN storage nodes.
(ncn-m#) Get a comma-separated list of the xnames for all NCN storage nodes and verify they are correct.
STORAGE_XNAMES=$(cray hsm state components list --role Management --subrole Storage --type Node --format json | jq -r '.Components | map(.ID) | join(",")')
echo "Storage node xnames: ${STORAGE_XNAMES}"
Get the CFS configuration created for management nodes during the prepare-images and update-cfs-config stages. Follow the instructions in the prepare-images Artifacts created documentation to get the value for configuration for images with a configuration_group_name value matching Management_Storage. This value will be needed in the following step.
(ncn-m#) Set CFS_CONFIG_NAME to the value for configuration found in the previous step.
CFS_CONFIG_NAME=<appropriate configuration value>
(ncn-m#) Apply the CFS configuration to NCN storage nodes.
/usr/share/doc/csm/scripts/operations/configuration/apply_csm_configuration.sh \
--no-config-change --config-name "${CFS_CONFIG_NAME}" --xnames "${STORAGE_XNAMES}" --clear-state
The expected output is:
Configuration complete. 6 component(s) completed successfully. 0 component(s) failed.
Once this step has completed, return to the procedure that was being followed for management-nodes-rollout to complete the next step, either 3.1 management-nodes-rollout with CSM upgrade or 3.2 management-nodes-rollout without CSM upgrade.
4. Restart goss-servers on all NCNs

The goss-servers service needs to be restarted on all NCNs to ensure the correct tests are run on each NCN. This is necessary due to a timing issue that is fixed in CSM 1.6.1.
(ncn-m001#) Restart goss-servers.
ncn_nodes=$(grep -oP "(ncn-s\w+|ncn-m\w+|ncn-w\w+)" /etc/hosts | sort -u | tr -t '\n' ',')
ncn_nodes=${ncn_nodes%,}
pdsh -S -b -w $ncn_nodes 'systemctl restart goss-servers'
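To confirm the restart took effect everywhere, the same node list can be reused to query the service state; every node should report active:

pdsh -S -b -w "${ncn_nodes}" 'systemctl is-active goss-servers'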
Continue to the next section 5. Update management host Slingshot NIC firmware.
5. Update management host Slingshot NIC firmware

If new Slingshot NIC firmware was provided, refer to the “200Gbps NIC Firmware Management” section of the HPE Slingshot Operations Guide for details on how to update NIC firmware on management nodes.
After updating management host Slingshot NIC firmware, all nodes where the firmware was updated must be power cycled. Follow the reboot NCNs procedure for all nodes where the firmware was updated.
Once this step has completed, management host Slingshot NIC firmware has been updated. Continue to the next section 6. Execute the IUF deploy-product and post-install-service-check stages.

6. Execute the IUF deploy-product and post-install-service-check stages

If performing an initial install or an upgrade of non-CSM products only, then return to the Install or upgrade additional products with IUF workflow to continue the install or upgrade.
If performing an upgrade that includes upgrading CSM, then return to the Upgrade CSM and additional products with IUF workflow to continue the upgrade.