This section updates the software running on management NCNs.
1. Perform Slingshot switch firmware updates

Instructions to perform Slingshot switch firmware updates are provided in the “Upgrade Slingshot Switch Firmware on HPE Cray EX” section of the HPE Slingshot Operations Guide.
Once this step has completed, continue to the next section 2. Update management node firmware.

2. Update management node firmware

Refer to Update Non-Compute Node (NCN) BIOS and BMC Firmware for details on how to upgrade the firmware on management nodes.
Once this step has completed, continue to the next section 3. Execute the IUF management-nodes-rollout stage.

3. Execute the IUF management-nodes-rollout stage

This section describes how to update software on management nodes. It describes how to test a new image and CFS configuration on a single node first to ensure they work as expected before rolling the changes out to the other management nodes. This initial test node is referred to as the “canary node”. Modify the procedure as necessary to accommodate site preferences for rebuilding management nodes. The images and CFS configurations used are created by the prepare-images and update-cfs-config stages respectively; see the prepare-images Artifacts created documentation for details on how to query the images and CFS configurations, and see the update-cfs-config documentation for details about how the CFS configuration is updated.
NOTE Additional arguments are available to control the behavior of the management-nodes-rollout stage, for example --limit-management-rollout and -cmrp. See the management-nodes-rollout stage documentation for details and adjust the examples below if necessary.
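For example, a run that limits the rollout to Management_Worker nodes and lowers the concurrent rollout percentage might look like the following sketch; the flag values shown are illustrative only and should be adjusted per the stage documentation:

iuf -a "${ACTIVITY_NAME}" run -r management-nodes-rollout --limit-management-rollout Management_Worker -cmrp 20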
IMPORTANT There is a different procedure for management-nodes-rollout depending on whether or not CSM is being upgraded. The two procedures differ in their handling of NCN master nodes. If CSM is not being upgraded, then NCN master nodes will not be upgraded with new images; they will be updated only by the CFS configuration created in update-cfs-config. If CSM is being upgraded, the NCN master nodes will be upgraded with new images and the new CFS configuration. Both procedures use the same steps for rebuilding/upgrading NCN worker nodes and for personalizing NCN storage nodes. Select one of the following procedures based on whether or not CSM is being upgraded:
3.1 management-nodes-rollout with CSM upgrade

NCN master nodes and NCN worker nodes will be upgraded to a new image because CSM itself is being upgraded. NCN master nodes, excluding ncn-m001, and NCN worker nodes will be upgraded with IUF; ncn-m001 will be upgraded with manual commands. NCN storage nodes are not upgraded as part of the CSM 1.3 to CSM 1.4 upgrade, but they will be personalized with a CFS configuration created during IUF.

This section describes how to test a new image and CFS configuration on a single canary node for NCN master nodes and NCN worker nodes first, before rolling them out to the other NCN master nodes and NCN worker nodes. Follow the steps below to upgrade NCN master and worker nodes and to personalize NCN storage nodes.
The “Install and Upgrade Framework” section of each individual product’s installation document may contain special actions that need to be performed outside of IUF for a stage. The “IUF Stage Documentation Per Product” section of the HPE Cray EX System Software Stack Installation and Upgrade Guide for CSM (S-8052) provides a table that summarizes which product documents contain information or actions for the management-nodes-rollout stage. Refer to that table and any corresponding product documents before continuing to the next step.
Personalize NCN storage nodes. Follow the procedure in section 3.4 Personalize NCN storage nodes and then return to this procedure to complete the next step.
Perform the NCN master node upgrade on ncn-m002 and ncn-m003.

Use kubectl to label ncn-m003 with iuf-prevent-rollout=true to ensure management-nodes-rollout only rebuilds the single NCN master node ncn-m002.
(ncn-m001#) Label ncn-m003 to prevent it from rebuilding.
kubectl label nodes "ncn-m003" --overwrite iuf-prevent-rollout=true
(ncn-m001#) Verify the IUF node label is present on the desired node.
kubectl get nodes --show-labels | grep iuf-prevent-rollout
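Alternatively, a label selector can be used to list only the nodes that currently carry the label:

kubectl get nodes -l iuf-prevent-rollout=true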
Invoke iuf run with -r to execute the management-nodes-rollout stage on ncn-m002. This will rebuild ncn-m002 with the new CFS configuration and image built in previous steps of the workflow.
NOTE If Kubernetes encryption has been enabled via the Kubernetes Encryption Documentation, then back up the /etc/cray/kubernetes/encryption directory on the master node before upgrading and restore the directory after the node has been upgraded.
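Because the node is rebuilt during the upgrade, the backup must be stored somewhere that survives the rebuild. A minimal sketch, run from ncn-m001, with an illustrative backup path:

# Back up before the rebuild (backup path is illustrative)
scp -r ncn-m002:/etc/cray/kubernetes/encryption /root/ncn-m002-encryption-backup

# Restore after the node has been rebuilt
ssh ncn-m002 'mkdir -p /etc/cray/kubernetes/encryption'
scp -r /root/ncn-m002-encryption-backup/* ncn-m002:/etc/cray/kubernetes/encryption/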
(ncn-m001#) Execute the management-nodes-rollout stage with ncn-m002.
iuf -a "${ACTIVITY_NAME}" run -r management-nodes-rollout --limit-management-rollout Management_Master
NOTE The /etc/cray/kubernetes/encryption directory should be restored if it was backed up. Once it is restored, the kube-apiserver on the rebuilt node should be restarted. See Kubernetes kube-apiserver Failing for details on how to restart the kube-apiserver.
Verify that ncn-m002 booted successfully with the desired image and CFS configuration.
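One way to check the node's CFS state is to query its component in CFS; this is a sketch, assuming the cray CLI is authenticated and that the node's xname is readable over ssh:

XNAME_M002=$(ssh ncn-m002 'cat /etc/cray/xname')
cray cfs components describe "${XNAME_M002}" --format json | jq '.desiredConfig, .configurationStatus'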
Use kubectl to remove the iuf-prevent-rollout=true label from ncn-m003 and add it to ncn-m002.
(ncn-m001#) Remove the label from ncn-m003 and add it to ncn-m002 to prevent ncn-m002 from rebuilding.
kubectl label nodes "ncn-m002" --overwrite iuf-prevent-rollout=true
kubectl label nodes "ncn-m003" --overwrite iuf-prevent-rollout-
(ncn-m001#) Verify the IUF node label is present on the desired node.
kubectl get nodes --show-labels | grep iuf-prevent-rollout
Invoke iuf run with -r to execute the management-nodes-rollout stage on ncn-m003. This will rebuild ncn-m003 with the new CFS configuration and image built in previous steps of the workflow.
NOTE If Kubernetes encryption has been enabled via the Kubernetes Encryption Documentation, then back up the /etc/cray/kubernetes/encryption directory on the master node before upgrading and restore the directory after the node has been upgraded.
(ncn-m001#) Execute the management-nodes-rollout stage with ncn-m003.
iuf -a "${ACTIVITY_NAME}" run -r management-nodes-rollout --limit-management-rollout Management_Master
NOTE The /etc/cray/kubernetes/encryption directory should be restored if it was backed up. Once it is restored, the kube-apiserver on the rebuilt node should be restarted.
Use kubectl to remove the iuf-prevent-rollout=true label from ncn-m002.
(ncn-m001#) Remove the label from ncn-m002.
kubectl label nodes "ncn-m002" --overwrite iuf-prevent-rollout-
(ncn-m001#) Verify the IUF node label is no longer set on ncn-m002.
kubectl get nodes --show-labels | grep iuf-prevent-rollout
Perform the NCN worker node upgrade. To upgrade worker nodes, follow the procedure in section 3.3 NCN worker nodes and then return to this procedure to complete the next step.
Upgrade ncn-m001.

Follow the steps documented in Stage 1.3 - ncn-m001 upgrade. Stop before performing the step that actually upgrades ncn-m001 and return to this document.
Set the CFS configuration on ncn-m001.
Get the image ID and CFS configuration created for management nodes during the prepare-images and update-cfs-config stages. Follow the instructions in the prepare-images Artifacts created documentation to get the values for final_image_id and configuration with a configuration_group_name value matching Management_Master. These values will be used in the following steps.
(ncn-m#) Set CFS_CONFIG_NAME to the value for configuration found for Management_Master nodes in the previous step.
CFS_CONFIG_NAME=<appropriate configuration value>
(ncn-m#) Get the xname of ncn-m001.
XNAME=$(ssh ncn-m001 'cat /etc/cray/xname')
echo "${XNAME}"
(ncn-m#) Set the CFS configuration on ncn-m001.
/usr/share/doc/csm/scripts/operations/configuration/apply_csm_configuration.sh \
--no-config-change --config-name "${CFS_CONFIG_NAME}" --xnames "${XNAME}" --no-enable --no-clear-err
The expected output is:
All components updated successfully.
Set the image in BSS for ncn-m001 by following the Set NCN boot image for ncn-m001 and NCN storage nodes section of the Management nodes rollout stage documentation. Set the IMS_RESULTANT_IMAGE_ID variable to the final_image_id for Management_Master found in the previous step.
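For example, following the placeholder style used elsewhere in this procedure (substitute the actual final_image_id value from the prepare-images artifacts):

IMS_RESULTANT_IMAGE_ID=<final_image_id value for Management_Master>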
(ncn-m002#) Upgrade ncn-m001. This must be executed on ncn-m002.
NOTE If Kubernetes encryption has been enabled via the Kubernetes Encryption Documentation, then back up the /etc/cray/kubernetes/encryption directory on the master node before upgrading and restore the directory after the node has been upgraded.
/usr/share/doc/csm/upgrade/scripts/upgrade/ncn-upgrade-master-nodes.sh ncn-m001
NOTE The /etc/cray/kubernetes/encryption directory should be restored if it was backed up. Once it is restored, the kube-apiserver on the rebuilt node should be restarted. See Kubernetes kube-apiserver Failing for details on how to restart the kube-apiserver.
Follow the steps documented in Stage 1.4 - Upgrade weave and multus.

Follow the steps documented in Stage 1.5 - coredns anti-affinity.
Follow the steps documented in Stage 1.6 - Complete Kubernetes upgrade.
Once this step has completed, the management-nodes-rollout stage is finished. Continue to the next section 4. Restart goss-servers on all NCNs.
3.2 management-nodes-rollout without CSM upgrade

This is the procedure to roll out management nodes if CSM is not being upgraded. NCN worker node images contain kernel module content from non-CSM products and need to be rebuilt as part of the workflow. Unlike NCN worker nodes, NCN master nodes and storage nodes do not contain kernel module content from non-CSM products. However, user-space non-CSM product content is still provided on NCN master nodes and storage nodes, and thus the prepare-images and update-cfs-config stages create a new image and CFS configuration for NCN master nodes and storage nodes. The CFS configuration layers ensure the non-CSM product content is applied correctly for both image customization and node personalization scenarios. As a result, the administrator can update NCN master and storage nodes using CFS configuration only.
Follow these steps to complete the management-nodes-rollout stage.
The “Install and Upgrade Framework” section of each individual product’s installation document may contain special actions that need to be performed outside of IUF for a stage. The “IUF Stage Documentation Per Product” section of the HPE Cray EX System Software Stack Installation and Upgrade Guide for CSM (S-8052) provides a table that summarizes which product documents contain information or actions for the management-nodes-rollout stage. Refer to that table and any corresponding product documents before continuing to the next step.
Rebuild the NCN worker nodes. Follow the procedure in section 3.3 NCN worker nodes and then return to this procedure to complete the next step.
Personalize NCN storage nodes. Follow the procedure in section 3.4 Personalize NCN storage nodes and then return to this procedure to complete the next step.
Personalize NCN master nodes.
(ncn-m#) Get a comma-separated list of the xnames for all NCN master nodes and verify they are correct.
MASTER_XNAMES=$(cray hsm state components list --role Management --subrole Master --type Node --format json | jq -r '.Components | map(.ID) | join(",")')
echo "Master node xnames: ${MASTER_XNAMES}"
Get the CFS configuration created for management nodes during the prepare-images and update-cfs-config stages. Follow the instructions in the prepare-images Artifacts created documentation to get the value for configuration for any image with a configuration_group_name value matching Management_Master, Management_Worker, or Management_Storage (since configuration is the same for all management nodes).
(ncn-m#) Set CFS_CONFIG_NAME to the value for configuration found in the previous step.
CFS_CONFIG_NAME=<appropriate configuration value>
(ncn-m#) Apply the CFS configuration to NCN master nodes.
/usr/share/doc/csm/scripts/operations/configuration/apply_csm_configuration.sh \
--no-config-change --config-name "${CFS_CONFIG_NAME}" --xnames "${MASTER_XNAMES}" --clear-state
The expected output is:
Configuration complete. 3 component(s) completed successfully. 0 component(s) failed.
Once this step has completed, the management-nodes-rollout stage is finished. Continue to the next section 4. Restart goss-servers on all NCNs.
3.3 NCN worker nodes

NCN worker node images contain kernel module content from non-CSM products and need to be rebuilt as part of the workflow. This section describes how to test a new image and CFS configuration on a single canary node (ncn-w001) first before rolling them out to the other NCN worker nodes. Modify the procedure as necessary to accommodate site preferences for rebuilding NCN worker nodes. Since the default node target for the management-nodes-rollout stage is Management_Worker nodes, the --limit-management-rollout argument is not used in the instructions below.
The images and CFS configurations used are created by the prepare-images and update-cfs-config stages respectively; see the prepare-images Artifacts created documentation for details on how to query the images and CFS configurations, and see the update-cfs-config documentation for details about how the CFS configuration is updated.
NOTE The management-nodes-rollout stage creates additional separate Argo workflows when rebuilding NCN worker nodes. The Argo workflow names will include the string ncn-lifecycle-rebuild. If monitoring progress with the Argo UI, remember to include these workflows.
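These workflows can also be listed from the command line; a sketch, assuming Argo runs in the argo namespace as in a default CSM deployment:

kubectl get workflows -n argo | grep ncn-lifecycle-rebuild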
The “Install and Upgrade Framework” section of each individual product’s installation document may contain special actions that need to be performed outside of IUF for a stage. The “IUF Stage Documentation Per Product” section of the HPE Cray EX System Software Stack Installation and Upgrade Guide for CSM (S-8052) provides a table that summarizes which product documents contain information or actions for the management-nodes-rollout stage. Refer to that table and any corresponding product documents before continuing to the next step.
Use kubectl to label all NCN worker nodes but one with iuf-prevent-rollout=true to ensure management-nodes-rollout only rebuilds a single NCN worker node. This node is referred to as the canary node in the remainder of this section, and the steps are documented with ncn-w001 as the canary node.
(ncn-m001#) Label an NCN to prevent it from rebuilding. Replace the example value of ${HOSTNAME} with the appropriate value. Repeat this step for all NCN worker nodes except for the canary node (or use the loop sketch shown after the following commands).
HOSTNAME=ncn-w002
kubectl label nodes "${HOSTNAME}" --overwrite iuf-prevent-rollout=true
(ncn-m001#) Verify the IUF node labels are present on the desired nodes.
kubectl get nodes --show-labels | grep iuf-prevent-rollout
Invoke iuf run with -r to execute the management-nodes-rollout stage on the canary node. This will rebuild the canary node with the new CFS configuration and image built in previous steps of the workflow.
(ncn-m001#) Execute the management-nodes-rollout stage with a single NCN worker node.
iuf -a "${ACTIVITY_NAME}" run -r management-nodes-rollout
Verify the canary node booted successfully with the desired image and CFS configuration.
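A minimal sketch of one possible check, assuming ncn-w001 is the canary node and the cray CLI is authenticated:

kubectl get node ncn-w001
XNAME_W001=$(ssh ncn-w001 'cat /etc/cray/xname')
cray cfs components describe "${XNAME_W001}" --format json | jq '.desiredConfig, .configurationStatus'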
Use kubectl to remove the iuf-prevent-rollout=true label from all NCN worker nodes and apply it to the canary node to prevent it from unnecessarily rebuilding again.
(ncn-m001#) Remove the label from an NCN to allow it to rebuild. Replace the example value of ${HOSTNAME} with the appropriate value. Repeat this step for all NCN worker nodes except for the canary node.
HOSTNAME=ncn-w002
kubectl label nodes "${HOSTNAME}" --overwrite iuf-prevent-rollout-
(ncn-m001#) Label the canary node to prevent it from rebuilding. Replace the example value of ${HOSTNAME} with the hostname of the canary node.
HOSTNAME=ncn-w001
kubectl label nodes "${HOSTNAME}" --overwrite iuf-prevent-rollout=true
Invoke iuf run with -r to execute the management-nodes-rollout stage on all remaining NCN worker nodes. This will rebuild the nodes with the new CFS configuration and image built in previous steps of the workflow.
(ncn-m001#) Execute the management-nodes-rollout stage on all remaining NCN worker nodes.
iuf -a "${ACTIVITY_NAME}" run -r management-nodes-rollout
Use kubectl to remove the iuf-prevent-rollout=true label from the canary node. Replace the example value of ${HOSTNAME} with the hostname of the canary node.
HOSTNAME=ncn-w001
kubectl label nodes "${HOSTNAME}" --overwrite iuf-prevent-rollout-
Once this step has completed, the management-nodes-rollout stage is finished for NCN worker nodes. Return to the procedure that was being followed for management-nodes-rollout to complete the next step, either Management-nodes-rollout with CSM upgrade or Management-nodes-rollout without CSM upgrade.
3.4 Personalize NCN storage nodes

NOTE A customized image is created for NCN storage nodes during the prepare-images stage. For the upgrade from CSM 1.3 to CSM 1.4, that image is the same image that is running on NCN storage nodes, so there is no need to “upgrade” into it. However, if it is desired to roll out the NCN storage nodes with the customized image, this can be done by following Upgrade NCN storage nodes into the customized image. This is not the recommended procedure; it is recommended to personalize the NCN storage nodes by following the steps below.
Personalize NCN storage nodes.
(ncn-m#) Get a comma-separated list of the xnames for all NCN storage nodes and verify they are correct.
STORAGE_XNAMES=$(cray hsm state components list --role Management --subrole Storage --type Node --format json | jq -r '.Components | map(.ID) | join(",")')
echo "Storage node xnames: ${STORAGE_XNAMES}"
Get the CFS configuration created for management nodes during the update-cfs-config stage. Follow the instructions in the prepare-images Artifacts created documentation to get the value for configuration for images with a configuration_group_name value matching Management_Storage. This value will be needed in the following step.
(ncn-m#) Set CFS_CONFIG_NAME to the value for configuration found in the previous step.
CFS_CONFIG_NAME=<appropriate configuration value>
(ncn-m#) Apply the CFS configuration to NCN storage nodes.
/usr/share/doc/csm/scripts/operations/configuration/apply_csm_configuration.sh \
--no-config-change --config-name "${CFS_CONFIG_NAME}" --xnames "${STORAGE_XNAMES}" --clear-state
The expected output is:
Configuration complete. 6 component(s) completed successfully. 0 component(s) failed.
Once this step has completed, return to the procedure that was being followed for management-nodes-rollout to complete the next step, either Management-nodes-rollout with CSM upgrade or Management-nodes-rollout without CSM upgrade.
4. Restart goss-servers on all NCNs

The goss-servers service needs to be restarted on all NCNs. This ensures the correct tests are run on each NCN. The restart is necessary due to a timing issue that is fixed in CSM 1.6.1.
(ncn-m001#) Restart goss-servers.
ncn_nodes=$(grep -oP "(ncn-s\w+|ncn-m\w+|ncn-w\w+)" /etc/hosts | sort -u | tr -t '\n' ',')
ncn_nodes=${ncn_nodes%,}
pdsh -S -b -w $ncn_nodes 'systemctl restart goss-servers'
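To confirm the restarts took effect, the service state can be checked across the same node list; this sketch reuses the ncn_nodes variable set above, and each node should report active:

pdsh -S -b -w $ncn_nodes 'systemctl is-active goss-servers'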
Continue to the next section 5. Update management host Slingshot NIC firmware.

5. Update management host Slingshot NIC firmware

If new Slingshot NIC firmware was provided, refer to the “200Gbps NIC Firmware Management” section of the HPE Slingshot Operations Guide for details on how to update NIC firmware on management nodes. After updating management host Slingshot NIC firmware, all nodes where the firmware was updated must be power cycled. Follow the Reboot NCNs procedure for all nodes where the firmware was updated.
Once this step has completed, continue with the deploy-product and post-install-service-check stages:

If performing an initial install or an upgrade of non-CSM products only, return to the Install or upgrade additional products with IUF workflow to continue the install or upgrade.

If performing an upgrade that includes upgrading CSM, return to the Upgrade CSM and additional products with IUF workflow to continue the upgrade.