This section updates the software running on management NCNs.
Instructions to perform Slingshot switch firmware updates are provided in the “Upgrade Slingshot Switch Firmware in a CSM environment” section of the HPE Slingshot Operations Guide.
Once this step has completed, refer to Update Non-Compute Node (NCN) BIOS and BMC Firmware for details on how to upgrade the firmware on management nodes.
Once the firmware updates have completed, continue to the management-nodes-rollout stage.
This section describes how to update software on management nodes. It describes how to test a new image and CFS configuration on a single node first to ensure they work as expected before rolling the changes out to the other management nodes. This initial test node is referred to as the “canary node”. Modify the procedure as necessary to accommodate site preferences for rebuilding management nodes.
The images and CFS configurations used are created by the prepare-images and update-cfs-config stages, respectively. See the prepare-images Artifacts created documentation for details on how to query the images and CFS configurations, and see the update-cfs-config documentation for details about how the CFS configuration is updated.
NOTE
Additional arguments are available to control the behavior of the management-nodes-rollout stage, for example --limit-management-rollout and -cmrp. See the management-nodes-rollout stage documentation for details and adjust the examples below if necessary.
IMPORTANT
There is a different procedure for management-nodes-rollout
depending on whether or not CSM is being upgraded. The two procedures differ in the handling of NCN storage nodes and NCN master nodes. If CSM is not
being upgraded, then NCN storage nodes and NCN master nodes will not be upgraded with new images and will be updated by the CFS configuration created in update-cfs-config only. If CSM is being
upgraded, the NCN storage nodes and NCN master nodes will be upgraded with new images and the new CFS configuration. Both procedures use the same steps for rebuilding/upgrading NCN worker nodes. Select one of the following
procedures based on whether or not CSM is being upgraded:
management-nodes-rollout with CSM upgrade
All management nodes will be upgraded to a new image because CSM itself is being upgraded. All management nodes excluding ncn-m001 will be upgraded with IUF; ncn-m001 will be upgraded with manual commands.
This section describes how to test a new image and CFS configuration on a single canary node first before rolling it out to the other management nodes of the same management type.
Follow the steps below to upgrade all management nodes.
The “Install and Upgrade Framework” section of each individual product’s installation document may contain special actions that need to be performed outside of IUF for a stage. The “IUF Stage Documentation Per Product”
section of the HPE Cray EX System Software Stack Installation and Upgrade Guide for CSM (S-8052) provides a table that summarizes which product documents contain information or actions for the management-nodes-rollout
stage.
Refer to that table and any corresponding product documents before continuing to the next step.
Perform the NCN storage node upgrades. This upgrades a single storage node first to test the storage node image and then upgrades the remaining storage nodes.
NOTE
The management-nodes-rollout
stage creates additional separate Argo workflows when rebuilding NCN storage nodes. The Argo workflow names will include the string ncn-lifecycle-rebuild
.
If monitoring progress with the Argo UI, remember to include these workflows.
(ncn-m001#
) Execute the management-nodes-rollout
stage with a single NCN storage node.
STORAGE_CANARY=ncn-s001
iuf -a "${ACTIVITY_NAME}" run -r management-nodes-rollout --limit-management-rollout ${STORAGE_CANARY}
(ncn-m#
) Verify that the storage canary node booted successfully with the desired CFS configuration.
XNAME=$(ssh $STORAGE_CANARY 'cat /etc/cray/xname')
echo "${XNAME}"
cray cfs components describe "${XNAME}"
The desired value for configuration_status
is configured
. If it is pending
, then wait for the status to change to configured
.
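Rather than re-running the describe command by hand, the wait can be scripted. The following is a sketch, not part of the documented procedure; it assumes the same cray CLI call shown above plus jq, and polls until the component reports configured or failed:

```shell
# Sketch: poll CFS until the node reports "configured" or "failed".
# Assumes the `cray` CLI and `jq` are available, as in the steps above.
poll_cfs_status() {
  local xname=$1 max_tries=${2:-180} status i
  for ((i = 1; i <= max_tries; i++)); do
    status=$(cray cfs components describe "$xname" --format json | jq -r .configurationStatus)
    case "$status" in
      configured) echo "ok: $xname configured"; return 0 ;;
      failed)     echo "error: $xname failed configuration" >&2; return 1 ;;
      *)          sleep 10 ;;   # pending or unset: keep waiting
    esac
  done
  echo "error: timed out waiting for $xname" >&2
  return 1
}
```

For example, `poll_cfs_status "$XNAME"` waits up to roughly 30 minutes with the default retry count.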
(ncn-m001#
) Upgrade the remaining NCN storage nodes once the first has upgraded successfully. This upgrades NCN storage nodes serially.
Get the list of storage nodes in the cluster and verify that it is correct. The storage canary node should not be in the list since it has already been upgraded.
The list of storage nodes can be manually entered if it is not desired to upgrade all of the remaining storage nodes.
STORAGE_NODES="$(ceph orch host ls | grep ncn-s | grep -v "$STORAGE_CANARY" | awk '{print $1}' | xargs echo)"
echo "$STORAGE_NODES"
iuf -a "${ACTIVITY_NAME}" run -r management-nodes-rollout --limit-management-rollout ${STORAGE_NODES}
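The grep/awk pipeline above can be sanity-checked offline against canned `ceph orch host ls`-style output (the hostnames below are hypothetical):

```shell
# Sketch: verify the node-list filter excludes the canary and keeps
# only storage nodes. The sample output and hostnames are made up.
STORAGE_CANARY=ncn-s001
sample="HOST      ADDR        LABELS  STATUS
ncn-s001  10.252.1.1  _admin
ncn-s002  10.252.1.2
ncn-s003  10.252.1.3"
STORAGE_NODES="$(echo "$sample" | grep ncn-s | grep -v "$STORAGE_CANARY" | awk '{print $1}' | xargs echo)"
echo "$STORAGE_NODES"   # ncn-s002 ncn-s003
```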
(ncn-m001#
) Verify that all storage nodes configured successfully.
for ncn in $(cray hsm state components list --subrole Storage --type Node \
--format json | jq -r .Components[].ID | grep b0n | sort); do cray cfs components describe \
$ncn --format json | jq -r ' .id+" "+.desiredConfig+" status="+.configurationStatus'; done
Perform the NCN master node upgrade on ncn-m002
and ncn-m003
.
NOTE
If Kubernetes encryption has been enabled via the Kubernetes Encryption Documentation, then back up the /etc/cray/kubernetes/encryption directory on the master node before upgrading and restore the directory after the node has been upgraded.
Invoke iuf run
with -r
to execute the management-nodes-rollout
stage on ncn-m002
. This will rebuild ncn-m002
with the new CFS configuration and image built in
previous steps of the workflow.
(ncn-m001#
) Execute the management-nodes-rollout
stage with ncn-m002
.
iuf -a "${ACTIVITY_NAME}" run -r management-nodes-rollout --limit-management-rollout ncn-m002
NOTE
The /etc/cray/kubernetes/encryption directory should be restored if it was backed up. Once it is restored, the kube-apiserver on the rebuilt node should be restarted. See Kubernetes kube-apiserver Failing for details on how to restart the kube-apiserver.
Verify that ncn-m002
booted successfully with the desired image and CFS configuration.
XNAME=$(ssh ncn-m002 'cat /etc/cray/xname')
echo "${XNAME}"
cray cfs components describe "${XNAME}"
Invoke iuf run
with -r
to execute the management-nodes-rollout
stage on ncn-m003
. This will rebuild ncn-m003
with the new CFS configuration and image built in
previous steps of the workflow.
(ncn-m001#
) Execute the management-nodes-rollout
stage with ncn-m003
.
iuf -a "${ACTIVITY_NAME}" run -r management-nodes-rollout --limit-management-rollout ncn-m003
NOTE
The /etc/cray/kubernetes/encryption directory should be restored if it was backed up. Once it is restored, the kube-apiserver on the rebuilt node should be restarted.
Verify that ncn-m003
booted successfully with the desired image and CFS configuration.
XNAME=$(ssh ncn-m003 'cat /etc/cray/xname')
echo "${XNAME}"
cray cfs components describe "${XNAME}"
Perform the NCN worker node upgrade. To upgrade worker nodes, follow the procedure in section 3.3 NCN worker nodes and then return to this procedure to complete the next step.
Upgrade ncn-m001
.
Follow the steps documented in Stage 3.3 - ncn-m001
upgrade.
Stop before performing the specific upgrade ncn-m001
step and return to this document.
Get the image ID and CFS configuration created for NCN master nodes during the prepare-images
and update-cfs-config
stages. Follow the instructions in the
prepare-images
Artifacts created documentation to get the values for final_image_id
and configuration
for images with a configuration_group_name
value matching Management_Master
.
These values will be needed for upgrading ncn-m001
in the following steps.
Set the CFS configuration on ncn-m001
.
(ncn-m#
) Set CFS_CONFIG_NAME
to be the value for configuration
found for Management_Master
nodes in the second step.
CFS_CONFIG_NAME=<appropriate configuration value>
(ncn-m#
) Get the xname of ncn-m001
.
XNAME=$(ssh ncn-m001 'cat /etc/cray/xname')
echo "${XNAME}"
(ncn-m#
) Set the CFS configuration on ncn-m001
.
/usr/share/doc/csm/scripts/operations/configuration/apply_csm_configuration.sh \
--no-config-change --config-name "${CFS_CONFIG_NAME}" --xnames "${XNAME}" --no-enable --no-clear-err
The expected output is:
All components updated successfully.
Set the image in BSS for ncn-m001
by following the Set NCN boot image for ncn-m001
section of the Management nodes rollout stage documentation.
Set the IMS_RESULTANT_IMAGE_ID
variable to the final_image_id
for Management_Master
found in the second step.
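For example (placeholder value, matching the style of the other variable assignments in this document; substitute the actual final_image_id found in the second step):

```shell
IMS_RESULTANT_IMAGE_ID=<final_image_id for Management_Master>
```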
(ncn-m002#
) Upgrade ncn-m001
. This must be executed on ncn-m002
.
NOTE
If Kubernetes encryption has been enabled via the Kubernetes Encryption Documentation, then back up the /etc/cray/kubernetes/encryption directory on the master node before upgrading and restore the directory after the node has been upgraded.
/usr/share/doc/csm/upgrade/scripts/upgrade/ncn-upgrade-master-nodes.sh ncn-m001
NOTE
The /etc/cray/kubernetes/encryption directory should be restored if it was backed up. Once it is restored, the kube-apiserver on the rebuilt node should be restarted. See Kubernetes kube-apiserver Failing for details on how to restart the kube-apiserver.
Follow the steps documented in Stage 3.4 - Upgrade weave and multus.
Follow the steps documented in Stage 3.5 - coredns anti-affinity.
Once the management-nodes-rollout stage has completed, continue to the next section 4. Restart goss-servers on all NCNs.
management-nodes-rollout without CSM upgrade
This is the procedure to roll out management nodes if CSM is not being upgraded. NCN worker node images contain kernel module content from non-CSM products and need to be rebuilt as part of the workflow.
Unlike NCN worker nodes, NCN master nodes and storage nodes do not contain kernel module content from non-CSM products. However, user-space non-CSM product content is still provided on NCN master nodes and storage nodes and thus the prepare-images
and update-cfs-config
stages create a new image and CFS configuration for NCN master nodes and storage nodes. The CFS configuration layers ensure the non-CSM product content is applied correctly for both
image customization and node personalization scenarios. As a result, the administrator
can update NCN master and storage nodes using CFS configuration only.
Follow the steps below to complete the management-nodes-rollout stage.
The “Install and Upgrade Framework” section of each individual product’s installation document may contain special actions that need to be performed outside of IUF for a stage. The “IUF Stage Documentation Per Product”
section of the HPE Cray EX System Software Stack Installation and Upgrade Guide for CSM (S-8052) provides a table that summarizes which product documents contain information or actions for the management-nodes-rollout
stage.
Refer to that table and any corresponding product documents before continuing to the next step.
Rebuild the NCN worker nodes. Follow the procedure in section 3.3 NCN worker nodes and then return to this procedure to complete the next step.
Configure NCN master nodes.
(ncn-m#
) Create a comma-separated list of the xnames for all NCN master nodes and verify they are correct.
MASTER_XNAMES=$(cray hsm state components list --role Management --subrole Master --type Node --format json | jq -r '.Components | map(.ID) | join(",")')
echo "Master node xnames: $MASTER_XNAMES"
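The jq filter joins the component IDs with commas; the resulting shape can be checked with hypothetical xnames:

```shell
# Sketch: the comma-joined format the jq filter produces, shown with
# hypothetical xnames rather than live HSM data.
ids="x3000c0s1b0n0 x3000c0s3b0n0 x3000c0s5b0n0"
MASTER_XNAMES=$(echo "$ids" | tr ' ' ',')
echo "Master node xnames: $MASTER_XNAMES"
```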
Get the CFS configuration created for management nodes during the prepare-images
and update-cfs-config
stages. Follow the instructions in the prepare-images
Artifacts created
documentation to get the value for configuration
for the image with a configuration_group_name
value matching Management_Master
.
(ncn-m#
) Set CFS_CONFIG_NAME
to the value for configuration
found in the previous step.
CFS_CONFIG_NAME=<appropriate configuration value>
(ncn-m#
) Apply the CFS configuration to NCN master nodes.
/usr/share/doc/csm/scripts/operations/configuration/apply_csm_configuration.sh \
--no-config-change --config-name "${CFS_CONFIG_NAME}" --xnames $MASTER_XNAMES --clear-state
Sample output for configuring multiple management nodes is:
Taking snapshot of existing management-23.11.0 configuration to /root/apply_csm_configuration.20240305_173700.vKxhqC backup-management-23.11.0.json
Setting desired configuration, clearing state, clearing error count, enabling components in CFS
desiredConfig = "management-23.11.0"
enabled = true
errorCount = 0
id = "x3700c0s16b0n0"
state = []
[tags]
desiredConfig = "management-23.11.0"
enabled = true
errorCount = 0
id = "x3701c0s16b0n0"
state = []
[tags]
desiredConfig = "management-23.11.0"
enabled = true
errorCount = 0
id = "x3702c0s16b0n0"
state = []
[tags]
Waiting for configuration to complete. 3 components remaining.
Configuration complete. 3 component(s) completed successfully. 0 component(s) failed.
Configure NCN storage nodes.
(ncn-m#
) Create a comma-separated list of the xnames for all NCN storage nodes and verify they are correct.
STORAGE_XNAMES=$(cray hsm state components list --role Management --subrole Storage --type Node --format json | jq -r '.Components | map(.ID) | join(",")')
echo "Storage node xnames: $STORAGE_XNAMES"
Get the CFS configuration created for management storage nodes during the prepare-images
and update-cfs-config
stages. Follow the instructions in the prepare-images
Artifacts created
documentation to get the value for configuration
for the image with a configuration_group_name
value matching Management_Storage
.
(ncn-m#
) Set CFS_CONFIG_NAME
to the value for configuration
found in the previous step.
CFS_CONFIG_NAME=<appropriate configuration value>
(ncn-m#
) Apply the CFS configuration to NCN storage nodes.
/usr/share/doc/csm/scripts/operations/configuration/apply_csm_configuration.sh \
--no-config-change --config-name "${CFS_CONFIG_NAME}" --xnames $STORAGE_XNAMES --clear-state
Sample output for configuring multiple management nodes is:
Taking snapshot of existing minimal-management-23.11.0 configuration to /root/apply_csm_configuration.20240305_173700.vKxhqC backup-minimal-management-23.11.0.json
Setting desired configuration, clearing state, clearing error count, enabling components in CFS
desiredConfig = "minimal-management-23.11.0"
enabled = true
errorCount = 0
id = "x3700c0s16b0n0"
state = []
[tags]
desiredConfig = "minimal-management-23.11.0"
enabled = true
errorCount = 0
id = "x3701c0s16b0n0"
state = []
[tags]
desiredConfig = "minimal-management-23.11.0"
enabled = true
errorCount = 0
id = "x3702c0s16b0n0"
state = []
[tags]
Waiting for configuration to complete. 3 components remaining.
Configuration complete. 3 component(s) completed successfully. 0 component(s) failed.
Once the management-nodes-rollout stage has completed, continue to the next section 4. Restart goss-servers on all NCNs.
NCN worker nodes
NCN worker node images contain kernel module content from non-CSM products and need to be rebuilt as part of the workflow. This section describes how to test a new image and CFS configuration on a single canary node (ncn-w001) first before rolling it out to the other NCN worker nodes. Modify the procedure as necessary to accommodate site preferences for rebuilding NCN worker nodes.
The images and CFS configurations used are created by the prepare-images
and update-cfs-config
stages respectively; see the prepare-images
Artifacts created documentation
for details on how to query the images and CFS configurations and see the update-cfs-config documentation for details about how the CFS configuration is updated.
NOTE
The management-nodes-rollout
stage creates additional separate Argo workflows when rebuilding NCN worker nodes. The Argo workflow names will include the string ncn-lifecycle-rebuild
. If monitoring progress with the Argo UI,
remember to include these workflows.
NOTE
If upgrading from CSM 1.4 to CSM 1.5 with a COS release prior to 2.5.146 currently installed, a workaround is needed to roll out the management nodes. See the later subsection 3.3.1 DVS workaround upgrading from COS prior to
2.5.146. If the installed COS version is 2.5.146 or later, this is not needed.
The “Install and Upgrade Framework” section of each individual product’s installation document may contain special actions that need to be performed outside of IUF for a stage. The “IUF Stage Documentation Per Product”
section of the HPE Cray EX System Software Stack Installation and Upgrade Guide for CSM (S-8052) provides a table that summarizes which product documents contain information or actions for the management-nodes-rollout
stage.
Refer to that table and any corresponding product documents before continuing to the next step.
(ncn-m001#
) Execute the management-nodes-rollout
stage with a single NCN worker node.
This will rebuild the canary node with the new CFS configuration and image built in previous steps of the workflow.
The worker canary node can be any worker node and does not have to be ncn-w001
.
WORKER_CANARY=ncn-w001
iuf -a "${ACTIVITY_NAME}" run -r management-nodes-rollout --limit-management-rollout ${WORKER_CANARY}
Verify the canary node booted successfully with the desired image and CFS configuration.
XNAME=$(ssh $WORKER_CANARY 'cat /etc/cray/xname')
echo "${XNAME}"
cray cfs components describe "${XNAME}"
(ncn-m001#
) Use kubectl
to apply the iuf-prevent-rollout=true
label to the canary node to prevent it from unnecessarily rebuilding again.
kubectl label nodes "${WORKER_CANARY}" --overwrite iuf-prevent-rollout=true
(ncn-m001#
) Verify the IUF node labels are present on the desired node.
kubectl get nodes --show-labels | grep iuf-prevent-rollout
(ncn-m001#
) Execute the management-nodes-rollout
stage on all remaining worker nodes.
NOTE
For this step, the argument to --limit-management-rollout
can be Management_Worker
or a list of worker
node names separated by spaces. If Management_Worker
is supplied, all worker nodes that are not labeled
with iuf-prevent-rollout=true
will be rebuilt/upgraded. If a list of worker node names is supplied, then those worker nodes will be rebuilt/upgraded.
Choose one of the following two options. The difference between the options is the limit-management-rollout
argument, but the two options do the same thing.
(ncn-m001#
) Execute management-nodes-rollout
on all Management_Worker
nodes.
iuf -a "${ACTIVITY_NAME}" run -r management-nodes-rollout --limit-management-rollout Management_Worker
(ncn-m001#
) Execute management-nodes-rollout
on a group of worker nodes. The list of worker nodes can be manually edited if it is undesirable to rebuild/upgrade all of the workers with one execution.
WORKER_NODES=$(kubectl get node | grep -P 'ncn-w\d+' | grep -v $WORKER_CANARY | awk '{print $1}' | xargs)
echo $WORKER_NODES
iuf -a "${ACTIVITY_NAME}" run -r management-nodes-rollout --limit-management-rollout $WORKER_NODES
(ncn-m001#
) Use kubectl
to remove the iuf-prevent-rollout=true
label from the canary node.
kubectl label nodes "${WORKER_CANARY}" --overwrite iuf-prevent-rollout-
(ncn-m001#
) Verify that all worker nodes configured successfully.
for ncn in $(cray hsm state components list --subrole Worker --type Node \
--format json | jq -r .Components[].ID | grep b0n | sort); do cray cfs components describe \
$ncn --format json | jq -r ' .id+" "+.desiredConfig+" status="+.configurationStatus'; done
Once the management-nodes-rollout stage has completed, return to the procedure that was being followed to complete the next step: either Management-nodes-rollout with CSM upgrade or Management-nodes-rollout without CSM upgrade.
DVS workaround upgrading from COS prior to 2.5.146
If COS prior to 2.5.146 is installed before upgrading to CSM 1.5, the management rollout in this step may hang. The workaround is to copy the new version of the DVS prechecks_for_worker_reboots script to all NCN worker nodes as /opt/cray/shasta/cos/bin/prechecks_for_worker_reboots. The workaround is run on the ncn-m001 node during step 3.3 NCN worker nodes.
The new version of the script may be found in the cray-dvs-csm RPM in the USS CSM tar file in the upgrade’s media directory. Extract the script from the RPM to a temporary directory and then copy it to the worker nodes.
The script should be copied to the canary node when that node is being rebuilt, and to the remaining worker nodes after the canary node boot has succeeded.
(ncn-m001#
) Set an environment variable to the media directory, if not already set.
echo $MEDIA_DIR
MEDIA_DIR=/etc/cray/upgrade/csm/media/<directory>
(ncn-m001#
) Optionally, create and cd to a temporary directory in which to extract the new version of the script.
mkdir /tmp/upgrade-prechecks_WAR
cd /tmp/upgrade-prechecks_WAR
(ncn-m001#
) Extract the cray-dvs-csm
rpm that’s included in the USS image:
rpm2cpio < $MEDIA_DIR/uss-*-csm-1.5/rpms/uss-*-csm-1.5/x86_64/cray-dvs-csm-*.x86_64.rpm | cpio -i --make-directories --no-absolute-filenames
(ncn-m001#
) Install the new version of the script onto all of the worker nodes. This is one way to do that:
SSH_OPTIONS='-o StrictHostKeyChecking=no -o ConnectTimeout=15 -o LogLevel=ERROR -o UserKnownHostsFile=/dev/null'
for name in $(kubectl get node | grep -P 'ncn-w\d+' | awk '{print $1}'); do
scp -p $SSH_OPTIONS opt/cray/shasta/cne/bin/prechecks_for_worker_reboots $name:/opt/cray/shasta/cos/bin/prechecks_for_worker_reboots
done
(ncn-m001#
) Optionally, remove the temporary directory.
cd ..
rm -rf upgrade-prechecks_WAR
After completing this workaround, return to 3.3 NCN worker nodes to roll out worker nodes.
4. Restart goss-servers on all NCNs
The goss-servers service needs to be restarted on all NCNs. This ensures the correct tests are run on each NCN. The restart is necessary due to a timing issue that is fixed in CSM 1.6.1.
(ncn-m001#
) Restart goss-servers
.
ncn_nodes=$(grep -oP "(ncn-s\w+|ncn-m\w+|ncn-w\w+)" /etc/hosts | sort -u | tr -t '\n' ',')
ncn_nodes=${ncn_nodes%,}
pdsh -S -b -w $ncn_nodes 'systemctl restart goss-servers'
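The node-list construction above can be checked offline against a sample /etc/hosts fragment (the addresses and hostnames below are hypothetical):

```shell
# Sketch: verify the grep/sort/tr pipeline produces a comma-separated,
# de-duplicated NCN list. The sample /etc/hosts content is made up.
hosts_sample="10.252.1.4 ncn-m001 ncn-m001.nmn
10.252.1.5 ncn-m002
10.252.1.6 ncn-w001
10.252.1.7 ncn-s001"
ncn_nodes=$(echo "$hosts_sample" | grep -oP "(ncn-s\w+|ncn-m\w+|ncn-w\w+)" | sort -u | tr '\n' ',')
ncn_nodes=${ncn_nodes%,}
echo "$ncn_nodes"   # ncn-m001,ncn-m002,ncn-s001,ncn-w001
```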
Continue to the next section 5. Update ceph node-exporter config for SNMP counters.
5. Update ceph node-exporter config for SNMP counters
This step is optional. It uses the netstat collector from node-exporter and enables monitoring of all the SNMP counters in /proc/net/snmp on NCN nodes.
See Update ceph node-exporter configuration to update the ceph node-exporter configuration to monitor SNMP counters.
Continue to the next section 6. Update management host Slingshot NIC firmware.
6. Update management host Slingshot NIC firmware
If new Slingshot NIC firmware was provided, refer to the “200Gbps NIC Firmware Management” section of the HPE Slingshot Operations Guide for details on how to update NIC firmware on management nodes.
After updating management host Slingshot NIC firmware, all nodes where the firmware was updated must be power cycled. Follow the reboot NCNs procedure for all nodes where the firmware was updated.
Once this step has completed, continue to the deploy-product and post-install-service-check stages.
If performing an initial install or an upgrade of non-CSM products only, return to the Install or upgrade additional products with IUF workflow to continue the install or upgrade.
If performing an upgrade that includes upgrading CSM, return to the Upgrade CSM and additional products with IUF workflow to continue the upgrade.