This section updates the software running on managed compute and application (UAN, etc.) nodes.
Refer to Update Firmware with FAS for details on how to upgrade the firmware on managed nodes using FAS.
Once this step has completed:
OpenSearch
When moving from CSM 1.3/SMA 1.7 to CSM 1.4/SMA 1.8, or when performing a fresh install of a
CSM 1.4/SMA 1.8 system, OpenSearch tuning needs to be considered.
During an upgrade of SMA, OpenSearch replaces ElasticSearch;
any ElasticSearch tuning that was used previously will not be relevant.
Before booting compute nodes, refer to the “ConfigureOpenSearch” section in the
HPE Cray EX System Monitoring Application Administration Guide for instructions on tuning OpenSearch.
managed-nodes-rollout stage
This section describes how to update software on managed nodes. It describes how to test a new IMS image
and CFS configuration on a single “canary node” first before rolling it out to the other
managed nodes. Modify the procedure as necessary to accommodate site preferences for rebooting managed nodes.
If the system has heterogeneous nodes, it may be desirable to repeat this process with multiple canary nodes,
one for each distinct node configuration. The IMS images, CFS configurations, and
BOS session templates used are created by the prepare-images stage; see the
prepare-images Artifacts created documentation for details on
how to query the IMS images and CFS configurations.
NOTE Additional arguments are available to control the behavior of the managed-nodes-rollout stage. See
the managed-nodes-rollout stage documentation for details and adjust the
examples below if necessary.
LNet router nodes or gateway nodes should be upgraded before rebooting compute nodes to new IMS images and CFS configurations. Because LNet routers and gateway nodes are examples of application nodes, the instructions in this section are the same as in 3.3 Application nodes.
Because LNet router nodes and gateway nodes are not managed by workload managers, the IUF
managed-nodes-rollout stage cannot reboot them in a controlled manner via the -mrs stage argument. The
IUF managed-nodes-rollout stage can reboot LNet router and gateway nodes using the -mrs reboot argument;
however, this is not recommended, because an immediate reboot of the nodes is likely to be disruptive to users and
overall system health. Administrators should determine the best approach for rebooting LNet router and gateway nodes
outside of IUF that aligns with site preferences.
Once this step has completed:
managed-nodes-rollout stage if IUF
managed-nodes-rollout procedures were used to perform the reboots.
The “Install and Upgrade Framework” section of each individual product’s installation document may contain special
actions that need to be performed outside of IUF for a stage. The “IUF Stage Documentation Per Product”
section of the HPE Cray EX System Software Stack Installation and Upgrade Guide for CSM (S-8052) provides a table
that summarizes which product documents contain information or actions for the managed-nodes-rollout stage. Refer
to that table and any corresponding product documents before continuing to the next step.
Invoke iuf run with -r to execute the managed-nodes-rollout stage on a
single node to ensure the node reboots successfully with the desired IMS image and CFS
configuration. This node is referred to as the “canary node” in the remainder of this section. Use
--limit-managed-rollout to target the canary node only and use -mrs reboot to reboot the canary node immediately.
(ncn-m001#) Execute the managed-nodes-rollout stage with a single xname, rebooting the canary node
immediately. Replace the example value of ${XNAME} with the xname of the canary node.
XNAME=x3000c0s29b1n0
iuf -a "${ACTIVITY_NAME}" run -r managed-nodes-rollout --limit-managed-rollout "${XNAME}" -mrs reboot
Verify that the canary node booted successfully with the desired IMS image and CFS configuration.
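On a system with heterogeneous nodes, the canary step above is repeated once per distinct node configuration. The following is a minimal dry-run sketch of that loop, not part of IUF itself: it only prints the iuf command for each canary so the commands can be reviewed before being run. The helper names (is_node_xname, canary_command), the second xname, the default activity name, and the xname pattern (inferred from the example x3000c0s29b1n0) are all assumptions made for illustration.

```bash
#!/bin/sh
# Dry-run sketch: print (do not run) one canary rollout command per xname.
# The activity name default and the xname list are placeholder values.

# Returns 0 if $1 looks like a node xname such as x3000c0s29b1n0.
# The pattern is an assumption inferred from the example xname.
is_node_xname() {
    printf '%s\n' "$1" | grep -Eq '^x[0-9]+c[0-9]+s[0-9]+b[0-9]+n[0-9]+$'
}

# Prints the iuf command for one canary node ($1 = activity, $2 = xname).
# Returns 1 without printing a command if the xname looks invalid.
canary_command() {
    if ! is_node_xname "$2"; then
        echo "skipping invalid xname: $2" >&2
        return 1
    fi
    printf 'iuf -a "%s" run -r managed-nodes-rollout --limit-managed-rollout "%s" -mrs reboot\n' "$1" "$2"
}

ACTIVITY_NAME="${ACTIVITY_NAME:-example-activity}"
# One hypothetical canary per distinct node configuration.
for xname in x3000c0s29b1n0 x3000c0s31b2n0; do
    canary_command "$ACTIVITY_NAME" "$xname"
done
```

Printing rather than executing keeps the sketch safe to run anywhere. Each printed command should still be run one canary at a time, verifying the node before moving on to the next.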
Configure Slurm to stage data in BOS.
Follow the instructions in the section
Using staged sessions with Slurm
of the Rolling Upgrades using BOS documentation. These instructions
describe two parameters that must be set in the slurm.conf file. Return to these instructions after setting them.
(ncn-m001#) Execute the managed-nodes-rollout stage.
This will not reboot the nodes immediately; instead, it will stage data to BOS and allow the workload manager to reboot the nodes when it is ready to do so.
If an immediate reboot of compute nodes is desired instead, add -mrs reboot to the iuf run command.
NOTE: If the -mrs reboot option is used with Slurm, then skip the remaining steps in this section and proceed
directly to 3.2.3 Compute reboot complete.
iuf -a "${ACTIVITY_NAME}" run -r managed-nodes-rollout
List all of the compute nodes in the system by their node IDs (NIDs).
NOTE: If the -mrs reboot option was used with Slurm in the previous step, then skip the remaining steps in this section and proceed directly to 3.2.3 Compute reboot complete.
Tell Slurm to reboot the compute nodes.
This only works for compute nodes, and they must be specified explicitly. Using the keyword ALL to specify the nodes does not work.
Use the compute list output from the previous step as the last argument.
(compute#) A sample reboot command to reboot NIDs 1 through 4.
scontrol reboot nextstate=Resume Reason="IUF Managed Nodes Rollout" nid[000001-000004]
Proceed to 3.2.3 Compute reboot complete.
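Because the keyword ALL cannot be used, the node list has to be written out as an explicit hostlist expression. The sketch below shows one way to build such an expression for a contiguous NID range and to print the resulting reboot command for review. The helper names (nid_range, reboot_command) are hypothetical, and the "nid" prefix with six-digit zero padding is assumed from the sample command above; sites with a different node naming scheme would need to adjust it.

```bash
#!/bin/sh
# Sketch: build a Slurm hostlist for a contiguous NID range and print
# (do not run) the matching reboot command.

# Prints a hostlist such as nid[000001-000004] for first NID $1 and last NID $2.
# Pass plain decimal numbers without leading zeros (printf reads 08 as octal).
nid_range() {
    printf 'nid[%06d-%06d]\n' "$1" "$2"
}

# Prints the scontrol reboot command from the sample for the given NID range.
reboot_command() {
    printf 'scontrol reboot nextstate=Resume Reason="IUF Managed Nodes Rollout" %s\n' \
        "$(nid_range "$1" "$2")"
}

reboot_command 1 4
```

For non-contiguous NIDs, the list printed by the previous step can be used directly as the last argument instead.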
WARNING If any compute nodes are disabled when the
managed-nodes-rollout stage is run, then the stage will appear to hang because it incorrectly waits for the disabled
nodes to complete as well. After 100 minutes, it will time out. If a timeout occurs and the system has disabled compute
nodes, then this is the most likely explanation. In that case, the timeout may be ignored; the enabled
compute nodes have successfully completed the rollout and are usable immediately.
To check whether the timeout was indeed caused by waiting on disabled nodes, follow this procedure:
(ncn-m001#) Enter the SAT container.
sat bash
(sat-container#) Determine the percentage of compute nodes that are expected to complete.
# Count all compute nodes, then only the enabled ones.
TotalComputes="$(sat status --filter role=Compute --no-borders --no-headings --fields xname | xargs | wc -w)"
TotalEnabledComputes="$(sat status --filter enabled=True --filter role=Compute --no-borders --no-headings --fields xname | xargs | wc -w)"
# Enabled nodes as a percentage of all compute nodes.
ExpectedPercentage=$(bc -l <<< "${TotalEnabledComputes}/${TotalComputes}*100")
echo "Expected Percentage: ${ExpectedPercentage}"
(sat-container#) Exit the SAT container.
exit
Check the managed-nodes-rollout logs to see what percentage of nodes actually completed.
Verify that the percentage complete equals the expected percentage determined in the previous step.
If the two percentages are equal, then ignore the timeout and proceed to 3.2.3 Compute reboot complete.
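The expected percentage computed with bc -l carries many decimal places (for example, 1 of 3 nodes yields 33.33333333333333333300), so it may not match the figure in the logs character for character. The following sketch shows one way to compare the two values numerically with a small tolerance. The helper names, the default tolerance of 0.5, and the sample counts are assumptions for illustration; the percentage reported in the logs has to be read from the logs manually.

```bash
#!/bin/sh
# Sketch: compute and compare rollout percentages numerically.

# Prints enabled ($1) over total ($2) as a percentage with two decimals.
# Returns 1 and prints nothing if the total is zero or negative.
expected_percentage() {
    awk -v e="$1" -v t="$2" 'BEGIN { if (t <= 0) exit 1; printf "%.2f\n", e / t * 100 }'
}

# Returns 0 if percentages $1 and $2 differ by no more than $3 (default 0.5).
percent_matches() {
    awk -v a="$1" -v b="$2" -v tol="${3:-0.5}" \
        'BEGIN { d = a - b; if (d < 0) d = -d; exit !(d <= tol) }'
}

# Illustrative values: 3 of 4 compute nodes enabled, logs report 75 percent.
expected="$(expected_percentage 3 4)"
if percent_matches "$expected" 75; then
    echo "Percentages match: the timeout is explained by disabled nodes."
else
    echo "Percentages differ: investigate the rollout further."
fi
```

A tolerance is used because the two figures may be rounded differently; tighten or loosen it to suit the precision shown in the logs.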
Once this step has completed:
managed-nodes-rollout stage.
NOTE If LNet router or gateway nodes were upgraded in the 3.1 LNet router nodes and gateway nodes section, there is no need to upgrade them again in this section. Follow the instructions in this section to upgrade any remaining application nodes (UANs, etc.) that have not been upgraded yet.
Because application nodes are not managed by workload managers, the IUF managed-nodes-rollout stage
cannot reboot them in a controlled manner via the -mrs stage argument. The IUF managed-nodes-rollout stage
can reboot application nodes using the -mrs reboot argument; however, this is not recommended, because an
immediate reboot of the nodes is likely to be disruptive to users and overall system health. Administrators should
determine the best approach for rebooting application nodes outside of IUF that aligns with site preferences.
Once this step has completed:
managed-nodes-rollout stage if IUF managed-nodes-rollout
procedures were used to perform the reboots.
If new Slingshot NIC firmware was provided, refer to the “200Gbps NIC Firmware Management” section of the HPE Slingshot Operations Guide for details on how to update NIC firmware on managed nodes.
Once this step has completed:
post-install-check stage
The “Install and Upgrade Framework” section of each individual product’s installation document may contain special
actions that need to be performed outside of IUF for a stage. The “IUF Stage Documentation Per Product”
section of the HPE Cray EX System Software Stack Installation and Upgrade Guide for CSM (S-8052) provides a table
that summarizes which product documents contain information or actions for the post-install-check stage. Refer to
that table and any corresponding product documents before continuing to the next step.
Invoke iuf run with -r to execute the post-install-check stage.
(ncn-m001#) Execute the post-install-check stage.
iuf -a "${ACTIVITY_NAME}" run -r post-install-check
Once this step has completed:
post-install-check stage to verify product software is executing as
expected.
If performing an initial install or an upgrade of non-CSM products only, return to the Install or upgrade additional products with IUF workflow to continue the install or upgrade.
If performing an upgrade that includes upgrading CSM with IUF, return to the Upgrade CSM and additional products with IUF workflow to continue the upgrade.