This section updates the software running on managed compute and application (UAN, etc.) nodes.
Refer to Update Firmware with FAS for details on how to upgrade the firmware on managed nodes.
Once this step has completed:
managed-nodes-rollout stage
This section describes how to update software on managed nodes. It describes how to test a new image and CFS
configuration on a single “canary node” before rolling it out to the other managed nodes. Modify the procedure as
necessary to accommodate site preferences for rebooting managed nodes. If the system has heterogeneous nodes, it may be
desirable to repeat this process with multiple canary nodes, one for each distinct node configuration. The images, CFS
configurations, and BOS session templates used are created by the prepare-images stage; see the prepare-images
Artifacts created documentation for details on how to query the images and CFS configurations.
NOTE
Additional arguments are available to control the behavior of the managed-nodes-rollout stage. See the
managed-nodes-rollout stage documentation for details and adjust the examples below if necessary.
LNet router nodes or gateway nodes should be upgraded before rebooting compute nodes to new images and CFS configurations. Since LNet routers and gateway nodes are examples of application nodes, the instructions in this section are the same as in 2.3 Application nodes.
Since LNet router nodes and gateway nodes are not managed by workload managers, the IUF managed-nodes-rollout stage
cannot reboot them in a controlled manner via the -mrs stage argument. The IUF managed-nodes-rollout stage can
reboot LNet router and gateway nodes using the -mrs reboot argument, but an immediate reboot of the nodes is likely to
be disruptive to users and overall system health and is not recommended. Administrators should determine the best
approach for rebooting LNet router and gateway nodes outside of IUF that aligns with site preferences.
Once this step has completed:
managed-nodes-rollout stage if IUF managed-nodes-rollout procedures were used to perform the reboots
The “Install and Upgrade Framework” section of each individual product’s installation document may contain special
actions that need to be performed outside of IUF for a stage. The “IUF Stage Documentation Per Product”
section of the HPE Cray EX System Software Stack Installation and Upgrade Guide for CSM (S-8052) provides a table
that summarizes which product documents contain information or actions for the managed-nodes-rollout stage. Refer
to that table and any corresponding product documents before continuing to the next step.
Before booting compute nodes, consider whether SMA OpenSearch needs to be tuned.
Refer to the “Configure OpenSearch” section in the HPE Cray EX System Monitoring Application Administration Guide for instructions on tuning OpenSearch.
Invoke iuf run with -r to execute the managed-nodes-rollout stage on a
single node to ensure the node reboots successfully with the desired image and CFS configuration. This node is
referred to as the “canary node” in the remainder of this section. Use --limit-managed-rollout to target the canary
node only and use -mrs reboot to reboot the canary node immediately.
(ncn-m001#) Execute the managed-nodes-rollout stage with a single xname, rebooting the canary node immediately.
Replace the example value of ${XNAME} with the xname of the canary node.
XNAME=x3000c0s29b1n0
iuf -a "${ACTIVITY_NAME}" run -r managed-nodes-rollout --limit-managed-rollout "${XNAME}" -mrs reboot
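To catch a mistyped xname before the stage is run, the value can be checked against the expected form first. The following is a minimal sketch and not part of IUF: `is_node_xname` and `canary_rollout_cmd` are hypothetical helper names, and the pattern assumes that node xnames follow the same form as the example value `x3000c0s29b1n0`. The helper only prints the command from this step so that it can be reviewed before it is run.

```shell
# Hypothetical helpers (not part of IUF): validate a node xname and print
# the canary rollout command for review instead of running it.

# Return 0 if $1 looks like a node xname such as x3000c0s29b1n0.
# Assumption: node xnames follow the pattern of the example value.
is_node_xname() {
    printf '%s\n' "$1" | grep -Eqx 'x[0-9]+c[0-9]+s[0-9]+b[0-9]+n[0-9]+'
}

# Print the iuf command from this procedure for activity $1 and xname $2.
canary_rollout_cmd() {
    if ! is_node_xname "$2"; then
        echo "error: '$2' does not look like a node xname" >&2
        return 1
    fi
    printf 'iuf -a "%s" run -r managed-nodes-rollout --limit-managed-rollout "%s" -mrs reboot\n' \
        "$1" "$2"
}
```

For example, `canary_rollout_cmd "${ACTIVITY_NAME}" "${XNAME}"` prints the command shown above when `${XNAME}` is well formed, and returns a non-zero status otherwise.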
Verify the canary node booted successfully with the desired image and CFS configuration.
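One way to make part of this check repeatable is to inspect the component record that CFS keeps for the canary node. The sketch below is an illustration under assumptions, not a documented verification procedure. It assumes the record is available as JSON (for example from `cray cfs components describe "${XNAME}" --format json`, if that command is available on the system) and that it carries a `configurationStatus` or `configuration_status` field whose value is `configured` once configuration has succeeded. `cfs_status_from_json` is a hypothetical helper name.

```shell
# Hypothetical helper: read a CFS component record (JSON) on stdin and print
# the value of its configuration status field, or nothing if it is absent.
# Assumption: the field is named configurationStatus or configuration_status.
cfs_status_from_json() {
    sed -n -E 's/.*"configuration(Status|_status)"[[:space:]]*:[[:space:]]*"([^"]*)".*/\2/p' |
        head -n 1
}
```

Piping the JSON record into `cfs_status_from_json` and comparing the result to `configured` gives a simple pass or fail signal for CFS. It does not confirm which image the node booted, so that still needs to be checked separately.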
Invoke iuf run with -r to execute the managed-nodes-rollout stage on all
nodes. This will stage data to BOS and allow the workload manager to reboot the nodes when it is ready to do so. The
workload manager must be configured to tell BOS to reboot nodes using this staged data.
If PBS is the workload manager:
Create a maintenance reservation in PBS. For more information on maintenance reservations, see the PBS Professional Administrator’s Guide.
(ncn-m001#) After the reservation has started, execute the managed-nodes-rollout stage with the -mrs reboot option to
immediately reboot all compute nodes.
iuf -a "${ACTIVITY_NAME}" run -r managed-nodes-rollout -mrs reboot
If Slurm is the workload manager:
Follow the instructions in the section Using staged sessions with Slurm of the Rolling Upgrades using BOS documentation. Those instructions
describe two parameters that must be set in the slurm.conf file. Return to this procedure after setting them.
(ncn-m001#) Execute the managed-nodes-rollout stage. If an immediate reboot of compute nodes is desired instead,
add -mrs reboot to the iuf run command.
iuf -a "${ACTIVITY_NAME}" run -r managed-nodes-rollout
NOTE: If the -mrs reboot option is used with Slurm, skip the following step.
Tell Slurm to reboot the compute nodes. This only works for compute nodes, and they must be specified explicitly.
Use the following commands to list all of the compute nodes in the system using their Node Identities (NIDs). First, enter the
SAT bash shell using sat bash.
(ncn-m001#) Enter the SAT container and fetch the compute node list.
sat bash
(ef637ae8a8b5) sat-container:/sat/share # sat status --fields xname --filter role=compute --no-headings --no-borders | xargs sat xname2nid
nid[000001-000004]
(ef637ae8a8b5) sat-container:/sat/share # exit
logout
Now, tell the workload manager to reboot the compute nodes. Paste the output from the previous step as the last argument.
(compute#) A sample reboot command to reboot NIDs 1 through 4.
scontrol reboot nextstate=Resume Reason="IUF Managed Nodes Rollout" nid[000001-000004]
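Because the NID list is pasted by hand from one command into another, a basic format check before handing it to Slurm can catch copy errors. The following is a minimal sketch: `slurm_reboot_cmd` is a hypothetical helper name, and the accepted pattern (`nid` followed by digits, or by a bracketed list of digits, commas, and hyphens) is an assumption based on the sample output `nid[000001-000004]`. It prints the scontrol command from this step for review rather than running it.

```shell
# Hypothetical helper: print the scontrol reboot command used in this
# procedure for the NID list in $1, after a basic format check.
# Assumption: NID lists look like nid000001 or nid[000001-000004,000007].
slurm_reboot_cmd() {
    if ! printf '%s\n' "$1" | grep -Eqx 'nid([0-9]+|\[[0-9,-]+\])'; then
        echo "error: unexpected NID list: '$1'" >&2
        return 1
    fi
    printf 'scontrol reboot nextstate=Resume Reason="IUF Managed Nodes Rollout" %s\n' "$1"
}
```

If the NID list on a given system takes a different form (for example, several comma-separated expressions), the pattern would need to be widened accordingly.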
Once this step has completed:
managed-nodes-rollout stage
NOTE
If LNet router or gateway nodes were upgraded in the 2.1 LNet router nodes and gateway nodes section, there is no need to
upgrade them again in this section. Follow the instructions in this section to upgrade any remaining application nodes (UANs,
etc.) that have not been upgraded yet.
Since application nodes are not managed by workload managers, the IUF managed-nodes-rollout stage cannot reboot them
in a controlled manner via the -mrs stage argument. The IUF managed-nodes-rollout stage can reboot application nodes
using the -mrs reboot argument, but an immediate reboot of application nodes is likely to be disruptive to users and
overall system health and is not recommended. Administrators should determine the best approach for rebooting
application nodes outside of IUF that aligns with site preferences.
Once this step has completed:
managed-nodes-rollout stage if IUF managed-nodes-rollout procedures were used to perform the reboots
If new Slingshot NIC firmware was provided, refer to the “200Gbps NIC Firmware Management” section of the HPE Slingshot Operations Guide for details on how to update NIC firmware on managed nodes.
Once this step has completed:
post-install-check stage
The “Install and Upgrade Framework” section of each individual product’s installation document may contain special
actions that need to be performed outside of IUF for a stage. The “IUF Stage Documentation Per Product”
section of the HPE Cray EX System Software Stack Installation and Upgrade Guide for CSM (S-8052) provides a table
that summarizes which product documents contain information or actions for the post-install-check stage. Refer to
that table and any corresponding product documents before continuing to the next step.
Invoke iuf run with -r to execute the post-install-check stage.
(ncn-m001#) Execute the post-install-check stage.
iuf -a "${ACTIVITY_NAME}" run -r post-install-check
Once this step has completed:
post-install-check stage to verify product software is executing as expected
If performing an initial install or an upgrade of non-CSM products only, return to the Install or upgrade additional products with IUF workflow to continue the install or upgrade.
If performing an upgrade that includes upgrading CSM, return to the Upgrade CSM and additional products with IUF workflow to continue the upgrade.