Rolling Upgrades using BOS

NOTE This feature is the replacement for the Compute Rolling Upgrade Service (CRUS). CRUS was deprecated in CSM 1.2.0 and removed in CSM 1.5.0. See Deprecated Features.

BOS v2 allows users to stage boot artifacts, configuration, and an operation such as a reboot. The workload manager can later trigger the operation through BOS to apply that staged information, allowing rolling updates when nodes have no job running on them.

Workflow

  1. An administrator configures the workload manager to call the applystaged endpoint of BOS with a payload containing the xnames of the components to be operated on. For more information on the endpoint and payload, see Applying a staged state.

    NOTE The boot artifacts and configuration staged with BOS will not be applied if the node is rebooted outside of BOS. BOS caches the staged boot information and configuration internally, and does not update the Boot Script Service (BSS) or the Configuration Framework Service (CFS) until immediately before it boots or reboots the nodes.

  2. An administrator stages all of the boot information through BOS v2 by creating a session with the staged parameter.

    BOS will cache the boot artifacts and configuration and associate that information with the specified nodes. These nodes will not be booted or rebooted as a part of this staging. For more information on staging sessions, see Creating a staged session.

  3. The administrator indicates to the workload manager that a node reboot is needed.

  4. The workload manager calls the applystaged endpoint for each node when it is ready.

    BOS then copies the information staged for these components into their desired state and begins operating on the nodes to make their actual state match the new desired state.
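
Step 4 can be sketched in Python as follows. This is a minimal illustration, not the workload manager's actual integration. The gateway host name, the URL path, the token value, and the sample xname are assumptions made for the example. Only the xnames key in the payload comes from the workflow above. The sketch builds the request without sending it.

```python
import json
import urllib.request

# ASSUMPTION: placeholder URL. The host name and path are illustrative and
# are not taken from this page; check the BOS API reference for real values.
BOS_APPLYSTAGED_URL = "https://api-gw-service-nmn.local/apis/bos/v2/applystaged"


def build_applystaged_request(xnames, token, url=BOS_APPLYSTAGED_URL):
    """Build (but do not send) a POST request asking BOS to apply staged state.

    xnames: iterable of component xnames that are ready to be operated on.
    token:  bearer token string (hypothetical authentication scheme).
    """
    xnames = list(xnames)
    if not xnames:
        raise ValueError("at least one xname is required")
    # The payload carries the xnames of the components, as described in step 1.
    body = json.dumps({"xnames": xnames}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
    )


if __name__ == "__main__":
    # Hypothetical xname and token, for illustration only.
    req = build_applystaged_request(["x3000c0s19b1n0"], token="EXAMPLE_TOKEN")
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
    # Sending would be: urllib.request.urlopen(req)
```

Keeping request construction separate from sending makes it possible to inspect the payload before any node is rebooted.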

Using staged sessions with Slurm

HPE provides the slurm-reboot.py script, which calls BOS to apply a staged session. This feature was introduced with Slurm 1.2.5 for Cray Programming Environment (CPE) 22.10. Slurm can be configured to call this reboot script using the RebootProgram value in the slurm.conf file.

SlurmctldParameters=reboot_from_controller
RebootProgram=/slurm-reboot.py

See the Slurm documentation for more information on configuring this value.

Once this configuration is in place and a staged session has been created, administrators can issue a scontrol reboot command to Slurm. Slurm will then use the reboot script to call the BOS applystaged endpoint.

(uan#) A sample reboot command:

scontrol reboot nextstate=Down Reason="Rolling Reboot" nid00000[6-7]
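
For sites that script rolling reboots, the command above can be assembled programmatically. The following is a rough sketch: the helper name build_scontrol_reboot is hypothetical, and it only constructs the argument list shown in the sample command without running it.

```python
import shlex


def build_scontrol_reboot(nodelist, reason, nextstate="Down"):
    """Return the argv list for the sample 'scontrol reboot' command.

    nodelist:  Slurm node expression, for example "nid00000[6-7]".
    reason:    free-text reason recorded by Slurm.
    nextstate: state the nodes should enter after the reboot.
    """
    if not nodelist:
        raise ValueError("nodelist must not be empty")
    # Each element is one argument, so the space inside the reason needs
    # no shell quoting when the list is passed straight to subprocess.
    return [
        "scontrol",
        "reboot",
        f"nextstate={nextstate}",
        f"Reason={reason}",
        nodelist,
    ]


if __name__ == "__main__":
    argv = build_scontrol_reboot("nid00000[6-7]", "Rolling Reboot")
    # Print a shell-safe rendering of the command for review.
    print(shlex.join(argv))
    # To execute on a node with Slurm installed:
    #   import subprocess; subprocess.run(argv, check=True)
```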