NOTE
This section is for Boot Orchestration Service (BOS) v2 only.
NOTE
This feature is the replacement for the Compute Rolling Upgrade Service (CRUS). CRUS was deprecated in CSM 1.2.0 and it will be removed in CSM 1.5.0. See Deprecated Features.
BOS v2 allows users to stage boot artifacts, configuration, and an operation such as a reboot. The workload manager can later trigger the operation through BOS to apply that staged information, allowing rolling updates when nodes have no job running on them.
An administrator configures the workload manager to call the applystaged endpoint of BOS with a payload containing the xnames of the components to be operated on.
For more information on the endpoint and payload, see Applying a staged state.
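As a rough illustration of the call described above, the following Python sketch builds (but does not send) an HTTP request to the applystaged endpoint. The base URL, the use of a bearer token, the `xnames` payload key, and the function name `build_applystaged_request` are assumptions made for this example; consult Applying a staged state for the authoritative endpoint and payload.

```python
import json
import urllib.request

# Assumption: address of the BOS v2 API behind the system's API gateway.
BOS_V2_URL = "https://api-gw-service-nmn.local/apis/bos/v2"


def build_applystaged_request(xnames, token, base_url=BOS_V2_URL):
    """Build a POST request for the BOS applystaged endpoint without sending it.

    xnames: iterable of component xnames to operate on.
    token:  API token (assumed here to be sent as a bearer token).
    """
    # Assumption: the payload is a JSON object with an "xnames" list.
    body = json.dumps({"xnames": list(xnames)}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/applystaged",
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
    )


if __name__ == "__main__":
    # Hypothetical xname and placeholder token, for illustration only.
    req = build_applystaged_request(["x3000c0s19b1n0"], token="EXAMPLE_TOKEN")
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
```

A workload manager hook would pass the resulting request to `urllib.request.urlopen` (or an equivalent HTTP client) once a node has no job running on it.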
NOTE
The boot artifacts and configuration staged with BOS will not be applied if the node is rebooted outside BOS. This is because BOS caches the staged boot information and configuration internally and does not update the Boot Script Service (BSS) or the Configuration Framework Service (CFS) until immediately before it boots or reboots the nodes.
An administrator stages all of the boot information through BOS v2 by creating a session with the staged parameter.
BOS will cache the boot artifacts and configuration and associate that information with the specified nodes. These nodes will not be booted or rebooted as a part of this staging. For more information on staging sessions, see Creating a staged session.
The administrator indicates to the workload manager that a node reboot is needed.
The workload manager calls the applystaged endpoint for each node when it is ready.
BOS then copies the information staged for these components into their desired state and begins operating on the nodes to make their actual state match the new desired state.
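The staging and apply steps above can be pictured with a small toy model. This is only a conceptual sketch, not the BOS implementation or its data schema: the `Component` class, its field names, and the helper functions are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Component:
    """Toy stand-in for a component record (not the real BOS schema)."""
    xname: str
    actual_state: dict = field(default_factory=dict)
    desired_state: dict = field(default_factory=dict)
    staged_state: dict = field(default_factory=dict)


def stage(component, boot_artifacts, configuration):
    """Cache boot information for later use; the desired state is untouched."""
    component.staged_state = {
        "boot_artifacts": boot_artifacts,
        "configuration": configuration,
    }


def apply_staged(component):
    """Copy staged information into the desired state.

    Returns False when nothing has been staged for the component.
    """
    if not component.staged_state:
        return False
    component.desired_state = dict(component.staged_state)
    return True


def needs_operation(component):
    """True when the actual state no longer matches the desired state."""
    return component.actual_state != component.desired_state


if __name__ == "__main__":
    current = {"boot_artifacts": "image-old", "configuration": "config-old"}
    node = Component("x3000c0s19b1n0", dict(current), dict(current))
    stage(node, "image-new", "config-new")
    print(needs_operation(node))  # staging alone triggers nothing
    apply_staged(node)
    print(needs_operation(node))  # now the node must be operated on
```

The point of the model is that staging changes nothing the node boots from; only the apply step makes the desired state diverge from the actual state, which is what prompts a reboot.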
HPE provides the slurm-reboot.py script, which calls BOS to apply a staged session.
This feature was introduced with Slurm 1.2.5 for Cray Programming Environment (CPE) 22.10.
Slurm can be configured to call this reboot script using the RebootProgram value in the slurm.conf file.
SlurmctldParameters=reboot_from_controller
RebootProgram=/slurm-reboot.py
See the Slurm documentation for more information on configuring this value.
Once this configuration is in place and a staged session has been created, administrators can issue a scontrol reboot command to Slurm.
Slurm will then use the reboot script to call the BOS applystaged endpoint.
(uan#) A sample reboot command:
scontrol reboot nextstate=Down Reason="Rolling Reboot" nid00000[6-7]
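The node expression in the sample command, nid00000[6-7], is a bracketed range covering two nodes. The Python sketch below shows how such a simple expression expands into individual node names. It is an illustration only, handles a single bracket group, and is not taken from slurm-reboot.py; the function name `expand_hostlist` is hypothetical. On a live system, `scontrol show hostnames` performs this expansion.

```python
import re


def expand_hostlist(expr):
    """Expand a simple host expression such as 'nid00000[6-7]'.

    Supports one bracket group holding comma-separated numbers or
    low-high ranges. Expressions without brackets are returned unchanged.
    """
    match = re.fullmatch(r"([^\[\]]+)\[([^\[\]]+)\]([^\[\]]*)", expr)
    if match is None:
        return [expr]
    prefix, body, suffix = match.groups()
    hosts = []
    for part in body.split(","):
        low, _, high = part.partition("-")
        high = high or low          # a single number is a one-element range
        width = len(low)            # preserve zero padding inside the brackets
        for number in range(int(low), int(high) + 1):
            hosts.append(f"{prefix}{number:0{width}d}{suffix}")
    return hosts


if __name__ == "__main__":
    print(expand_hostlist("nid00000[6-7]"))  # ['nid000006', 'nid000007']
```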