Shut Down and Power Off Compute and User Access Nodes

Shut down and power off compute nodes and User Access Nodes (UANs). This procedure powers off all compute nodes and UANs as part of a full system shutdown.

Prerequisites

The cray and sat commands must be initialized and authenticated with valid Keycloak credentials. If they have not been prepared, see Configure the Cray Command Line Interface (cray CLI) to set up cray, and refer to the “SAT Authentication” section of the HPE Cray EX System Admin Toolkit (SAT) (S-8031) product stream documentation for instructions on acquiring a SAT authentication token.

Procedure

  1. (ncn-mw#) List the available Boot Orchestration Service (BOS) session templates.

    Identify the BOS session template names, such as cos-2.0.x and uan-slurm, and choose the appropriate compute and UAN templates for the shutdown.

    cray bos v1 sessiontemplate list --format toml
    

    Example output excerpts:

    [[results]]
    name = "cos-2.0.x"
    description = "BOS session template for booting compute nodes, generated by the installation"
    
    [...]
    
    name = "slurm"
    description = "BOS session template for booting compute nodes, generated by the installation"
    
    [...]
    
    name = "uan-slurm"
    description = "Template for booting UANs with Slurm"
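
On a system with many templates, the names can be pulled out of the TOML listing rather than read by eye. The following is a minimal sketch, not part of the cray CLI: template_names is a hypothetical helper, and it assumes each template name appears on its own name = "..." line as in the excerpt above. If nested tables in the real output also contain name keys, those are printed too, so review the result.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the value of every `name = "..."` line
# found in TOML text read from stdin, one name per line.
template_names() {
    sed -n 's/^[[:space:]]*name[[:space:]]*=[[:space:]]*"\(.*\)"[[:space:]]*$/\1/p'
}

# Assumed usage on a management node:
#   cray bos v1 sessiontemplate list --format toml | template_names
```
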
    
  2. (ncn-mw#) To display the full contents of a single session template, such as cos-2.0.x, use the describe subcommand.

    cray bos v1 sessiontemplate describe cos-2.0.x
    
  3. (ncn-mw#) Use sat bootsys shutdown to shut down and power off UANs and compute nodes.

    Attention: Replace COS_SESSION_TEMPLATE and UAN_SESSION_TEMPLATE in the example with the session template names chosen in the first step.

    The optional --loglevel debug argument provides more information as the system shuts down. If used, it must be placed after sat but before bootsys, for example: sat --loglevel debug bootsys shutdown --stage bos-operations [...]

    sat bootsys shutdown --stage bos-operations \
             --bos-templates COS_SESSION_TEMPLATE,UAN_SESSION_TEMPLATE
    

    Example output:

    Started boot operation on BOS session templates: cos-2.0.x, uan.
    Waiting up to 600 seconds for sessions to complete.
    
    Waiting for BOA k8s job with id boa-a1a697fc-e040-4707-8a44-a6aef9e4d6ea to complete. Session template: uan.
    To monitor the progress of this job, run the following command in a separate window:
        'kubectl -n services logs -c boa -f --selector job-name=boa-a1a697fc-e040-4707-8a44-a6aef9e4d6ea'
    
    Waiting for BOA k8s job with id boa-79584ffe-104c-4766-b584-06c5a3a60996 to complete. Session template: cos-2.0.0.
    To monitor the progress of this job, run the following command in a separate window:
        'kubectl -n services logs -c boa -f --selector job-name=boa-79584ffe-104c-4766-b584-06c5a3a60996'
    
    [...]
    
    All BOS sessions completed.
    

    Note: In certain cases, the command may display an error similar to the following:

    ERROR: Failed to get state of nodes in session template 'UAN_SESSION_TEMPLATE': Failed to get state of nodes with role=['Application', 'Application_UAN'] for boot set 'BOOT_SET' of session template 'UAN_SESSION_TEMPLATE': GET request to URL 'https://api-gw-service-nmn.local/apis/smd/hsm/v2/State/Components' failed with status code 400: Bad Request. Bad Request Detail: bad query param: Argument was not a valid HMS Role
    

    This is a non-fatal error and does not affect the bos-operations stage of sat bootsys.

    Note: In certain cases, the command may fail before reaching the displayed timeout and show warnings similar to the following:

    WARNING: The 'kubectl wait' command failed instead of timing out. stderr: error: condition not met for jobs/boa-79584ffe-104c-4766-b584-06c5a3a60996
    

    The BOS operation can still proceed even with these warnings. However, the warnings may result in the bos-operations stage of the sat bootsys command exiting before the BOS operation is complete. Because of this, it is important to view the logs in order to monitor the boot and to verify that the nodes reached the expected state. Both of these recommendations are shown in the remaining steps.
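
Because sat bootsys may exit before the BOS operation completes, it can help to record the BOA job names while they are still on screen. The sketch below assumes the sat output was saved to a file; the helper names boa_job_names and boa_log_commands and the file path are hypothetical. It extracts each boa-<UUID> string and prints the matching log command in the same form that sat suggests.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print each unique BOA job name (boa-<UUID>)
# found in text read from stdin.
boa_job_names() {
    grep -oE 'boa-[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}' | sort -u
}

# Hypothetical helper: print one log-follow command per job name.
# The commands are only printed, not executed.
boa_log_commands() {
    boa_job_names | while read -r job; do
        printf 'kubectl -n services logs -c boa -f --selector job-name=%s\n' "$job"
    done
}

# Assumed usage, with the sat output saved to a file of your choosing:
#   boa_log_commands < /tmp/sat-shutdown.log
```

Each printed command can then be run in its own shell window, as described in the following steps.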

  4. (ncn-mw#) Use the job ID strings from the previous command to monitor the shutdown progress of the compute nodes.

    For example, use the cos-2.0.0 session boa-79584ffe-104c-4766-b584-06c5a3a60996.

    The command to run is displayed in the output of the sat bootsys shutdown command.

    kubectl logs -n services -c boa -f \
                 --selector job-name=boa-79584ffe-104c-4766-b584-06c5a3a60996
    

    Example output:

    2020-08-21 17:27:02,358 - DEBUG   - cray.boa - BOA starting
    2020-08-21 17:27:02,358 - DEBUG   - cray.boa - Boot Agent Image:  created.
    2020-08-21 17:27:02,358 - INFO    - cray.boa - Boot Session: boa-79584ffe-104c-4766-b584-06c5a3a60996
    2020-08-21 17:27:02,371 - INFO    - cray.boa.connection - Reattempting GET request for 'http://cray-cfs-api/apis/cfs/sessions'
    2020-08-21 17:27:02,373 - INFO    - cray.boa.connection - Reattempting GET request for 'http://cray-cfs-api/apis/cfs/sessions'
    2020-08-21 17:27:02,395 - INFO    - cray.boa.connection - Reattempting GET request for 'http://cray-cfs-api/apis/cfs/sessions'
    2020-08-21 17:27:02,437 - INFO    - cray.boa.connection - Reattempting GET request for 'http://cray-cfs-api/apis/cfs/sessions'
    2020-08-21 17:27:02,519 - INFO    - cray.boa.connection - Reattempting GET request for 'http://cray-cfs-api/apis/cfs/sessions'
    2020-08-21 17:27:02,681 - INFO    - cray.boa.connection - Reattempting GET request for 'http://cray-cfs-api/apis/cfs/sessions'
    2020-08-21 17:27:03,003 - INFO    - cray.boa.connection - Reattempting GET request for 'http://cray-cfs-api/apis/cfs/sessions'
    2020-08-21 17:27:03,645 - INFO    - cray.boa.connection - Reattempting GET request for 'http://cray-cfs-api/apis/cfs/sessions'
    

    The BOS shutdown session may or may not power off compute nodes depending on the session template being used.

  5. (ncn-mw#) In another shell window, use a similar command to monitor the UAN session.

    kubectl -n services logs -c boa -f --selector job-name=boa-a1a697fc-e040-4707-8a44-a6aef9e4d6ea
    
  6. (ncn-mw#) Check the status of both UAN and compute nodes to verify that they are Off.

    There may be a delay before nodes reach the Off state in the Hardware State Manager (HSM).

    sat status
    

    Example output:

    +----------------+------+----------+-------+------+---------+------+----------+-------------+----------+
    | xname          | Type | NID      | State | Flag | Enabled | Arch | Class    | Role        | Net Type |
    +----------------+------+----------+-------+------+---------+------+----------+-------------+----------+
    | x1000c0s0b0n0  | Node | 1001     | Off   | OK   | True    | X86  | Mountain | Compute     | Sling    |
    | x1000c0s0b0n1  | Node | 1002     | Off   | OK   | True    | X86  | Mountain | Compute     | Sling    |
    | x1000c0s0b1n0  | Node | 1003     | Off   | OK   | True    | X86  | Mountain | Compute     | Sling    |
    | x1000c0s0b1n1  | Node | 1004     | Off   | OK   | True    | X86  | Mountain | Compute     | Sling    |
    | x1000c0s1b0n0  | Node | 1005     | Off   | OK   | True    | X86  | Mountain | Compute     | Sling    |
    | x1000c0s1b0n1  | Node | 1006     | Off   | OK   | True    | X86  | Mountain | Compute     | Sling    |
    | x1000c0s1b1n0  | Node | 1007     | Off   | OK   | True    | X86  | Mountain | Compute     | Sling    |
    | x1000c0s1b1n1  | Node | 1008     | Off   | OK   | True    | X86  | Mountain | Compute     | Sling    |
    | x1000c1s0b0n0  | Node | 1033     | Off   | OK   | True    | X86  | Mountain | Compute     | Sling    |
    | x1000c1s0b0n1  | Node | 1034     | Off   | OK   | True    | X86  | Mountain | Compute     | Sling    |
    | x1000c1s0b1n0  | Node | 1035     | Off   | OK   | True    | X86  | Mountain | Compute     | Sling    |
    
    [...]
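
On a large system, the sat status table is easier to check with a filter than by eye. The following is a minimal sketch, assuming the table layout shown in the example output; nodes_not_off is a hypothetical helper, and the assumption that UANs appear with the role Application should be checked against the actual output. It prints the xname and state of every Compute or Application node that is not yet Off, and prints nothing when all of them are Off.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `sat status` table text on stdin and print
# "xname state" for each Compute or Application node whose State is not Off.
nodes_not_off() {
    awk -F'|' '
        # Header row: locate the State and Role columns by name.
        !found && /xname/ {
            for (i = 1; i <= NF; i++) {
                h = $i; gsub(/^ +| +$/, "", h)
                if (h == "State") sc = i
                if (h == "Role") rc = i
            }
            found = 1; next
        }
        # Data rows: border lines and "[...]" have no "|" separators and are skipped.
        found && NF > 2 && sc && rc {
            x = $2; s = $sc; r = $rc
            gsub(/^ +| +$/, "", x); gsub(/^ +| +$/, "", s); gsub(/^ +| +$/, "", r)
            if ((r == "Compute" || r == "Application") && s != "Off") print x, s
        }
    '
}

# Assumed usage on a management node:
#   sat status | nodes_not_off
```

Because HSM may lag behind the actual power state, rerun the check after a short wait if any nodes are still listed.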
    

Next Steps

Return to System Power Off Procedures and continue with the next step.