Shut Down and Power Off Compute and User Access Nodes

Shut down and power off compute and user access nodes (UANs). This procedure powers off all compute nodes in the context of an entire system shutdown.

Prerequisites

The cray and sat commands must be initialized and authenticated with valid credentials for Keycloak. If these have not been prepared, then see Configure the Cray CLI and refer to the “SAT Authentication” section of the HPE Cray EX System Admin Toolkit (SAT) product stream documentation (S-8031) for instructions on how to acquire a SAT authentication token.

Procedure

  1. List the available Boot Orchestration Service (BOS) session templates.

    Identify the BOS session template names, such as cos-2.0.x and uan-slurm, and choose the appropriate compute and UAN templates for the shutdown.

    ncn-mw# cray bos sessiontemplate list --format toml
    

    Example output excerpts:

    [[results]]
    name = "cos-2.0.x"
    description = "BOS session template for booting compute nodes, generated by the installation"
    
    [...]
    
    name = "slurm"
    description = "BOS session template for booting compute nodes, generated by the installation"
    
    [...]
    
    name = "uan-slurm"
    description = "Template for booting UANs with Slurm"
    
  2. To display more information about a session template, for example cos-2.0.x, use the describe subcommand.

    ncn-mw# cray bos sessiontemplate describe cos-2.0.x
    
  3. Use sat bootsys shutdown to shut down and power off UANs and compute nodes.

    Attention: Replace COS_SESSION_TEMPLATE and UAN_SESSION_TEMPLATE in the example with the names of the required session templates.

    The optional --loglevel debug argument provides more information as the system shuts down. If used, it must be placed after sat but before bootsys.

    ncn-mw# sat bootsys shutdown --stage bos-operations \
             --bos-templates COS_SESSION_TEMPLATE,UAN_SESSION_TEMPLATE
    

    Example output:

    Started boot operation on BOS session templates: cos-2.0.x, uan.
    Waiting up to 600 seconds for sessions to complete.
    
    Waiting for BOA k8s job with id boa-a1a697fc-e040-4707-8a44-a6aef9e4d6ea to complete. Session template: uan.
    To monitor the progress of this job, run the following command in a separate window:
        'kubectl -n services logs -c boa -f --selector job-name=boa-a1a697fc-e040-4707-8a44-a6aef9e4d6ea'
    
    Waiting for BOA k8s job with id boa-79584ffe-104c-4766-b584-06c5a3a60996 to complete. Session template: cos-2.0.0.
    To monitor the progress of this job, run the following command in a separate window:
        'kubectl -n services logs -c boa -f --selector job-name=boa-79584ffe-104c-4766-b584-06c5a3a60996'
    
    [...]
    
    All BOS sessions completed.
    
  4. Use the job ID strings from the previous command (for example, boa-79584ffe-104c-4766-b584-06c5a3a60996 for the cos-2.0.0 session) to monitor the progress of the compute node shutdown.

    The command to run is displayed in the output of the sat bootsys shutdown command.

    ncn-mw# kubectl logs -n services -c boa -f \
                 --selector job-name=boa-79584ffe-104c-4766-b584-06c5a3a60996
    

    Example output:

    2020-08-21 17:27:02,358 - DEBUG   - cray.boa - BOA starting
    2020-08-21 17:27:02,358 - DEBUG   - cray.boa - Boot Agent Image:  created.
    2020-08-21 17:27:02,358 - INFO    - cray.boa - Boot Session: boa-79584ffe-104c-4766-b584-06c5a3a60996
    2020-08-21 17:27:02,371 - INFO    - cray.boa.connection - Reattempting GET request for 'http://cray-cfs-api/apis/cfs/sessions'
    2020-08-21 17:27:02,373 - INFO    - cray.boa.connection - Reattempting GET request for 'http://cray-cfs-api/apis/cfs/sessions'
    2020-08-21 17:27:02,395 - INFO    - cray.boa.connection - Reattempting GET request for 'http://cray-cfs-api/apis/cfs/sessions'
    2020-08-21 17:27:02,437 - INFO    - cray.boa.connection - Reattempting GET request for 'http://cray-cfs-api/apis/cfs/sessions'
    2020-08-21 17:27:02,519 - INFO    - cray.boa.connection - Reattempting GET request for 'http://cray-cfs-api/apis/cfs/sessions'
    2020-08-21 17:27:02,681 - INFO    - cray.boa.connection - Reattempting GET request for 'http://cray-cfs-api/apis/cfs/sessions'
    2020-08-21 17:27:03,003 - INFO    - cray.boa.connection - Reattempting GET request for 'http://cray-cfs-api/apis/cfs/sessions'
    2020-08-21 17:27:03,645 - INFO    - cray.boa.connection - Reattempting GET request for 'http://cray-cfs-api/apis/cfs/sessions'
    

    The BOS shutdown session may or may not power off compute nodes depending on the session template being used.

  5. In another shell window, use a similar command to monitor the UAN session.

    ncn-mw# kubectl -n services logs -c boa -f --selector job-name=boa-a1a697fc-e040-4707-8a44-a6aef9e4d6ea
    
  6. Check the status of UAN and compute nodes to verify they are Off.

    There may be a delay before nodes reach the Off state in the Hardware State Manager (HSM).

    ncn-mw# sat status
    

    Example output:

    +----------------+------+----------+-------+------+---------+------+----------+-------------+----------+
    | xname          | Type | NID      | State | Flag | Enabled | Arch | Class    | Role        | Net Type |
    +----------------+------+----------+-------+------+---------+------+----------+-------------+----------+
    | x1000c0s0b0n0  | Node | 1001     | Off   | OK   | True    | X86  | Mountain | Compute     | Sling    |
    | x1000c0s0b0n1  | Node | 1002     | Off   | OK   | True    | X86  | Mountain | Compute     | Sling    |
    | x1000c0s0b1n0  | Node | 1003     | Off   | OK   | True    | X86  | Mountain | Compute     | Sling    |
    | x1000c0s0b1n1  | Node | 1004     | Off   | OK   | True    | X86  | Mountain | Compute     | Sling    |
    | x1000c0s1b0n0  | Node | 1005     | Off   | OK   | True    | X86  | Mountain | Compute     | Sling    |
    | x1000c0s1b0n1  | Node | 1006     | Off   | OK   | True    | X86  | Mountain | Compute     | Sling    |
    | x1000c0s1b1n0  | Node | 1007     | Off   | OK   | True    | X86  | Mountain | Compute     | Sling    |
    | x1000c0s1b1n1  | Node | 1008     | Off   | OK   | True    | X86  | Mountain | Compute     | Sling    |
    | x1000c1s0b0n0  | Node | 1033     | Off   | OK   | True    | X86  | Mountain | Compute     | Sling    |
    | x1000c1s0b0n1  | Node | 1034     | Off   | OK   | True    | X86  | Mountain | Compute     | Sling    |
    | x1000c1s0b1n0  | Node | 1035     | Off   | OK   | True    | X86  | Mountain | Compute     | Sling    |
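
Because nodes can take some time to reach the Off state, it may help to filter the sat status table down to the nodes that are not yet Off. The following is a minimal sketch, not part of SAT: nodes_not_off is a hypothetical helper that parses the table layout shown in the example output above, and it assumes that the xname, State, and Role column headers appear exactly as shown. Passing Application to cover UANs is also an assumption; check the Role column on the target system.

```shell
# Hypothetical helper (not part of SAT): reads the table printed by
# `sat status` on stdin and prints the xname of every node in the given
# roles whose State column is not "Off".
# Usage: sat status | nodes_not_off [ROLE...]   (default role: Compute)
nodes_not_off() {
    roles="$*"
    [ -n "$roles" ] || roles="Compute"
    awk -F'|' -v roles="$roles" '
        function trim(s) { gsub(/^[ \t]+|[ \t]+$/, "", s); return s }
        BEGIN {
            # Build a lookup table of the roles to check.
            n = split(roles, list, " ")
            for (i = 1; i <= n; i++) want[list[i]] = 1
        }
        # Header row: record which columns hold xname, State, and Role.
        !seen && /xname/ {
            for (i = 1; i <= NF; i++) {
                h = trim($i)
                if (h == "xname") xc = i
                else if (h == "State") sc = i
                else if (h == "Role") rc = i
            }
            seen = 1
            next
        }
        # Data rows: border lines contain no "|" and so have a single field.
        seen && NF > 2 {
            if ((trim($rc) in want) && trim($sc) != "Off") print trim($xc)
        }
    '
}
```

For example, `sat status | nodes_not_off Compute Application` prints nothing once every matching node is Off, so the command can be repeated until its output is empty.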
    

Next steps

Return to System Power Off Procedures and continue with the next step.