Power on liquid-cooled and standard rack cabinet PDUs.
Liquid-cooled Cabinets - HPE Cray EX liquid-cooled cabinet CDU and PDU circuit breakers are controlled manually.
After the CDU is switched on and healthy, the liquid-cooled PDU circuit breakers can be switched ON. With PDU breakers ON, the Chassis Management Modules (CMM) and Cabinet Environmental Controllers (CEC) power on and boot. These devices can then communicate with the management cluster and larger system management network. HVDC power remains OFF on liquid-cooled chassis until environmental conditions are normal and the CMMs receive a chassis power-on command from Cray System Management (CSM) software.
Standard Racks - HPE Cray standard EIA racks include redundant PDUs. Some PDU models may require a flat-blade screwdriver to open or close the PDU circuit breakers.
An authentication token is required to use the sat command. See the "SAT Authentication" section of the HPE Cray EX System Admin Toolkit (SAT) product stream documentation (S-8031) for instructions on how to acquire a SAT authentication token.
Verify with site management that it is safe to power on the system.
If the system does not have Cray EX liquid-cooled cabinets, proceed to Power On Standard Rack PDU Circuit Breakers.
Power on the CDU for the cabinet cooling group.
Open the rear door of the CDU.
Set the control panel circuit breakers to ON.
Set the PDU circuit breakers to ON in each Cray EX cabinet.
Verify the status LEDs on the PSU are OK.
(ncn-m001#) Unsuspend the hms-discovery cronjob.
kubectl -n services patch cronjobs hms-discovery -p '{"spec" : {"suspend" : false }}'
Example output from kubectl -n services get cronjobs hms-discovery after the patch is applied:
NAME SCHEDULE SUSPEND ACTIVE LAST SCHEDULE AGE
hms-discovery */3 * * * * False 1 41s 33d
A value of False in the SUSPEND column indicates that the cronjob is no longer suspended. A value of 1 in the ACTIVE column indicates that a Kubernetes job is currently running for the cronjob.
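The SUSPEND check described above can also be scripted. The following is a minimal sketch, not part of the documented procedure: the helper name cronjob_suspend_state is hypothetical, and it assumes the column layout shown in the example output. Because the cron schedule itself contains spaces, the helper counts columns from the right rather than from the left.

```shell
#!/bin/sh
# Hypothetical helper: read `kubectl get cronjobs` output on stdin and print
# the SUSPEND value for the named cronjob.
# The SCHEDULE column spans five whitespace-separated fields, so SUSPEND is
# located by counting back from the end: SUSPEND ACTIVE LAST-SCHEDULE AGE.
cronjob_suspend_state() {
    awk -v name="$1" '$1 == name { print $(NF-3) }'
}

# Usage (assumes kubectl access to the services namespace):
#   kubectl -n services get cronjobs hms-discovery | cronjob_suspend_state hms-discovery
```

Where table parsing is undesirable, kubectl -n services get cronjob hms-discovery -o jsonpath='{.spec.suspend}' reads the same field directly from the cronjob spec.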
(ncn-m001#) Use the System Admin Toolkit (sat) to power on liquid-cooled cabinets, chassis, and slots.
sat bootsys boot --stage cabinet-power
This command first resumes the hms-discovery Kubernetes cronjob and waits for it to be scheduled. Then, the hms-discovery job initiates power-on of the liquid-cooled cabinets. Finally, the sat bootsys command waits for the components in the liquid-cooled cabinets to be powered on. The sat bootsys command controls power only to liquid-cooled cabinets.
The sat bootsys command may time out while waiting for the hms-discovery cronjob to be scheduled and display the following message:
ERROR: The cronjob hms-discovery in namespace services was not scheduled within expected window after being resumed.
If this occurs, first check if the cronjob needs to be re-created. To do this, follow the instructions in the Check cronjobs section of the Power On and Start the Management Kubernetes Cluster procedure.
If the cronjob does not need to be re-created and has been scheduled within the time expected (based on its cron schedule), execute the sat bootsys boot --stage cabinet-power command again.
If sat bootsys fails to power on the cabinets through hms-discovery, then use CAPMC to manually power on the cabinet chassis, compute blade slots, and all populated switch blade slots (1, 3, 5, and 7). This example shows cabinets 1000-1003.
cray capmc xname_on create --xnames x[1000-1003]c[0-7] --format json
cray capmc xname_on create --xnames x[1000-1003]c[0-7]s[0-7] --format json
cray capmc xname_on create --xnames x[1000-1003]c[0-7]r[1,3,5,7] --format json
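To make the bracketed ranges in these commands concrete, the sketch below enumerates the chassis xnames that x[1000-1003]c[0-7] covers. It is only an illustration of the naming pattern: list_chassis is a hypothetical helper, it does not call CAPMC, and it assumes eight chassis (c0 through c7) per cabinet as in the example above.

```shell
#!/bin/sh
# Hypothetical helper: print one chassis xname per line for every cabinet in
# the inclusive range [first_cabinet, last_cabinet], chassis c0 through c7.
list_chassis() {
    cab="$1"
    while [ "$cab" -le "$2" ]; do
        ch=0
        while [ "$ch" -le 7 ]; do
            echo "x${cab}c${ch}"
            ch=$((ch + 1))
        done
        cab=$((cab + 1))
    done
}

# Cabinets 1000-1003: 4 cabinets x 8 chassis = 32 xnames,
# from x1000c0 through x1003c7.
list_chassis 1000 1003
```

Comparing this list against the chassis reported by the system is one way to confirm that a range covers exactly the cabinets intended before powering anything on.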
(ncn-m001#) Check the power status for every liquid-cooled cabinet Chassis.
The State should be On for every Chassis.
sat status --types Chassis
Example output.
+---------+---------+-------+------+---------+------+----------+----------+
| xname | Type | State | Flag | Enabled | Arch | Class | Net Type |
+---------+---------+-------+------+---------+------+----------+----------+
| x1020c0 | Chassis | On | OK | True | X86 | Mountain | Sling |
| x1020c1 | Chassis | On | OK | True | X86 | Mountain | Sling |
| x1020c2 | Chassis | On | OK | True | X86 | Mountain | Sling |
| x1020c3 | Chassis | On | OK | True | X86 | Mountain | Sling |
| x1020c4 | Chassis | On | OK | True | X86 | Mountain | Sling |
| x1020c5 | Chassis | On | OK | True | X86 | Mountain | Sling |
| x1020c6 | Chassis | On | OK | True | X86 | Mountain | Sling |
| x1020c7 | Chassis | On | OK | True | X86 | Mountain | Sling |
...
+---------+---------+-------+------+---------+------+----------+----------+
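On systems with many cabinets, scanning this table by eye is error-prone. The following sketch filters the table for any Chassis whose State is not On. The helper name chassis_not_on is hypothetical, and the parsing assumes the pipe-delimited layout shown in the example output, with xname in the first column and State in the third.

```shell
#!/bin/sh
# Hypothetical helper: read a `sat status --types Chassis` table on stdin and
# print "<xname> <state>" for each chassis whose State is not "On".
# Border lines and the "..." placeholder contain no "|" and are skipped.
chassis_not_on() {
    awk -F'|' 'NF > 3 {
        gsub(/ /, "", $2)   # xname column
        gsub(/ /, "", $4)   # State column
        if ($2 != "xname" && $4 != "On") print $2, $4
    }'
}

# Usage: empty output means every listed chassis reports On.
#   sat status --types Chassis | chassis_not_on
```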
Switch the standard rack compute cabinet PDU circuit breakers to ON.
This applies power to the server BMCs and connects them to the management network. Compute nodes do not power on and boot automatically. The Boot Orchestration Service (BOS) brings up compute nodes and User Access Nodes (UANs).
If necessary, use IPMI commands to power on individual servers.
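As one possible form of such an IPMI command, the sketch below wraps ipmitool in a small helper. This is an assumption rather than the documented method: ipmi_power_on is a hypothetical name, the BMC hostname in the usage line is a placeholder, and the BMC user name is taken from an IPMI_USER variable chosen for this example. The -E option tells ipmitool to read the password from the IPMI_PASSWORD environment variable so that it does not appear on the command line.

```shell
#!/bin/sh
# Hypothetical helper: power on one server through its BMC using ipmitool.
# Set DRY_RUN=1 to print the command instead of executing it.
# IPMI_USER is a variable name chosen for this sketch; IPMI_PASSWORD is the
# variable that ipmitool -E reads the password from.
ipmi_power_on() {
    bmc="$1"
    ${DRY_RUN:+echo} ipmitool -I lanplus -H "$bmc" -U "$IPMI_USER" -E chassis power on
}

# Usage (placeholder BMC hostname; substitute a real one for the site):
#   export IPMI_USER=root IPMI_PASSWORD=...
#   DRY_RUN=1 ipmi_power_on example-bmc-hostname
```

Running with DRY_RUN=1 first shows exactly what would be sent to the BMC before any power state changes.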
Verify that all system management network switches and Slingshot network switches are powered on in each rack, and that there are no error LEDs or hardware failures.
Return to System Power On Procedures and continue with the next step.