The ability to ignore non-compute nodes (NCNs) is turned off by default. Management nodes, NCNs, and their BMCs are also not locked by default. The administrator must lock the NCNs and their BMCs to prevent unwanted actions from affecting these nodes.
This section only covers using locks with the Hardware State Manager (HSM). For more information on ignoring nodes, refer to the following sections:
Locking a node and its BMC prevents the following actions: firmware operations with FAS, and power off and reset operations with CAPMC.
Doing any of these actions by accident will shut down a management node. If the node is a Kubernetes master or worker node, this can have serious negative effects on system operations. If a single node is taken down by mistake, it is possible that services will recover. If all management nodes are taken down, or all Kubernetes worker nodes are taken down by mistake, the system must be restarted.
After critical nodes are locked, power/reset (CAPMC) or firmware (FAS) operations cannot affect the nodes unless they are unlocked. For example, any locked node that is included in a list of nodes to be reset will result in a failure.
To best protect system health, NCNs and their BMCs should be locked as early as possible in the install/upgrade cycle. The later in the process the locks are applied, the greater the risk of accidentally taking down a critical node. NCN locking can only be done after Kubernetes is running and the HSM service is operational.
Check whether HSM is running with the following command:
ncn# kubectl -n services get pods | grep smd
Example output:
cray-smd-848bcc875c-6wqsh 2/2 Running 0 9d
cray-smd-848bcc875c-hznqj 2/2 Running 0 9d
cray-smd-848bcc875c-tp6gf 2/2 Running 0 6d22h
cray-smd-init-2tnnq 0/2 Completed 0 9d
cray-smd-postgres-0 2/2 Running 0 19d
cray-smd-postgres-1 2/2 Running 0 6d21h
cray-smd-postgres-2 2/2 Running 0 19d
cray-smd-wait-for-postgres-4-7c78j 0/3 Completed 0 9d
The cray-smd pods need to be in the Running state, except for cray-smd-init and cray-smd-wait-for-postgres, which should be in the Completed state.
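The state check described above can be automated. The following is a minimal sketch, not part of the product: check_smd_pods is a hypothetical helper name, and it assumes the default column layout of kubectl get pods shown in the example output (pod name in column 1, status in column 3).

```shell
# Hypothetical helper (not shipped with the system).
# Reads `kubectl -n services get pods` output on stdin.
# Usage: kubectl -n services get pods | check_smd_pods && echo "HSM is ready"
check_smd_pods() {
    awk '
        $1 ~ /^cray-smd/ {
            seen = 1
            # The init and wait-for-postgres pods are one-shot jobs and should
            # have finished; every other cray-smd pod should be running.
            want = ($1 ~ /^cray-smd-(init|wait-for-postgres)/) ? "Completed" : "Running"
            if ($3 != want) {
                printf "%s is %s, expected %s\n", $1, $3, want
                bad = 1
            }
        }
        # Fail if any pod was in the wrong state or no cray-smd pod was found.
        END { if (bad || !seen) exit 1 }
    '
}
```

The helper prints one line per pod that is in an unexpected state and returns a non-zero status in that case, so it can gate the locking steps in a script.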
Any time a management NCN has to be power cycled, reset, or have its firmware updated, it and its BMC will first need to be unlocked. After the operation is complete, the targeted nodes and BMCs should once again be locked.
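The unlock, operate, relock sequence can be wrapped so that the relock step is not forgotten. This is a sketch only: with_unlocked is a hypothetical function name, and the cray hsm locks invocations are the ones documented later in this procedure, reused unchanged.

```shell
# Hypothetical wrapper (not shipped with the system).
# Usage: with_unlocked x3000c0s6b0n0,x3000c0s6b0 <maintenance command> [args...]
with_unlocked() {
    ids="$1"
    shift
    # Unlock the node and its BMC; stop if the unlock itself fails.
    cray hsm locks unlock create --role Management --component-ids "$ids" \
        --processing-model rigid || return 1
    # Run the maintenance command and remember its exit status.
    "$@"
    rc=$?
    # Relock even when the maintenance command failed.
    cray hsm locks lock create --role Management --component-ids "$ids" \
        --processing-model rigid || return 1
    return "$rc"
}
```

The wrapper returns the exit status of the maintenance command, or 1 if either the unlock or the relock fails, so a failed relock is not silently ignored.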
Run the lock_management_nodes.py script to lock all management nodes and BMCs that are not already locked:
ncn# /opt/cray/csm/scripts/admin_access/lock_management_nodes.py
The return value of the script is 0 if locking was successful. A non-zero return code means that manual intervention may be needed to lock the nodes. Continue below for manual steps.
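When the script is called from automation, its return value can be checked explicitly. The sketch below assumes only the script path given above; lock_all_or_warn and its optional path argument are hypothetical and exist for illustration.

```shell
# Hypothetical wrapper (not shipped with the system).
# Usage: lock_all_or_warn [path-to-locking-script]
lock_all_or_warn() {
    script="${1:-/opt/cray/csm/scripts/admin_access/lock_management_nodes.py}"
    if "$script"; then
        echo "All management nodes and BMCs are locked."
    else
        # In the else branch, $? still holds the script's exit status.
        rc=$?
        echo "Locking script exited with code $rc; continue with the manual steps." >&2
        return "$rc"
    fi
}
```

The wrapper passes the script's non-zero return code through to the caller, so a calling script can stop and prompt for the manual steps.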
Use the cray hsm locks lock command to perform locking.
NOTE: When locking NCNs, you must lock their node BMCs as well.
NOTE: The following steps assume both the management nodes and their BMCs are marked with the Management role in HSM. If they are not, see Set BMC Management Role.
The processing-model rigid parameter means that the operation must succeed on all target nodes or the entire operation will fail.
Lock the management nodes and BMCs.
ncn# cray hsm locks lock create --role Management --processing-model rigid
Example output:
Failure = []
[Counts]
Total = 16
Success = 16
Failure = 0
[Success]
ComponentIDs = [ "x3000c0s5b0n0", "x3000c0s4b0n0", "x3000c0s7b0n0", "x3000c0s6b0n0", "x3000c0s3b0n0", "x3000c0s2b0n0", "x3000c0s9b0n0", "x3000c0s8b0n0",
"x3000c0s5b0", "x3000c0s4b0", "x3000c0s7b0", "x3000c0s6b0", "x3000c0s3b0", "x3000c0s2b0", "x3000c0s9b0", "x3000c0s8b0",]
NOTE: The BMC of ncn-m001 typically does not exist in HSM under HSM State Components, and therefore cannot be locked.
Lock specific management nodes and their BMCs by listing their component IDs.
ncn# cray hsm locks lock create --role Management --component-ids x3000c0s6b0n0,x3000c0s6b0 --processing-model rigid
Example output:
Failure = []
[Counts]
Total = 2
Success = 2
Failure = 0
[Success]
ComponentIDs = [ "x3000c0s6b0n0", "x3000c0s6b0",]
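The result of a lock request can be checked by reading the Failure value in the Counts table of the output. The following sketch assumes the command output is in the TOML layout shown in the examples above; lock_failures is a hypothetical helper name.

```shell
# Hypothetical helper (not shipped with the system).
# Reads the output of a `cray hsm locks lock create` call on stdin and
# prints the Failure value from the [Counts] table.
lock_failures() {
    awk '
        # Track which TOML table the current line belongs to.
        /^\[/ { section = $0 }
        # Only the Failure key inside [Counts] is a number; the top-level
        # Failure key is a list and is skipped because no table is open yet.
        section == "[Counts]" && $1 == "Failure" { print $3; found = 1 }
        END { if (!found) exit 1 }
    '
}
```

A printed value of 0 means every targeted component was locked. The helper returns a non-zero status when no Counts table is present, for example when the command itself failed.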
NOTE: The BMC of ncn-m001 typically does not exist in HSM under HSM State Components, and therefore would not show up in the following command output.
Check the lock status of the management nodes and BMCs.
ncn# cray hsm state components list --type Node --role Management --format json | \
jq -c '.Components[]|.ID' | tr '\n' ',' | sed 's/,$/\n/' | \
xargs cray hsm locks status create --format toml --component-ids
Example output:
[[Components]]
ID = "x3000c0s1b0n0"
Locked = true
Reserved = false
ReservationDisabled = false
[[Components]]
ID = "x3000c0s5b0n0"
Locked = true
Reserved = false
ReservationDisabled = false
[[Components]]
ID = "x3000c0s4b0n0"
Locked = true
Reserved = false
ReservationDisabled = false
[[Components]]
ID = "x3000c0s7b0n0"
Locked = true
Reserved = false
ReservationDisabled = false
[[Components]]
ID = "x3000c0s6b0n0"
Locked = true
Reserved = false
ReservationDisabled = false
[[Components]]
ID = "x3000c0s3b0n0"
Locked = true
Reserved = false
ReservationDisabled = false
[[Components]]
ID = "x3000c0s2b0n0"
Locked = true
Reserved = false
ReservationDisabled = false
[[Components]]
ID = "x3000c0s9b0n0"
Locked = true
Reserved = false
ReservationDisabled = false
[[Components]]
ID = "x3000c0s8b0n0"
Locked = true
Reserved = false
ReservationDisabled = false
[[Components]]
ID = "x3000c0s5b0"
Locked = true
Reserved = false
ReservationDisabled = false
[[Components]]
ID = "x3000c0s4b0"
Locked = true
Reserved = false
ReservationDisabled = false
[[Components]]
ID = "x3000c0s7b0"
Locked = true
Reserved = false
ReservationDisabled = false
[[Components]]
ID = "x3000c0s6b0"
Locked = true
Reserved = false
ReservationDisabled = false
[[Components]]
ID = "x3000c0s3b0"
Locked = true
Reserved = false
ReservationDisabled = false
[[Components]]
ID = "x3000c0s2b0"
Locked = true
Reserved = false
ReservationDisabled = false
[[Components]]
ID = "x3000c0s9b0"
Locked = true
Reserved = false
ReservationDisabled = false
[[Components]]
ID = "x3000c0s8b0"
Locked = true
Reserved = false
ReservationDisabled = false
Use the cray hsm locks unlock command to perform unlocking.
NOTE: When unlocking NCNs, you must unlock their node BMCs as well.
NOTE: The following steps assume both the management nodes and their BMCs are marked with the Management role in HSM. If they are not, see Set BMC Management Role.
Unlock the management nodes and BMCs.
ncn# cray hsm locks unlock create --role Management --processing-model rigid
Example output:
Failure = []
[Counts]
Total = 16
Success = 16
Failure = 0
[Success]
ComponentIDs = [ "x3000c0s7b0n0", "x3000c0s6b0n0", "x3000c0s3b0n0", "x3000c0s2b0n0", "x3000c0s9b0n0", "x3000c0s8b0n0", "x3000c0s5b0n0", "x3000c0s4b0n0",
"x3000c0s5b0", "x3000c0s4b0", "x3000c0s7b0", "x3000c0s6b0", "x3000c0s3b0", "x3000c0s2b0", "x3000c0s9b0", "x3000c0s8b0",]
Unlock specific management nodes and their BMCs by listing their component IDs.
ncn# cray hsm locks unlock create --role Management --component-ids x3000c0s6b0n0,x3000c0s6b0 --processing-model rigid
Example output:
Failure = []
[Counts]
Total = 2
Success = 2
Failure = 0
[Success]
ComponentIDs = [ "x3000c0s6b0n0", "x3000c0s6b0",]