The ability to ignore non-compute nodes (NCNs) is turned off by default. Management nodes and NCNs are also not locked by default. The administrator must lock the NCNs to prevent unwanted actions from affecting these nodes.
This section only covers using locks with the Hardware State Manager (HSM). For more information on ignoring nodes, refer to the following sections:
The following actions can be prevented when a node is locked: power off and reset operations (CAPMC), and firmware operations (FAS).
Doing any of these actions by accident will shut down a management node. If the node is a Kubernetes master or worker node, this can have serious negative effects on system operations. If a single node is taken down by mistake, it is possible that services will recover. If all management nodes are taken down, or all Kubernetes worker nodes are taken down by mistake, the system must be restarted.
After critical nodes are locked, power/reset (CAPMC) or firmware (FAS) operations cannot affect the nodes unless they are unlocked. For example, any locked node that is included in a list of nodes to be reset will result in a failure.
To best protect system health, NCNs should be locked as early as possible in the install/upgrade cycle. The later in the process, the more risk there is of accidentally taking down a critical node. NCN locking must be done after Kubernetes is running and the HSM service is operational.
Check whether HSM is running with the following command:
ncn# kubectl -n services get pods | grep smd
Example output:
cray-smd-848bcc875c-6wqsh 2/2 Running 0 9d
cray-smd-848bcc875c-hznqj 2/2 Running 0 9d
cray-smd-848bcc875c-tp6gf 2/2 Running 0 6d22h
cray-smd-init-2tnnq 0/2 Completed 0 9d
cray-smd-postgres-0 2/2 Running 0 19d
cray-smd-postgres-1 2/2 Running 0 6d21h
cray-smd-postgres-2 2/2 Running 0 19d
cray-smd-wait-for-postgres-4-7c78j 0/3 Completed 0 9d
The cray-smd pods need to be in the Running state, except for cray-smd-init and cray-smd-wait-for-postgres, which should be in the Completed state.
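The check described above can also be scripted. The following is a minimal sketch, not part of any CSM tooling: check_smd_pods is a hypothetical helper name, and it assumes the pod listing has the column layout shown in the example output (name, ready count, status).

```shell
# Hypothetical helper: read `kubectl -n services get pods` output on stdin and
# report any cray-smd pod that is not in its expected state.
check_smd_pods() {
    awk '
        $1 ~ /^cray-smd/ {
            seen = 1
            # The init and wait-for-postgres pods are expected to have finished.
            if ($1 ~ /^cray-smd-(init|wait-for-postgres)/) want = "Completed"
            else want = "Running"
            if ($3 != want) {
                printf "%s is %s, expected %s\n", $1, $3, want
                bad = 1
            }
        }
        # Fail if any pod is in the wrong state or no cray-smd pod was listed.
        END { if (bad || !seen) exit 1 }
    '
}

# Possible usage on an NCN:
#   kubectl -n services get pods | check_smd_pods && echo "HSM pods look ready"
```

The helper prints one line per pod in an unexpected state and returns a non-zero status, so it can gate a locking script.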
Any time a management NCN has to be power cycled, reset, or have its firmware updated, it will first need to be unlocked. After the operation is complete, the targeted nodes should once again be locked.
Use the cray hsm locks lock command to perform locking.
NOTE: When locking NCNs, you must lock their NodeBMCs as well.
The --processing-model rigid parameter means that the operation must succeed on all target nodes or the entire operation will fail.
Lock the management nodes.
ncn# cray hsm locks lock create --role Management --processing-model rigid
Example output:
Failure = []
[Counts]
Total = 8
Success = 8
Failure = 0
[Success]
ComponentIDs = [ "x3000c0s5b0n0", "x3000c0s4b0n0", "x3000c0s7b0n0", "x3000c0s6b0n0", "x3000c0s3b0n0", "x3000c0s2b0n0", "x3000c0s9b0n0", "x3000c0s8b0n0",]
Lock the NodeBMCs of those management nodes.
ncn# cray hsm locks lock create --component-ids \
$(cray hsm state components list --role management --type node --format json | jq '.Components[].ID' |
sed 's/n[0-9]*//;s/"//g' | tr '\n' ',' | sed 's/.$//')
Example output:
Failure = []
[Counts]
Total = 8
Success = 8
Failure = 0
[Success]
ComponentIDs = [ "x3000c0s5b0", "x3000c0s4b0", "x3000c0s7b0", "x3000c0s6b0", "x3000c0s3b0", "x3000c0s2b0", "x3000c0s9b0", "x3000c0s8b0",]
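When locking is scripted, the failure count can be read from the command output instead of inspected by eye. The sketch below assumes the output has the shape shown in the examples above (a [Counts] section containing a Failure line); lock_failure_count is a hypothetical helper name.

```shell
# Hypothetical helper: read lock/unlock command output on stdin and print the
# Failure value from the [Counts] section.
lock_failure_count() {
    awk '
        # Remember which section header was seen last, e.g. "[Counts]".
        /^\[/ { section = $0 }
        # Only the Failure line inside [Counts] is a number; the top-level
        # "Failure = []" line appears before any section header.
        section == "[Counts]" && $1 == "Failure" { print $3; found = 1 }
        END { if (!found) exit 1 }
    '
}

# Possible usage on an NCN:
#   cray hsm locks lock create --role Management --processing-model rigid \
#       | lock_failure_count
```

A printed value of 0 means every targeted component was locked. The helper returns a non-zero status if no [Counts] section was found.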
Lock specific management nodes (a single node in this example).
ncn# cray hsm locks lock create --role Management --component-ids x3000c0s6b0n0 --processing-model rigid
Example output:
Failure = []
[Counts]
Total = 1
Success = 1
Failure = 0
[Success]
ComponentIDs = [ "x3000c0s6b0n0",]
Lock the BMC of those nodes.
NOTE: The BMC of ncn-m001 typically does not exist in HSM under HSM State Components, and therefore cannot be locked.
Remove n0 from all of the xnames to get a list of the NodeBMCs that need to be locked.
ncn# cray hsm locks lock create --component-ids x3000c0s6b0 --processing-model rigid
Example output:
Failure = []
[Counts]
Total = 1
Success = 1
Failure = 0
[Success]
ComponentIDs = [ "x3000c0s6b0",]
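The step of removing n0 from each xname can be sketched as a small shell function. The name node_xnames_to_bmcs is hypothetical; the function only transforms text and does not query HSM, so it will not notice a BMC that is absent from HSM (such as that of ncn-m001).

```shell
# Hypothetical helper: turn node xnames into a comma-separated list of
# NodeBMC xnames suitable for --component-ids.
node_xnames_to_bmcs() {
    # Strip the trailing node suffix (n0, n1, ...), drop duplicate BMCs,
    # and join the result with commas.
    printf '%s\n' "$@" | sed 's/n[0-9][0-9]*$//' | sort -u | paste -sd, -
}

# Possible usage:
#   node_xnames_to_bmcs x3000c0s6b0n0 x3000c0s5b0n0
```

Anchoring the pattern at the end of the xname keeps the substitution from touching anything but the node suffix.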
Use the cray hsm locks unlock command to perform unlocking.
NOTE: When unlocking NCNs, you must unlock their NodeBMCs as well.
Unlock the management nodes.
ncn# cray hsm locks unlock create --role Management --processing-model rigid
Example output:
Failure = []
[Counts]
Total = 8
Success = 8
Failure = 0
[Success]
ComponentIDs = [ "x3000c0s7b0n0", "x3000c0s6b0n0", "x3000c0s3b0n0", "x3000c0s2b0n0", "x3000c0s9b0n0", "x3000c0s8b0n0", "x3000c0s5b0n0", "x3000c0s4b0n0",]
Unlock the BMCs of those management nodes.
ncn# cray hsm locks unlock create --component-ids \
$(cray hsm state components list --role management --type node --format json | jq '.Components[].ID' |
sed 's/n[0-9]*//;s/"//g' | tr '\n' ',' | sed 's/.$//')
Example output:
Failure = []
[Counts]
Total = 8
Success = 8
Failure = 0
[Success]
ComponentIDs = [ "x3000c0s5b0", "x3000c0s4b0", "x3000c0s7b0", "x3000c0s6b0", "x3000c0s3b0", "x3000c0s2b0", "x3000c0s9b0", "x3000c0s8b0",]
Unlock specific management nodes (a single node in this example).
ncn# cray hsm locks unlock create --role Management --component-ids x3000c0s6b0n0 --processing-model rigid
Example output:
Failure = []
[Counts]
Total = 1
Success = 1
Failure = 0
[Success]
ComponentIDs = [ "x3000c0s6b0n0",]
Unlock the BMCs of those management nodes.
Remove n0 from all of the xnames to get a list of the NodeBMCs that need to be unlocked.
ncn# cray hsm locks unlock create --component-ids x3000c0s6b0 --processing-model rigid
Example output:
Failure = []
[Counts]
Total = 1
Success = 1
Failure = 0
[Success]
ComponentIDs = [ "x3000c0s6b0",]
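As noted earlier, a management NCN must be unlocked before a power cycle, reset, or firmware update, and locked again afterwards. The sketch below prints that command sequence for one node so it can be reviewed before anything is run. maintenance_plan is a hypothetical helper name; the commands it prints are the single-node lock and unlock commands used in this section.

```shell
# Hypothetical helper: print the unlock / relock commands that bracket a
# maintenance operation on one management node. Nothing is executed.
maintenance_plan() {
    local node=$1
    local bmc
    # Derive the NodeBMC xname by stripping the trailing node suffix (n0, n1, ...).
    bmc=$(printf '%s' "$node" | sed 's/n[0-9][0-9]*$//')

    echo "cray hsm locks unlock create --component-ids $node --processing-model rigid"
    echo "cray hsm locks unlock create --component-ids $bmc --processing-model rigid"
    echo "# ... perform the power, reset, or firmware operation here ..."
    echo "cray hsm locks lock create --component-ids $node --processing-model rigid"
    echo "cray hsm locks lock create --component-ids $bmc --processing-model rigid"
}

# Possible usage:
#   maintenance_plan x3000c0s6b0n0
```

Because the BMC of ncn-m001 typically does not exist in HSM, the BMC lines printed for that node would be expected to fail and should be skipped.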