This procedure removes a liquid-cooled blade from an HPE Cray EX system.
The Cray command line interface (CLI) tool is initialized and configured on the system. See Configure the Cray CLI.
Knowledge of whether Data Virtualization Service (DVS) is operating over the Node Management Network (NMN) or the High Speed Network (HSN).
The Slingshot fabric must be configured with the desired topology for the desired state of the blades in the system.
The System Layout Service (SLS) must have the desired HSN configuration.
Check the status of the HSN and record link status before the procedure.
The blades must have the coolant drained and filled during the swap to minimize cross-contamination of cooling systems.
Use the workload manager (WLM) to drain running jobs from the affected nodes on the blade.
Refer to the vendor documentation for the WLM for more information.
(ncn-mw#) Use Boot Orchestration Services (BOS) to shut down the affected nodes in the source blade.
In this example, x9000c3s0 is the source blade. Specify the appropriate component name (xname) and BOS template for the node type in the following command.
BOS_TEMPLATE=cos-2.0.30-slurm-healthy-compute
cray bos v2 sessions create --template-name $BOS_TEMPLATE --operation shutdown --limit x9000c3s0b0n0,x9000c3s0b0n1,x9000c3s0b1n0,x9000c3s0b1n1
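The --limit argument above lists every node on the blade by hand. As a minimal sketch, the list could be generated from the slot xname instead. The helper name node_limit_for_slot is hypothetical, and the layout of two node BMCs (b0, b1) with two nodes each (n0, n1) is an assumption taken from the example command, so adjust it for other blade types.

```shell
# Hypothetical helper: build a comma-separated node list for a blade slot.
# Assumes two node BMCs (b0, b1) with two nodes each (n0, n1), as in the
# example command; other blade types may differ.
node_limit_for_slot() {
    slot=$1
    list=""
    for bmc in 0 1; do
        for node in 0 1; do
            # Prepend a comma only when the list already has an entry.
            list="${list:+$list,}${slot}b${bmc}n${node}"
        done
    done
    printf '%s\n' "$list"
}

# Usage (not executed here):
#   cray bos v2 sessions create --template-name "$BOS_TEMPLATE" \
#       --operation shutdown --limit "$(node_limit_for_slot x9000c3s0)"
```

Generating the list from one value reduces the chance of a typo in one of the four xnames.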
(ncn-mw#) Set the environment variable SLOT to the blade’s location.
SLOT="x9000c3s0"
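Later steps build BMC and node names from SLOT, so a malformed value would go unnoticed until a command fails. The following is a minimal sketch of a sanity check. The function name is hypothetical, and the pattern x<cabinet>c<chassis>s<slot> is inferred from the examples in this procedure.

```shell
# Hypothetical check: accept only slot-level xnames such as x9000c3s0.
# The pattern is inferred from the examples in this procedure.
is_slot_xname() {
    printf '%s\n' "$1" | grep -Eq '^x[0-9]+c[0-9]+s[0-9]+$'
}

SLOT="x9000c3s0"
if is_slot_xname "$SLOT"; then
    echo "SLOT looks valid: $SLOT"
else
    echo "SLOT is not a slot xname: $SLOT" >&2
fi
```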
(ncn-mw#) Clear the Redfish event subscriptions.
export TOKEN=$(curl -s -S -d grant_type=client_credentials \
-d client_id=admin-client \
-d client_secret=`kubectl get secrets admin-client-auth -o jsonpath='{.data.client-secret}' | base64 -d` \
https://api-gw-service-nmn.local/keycloak/realms/shasta/protocol/openid-connect/token | jq -r '.access_token')
for BMC in $(cray hsm inventory redfishEndpoints list --type NodeBMC --format json | jq -r '.RedfishEndpoints[].ID' | grep "${SLOT}"); do
/usr/share/doc/csm/scripts/operations/node_management/delete_bmc_subscriptions.py "$BMC"
done
Each BMC on the blade will have output like the following:
Clearing subscriptions from NodeBMC x3000c0s9b0
Retrieving BMC credentials from SCSD
Retrieving Redfish Event subscriptions from the BMC: https://x3000c0s9b0/redfish/v1/EventService/Subscriptions
Deleting event subscription: https://x3000c0s9b0/redfish/v1/EventService/Subscriptions/1
Successfully deleted https://x3000c0s9b0/redfish/v1/EventService/Subscriptions/1
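The loop above selects BMCs by matching the slot name anywhere in the ID. A stricter filter is sketched below. It assumes node BMC xnames have the form <slot>b<number>, as in the examples in this procedure, and the function name bmcs_on_slot is hypothetical.

```shell
# Hypothetical filter: read BMC xnames on stdin and print only those that are
# node BMCs on the given slot. Anchoring the pattern avoids partial matches,
# for example a hypothetical x9000c3s10b0 when the slot is x9000c3s1.
bmcs_on_slot() {
    grep -E "^${1}b[0-9]+\$"
}

# Small demonstration with made-up IDs; only the first one is on slot x9000c3s1.
printf '%s\n' x9000c3s1b0 x9000c3s10b0 | bmcs_on_slot x9000c3s1
```

In the loop, this filter could take the place of the grep stage, for example `... | jq -r '.RedfishEndpoints[].ID' | bmcs_on_slot "$SLOT"`.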
(ncn-mw#) Temporarily disable the Redfish endpoints for NodeBMCs present in the blade.
cray hsm inventory redfishEndpoints update --enabled false x9000c3s0b0 --id x9000c3s0b0 --hostname x9000c3s0b0
cray hsm inventory redfishEndpoints update --enabled false x9000c3s0b1 --id x9000c3s0b1 --hostname x9000c3s0b1
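Each disable command repeats the same BMC xname three times, which is easy to mistype. As a sketch, the commands could be printed for review before running them. The helper name is hypothetical, and it assumes the blade has exactly the node BMCs b0 and b1 shown above.

```shell
# Hypothetical helper: print (not run) the disable command for each node BMC.
# Assumes the blade has node BMCs b0 and b1, as in the commands above.
print_disable_cmds() {
    slot=$1
    for bmc in "${slot}b0" "${slot}b1"; do
        printf 'cray hsm inventory redfishEndpoints update --enabled false %s --id %s --hostname %s\n' \
            "$bmc" "$bmc" "$bmc"
    done
}

print_disable_cmds x9000c3s0
```

After checking the printed lines, they can be pasted into the shell or piped to `sh`.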
(ncn-mw#) Remove the system-specific settings from each node controller on the blade.
curl -k -u root:PASSWORD -X POST -H \
'Content-Type: application/json' -d '{"ResetType":"StatefulReset"}' \
https://x9000c3s0b0/redfish/v1/Managers/BMC/Actions/Manager.Reset
curl -k -u root:PASSWORD -X POST -H \
'Content-Type: application/json' -d '{"ResetType":"StatefulReset"}' \
https://x9000c3s0b1/redfish/v1/Managers/BMC/Actions/Manager.Reset
Use Ctrl-C to return to the prompt if the command does not return.
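Because the reset request may not return, one option is to let curl give up by itself. The sketch below wraps the same request in a loop with curl's --max-time option. The function names, the PASSWORD environment variable (standing in for the PASSWORD placeholder above), and the 30-second limit are all assumptions.

```shell
# Hypothetical helper: build the Manager.Reset URL for a node BMC xname.
reset_url() {
    printf 'https://%s/redfish/v1/Managers/BMC/Actions/Manager.Reset\n' "$1"
}

# Send a StatefulReset to each BMC, giving up after 30 seconds instead of
# waiting for Ctrl-C. PASSWORD is a placeholder for the BMC root password;
# the 30-second limit is an arbitrary assumption.
stateful_reset() {
    for bmc in "$@"; do
        curl -k -u "root:${PASSWORD}" --max-time 30 -X POST \
            -H 'Content-Type: application/json' \
            -d '{"ResetType":"StatefulReset"}' \
            "$(reset_url "$bmc")" || echo "No response from $bmc (continuing)" >&2
    done
}

# Usage (not executed here):
#   PASSWORD=... stateful_reset x9000c3s0b0 x9000c3s0b1
```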
(ncn-mw#) Suspend the hms-discovery cron job.
kubectl -n services patch cronjobs hms-discovery -p '{"spec" : {"suspend" : true }}'
(ncn-mw#) Verify that the hms-discovery cron job has stopped (ACTIVE = 0 and SUSPEND = True).
kubectl get cronjobs -n services hms-discovery
Example output:
NAME SCHEDULE SUSPEND ACTIVE LAST SCHEDULE AGE
hms-discovery */3 * * * * True 0 117s 15d
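The check can also be scripted rather than read by eye. The following is a minimal sketch that parses the table output shown above. The function name is hypothetical, and it assumes the LAST SCHEDULE and AGE columns are each a single token, as in the example output.

```shell
# Hypothetical check: read `kubectl get cronjobs` output on stdin and succeed
# only if the hms-discovery row shows SUSPEND=True and ACTIVE=0.
# Columns are counted from the end of the row because the SCHEDULE column
# itself contains spaces.
cronjob_is_stopped() {
    awk '$1 == "hms-discovery" {
             found = 1
             ok = ($(NF-3) == "True" && $(NF-2) == "0")
         }
         END { exit (found && ok) ? 0 : 1 }'
}

# Usage (not executed here):
#   kubectl get cronjobs -n services hms-discovery | cronjob_is_stopped \
#       && echo "hms-discovery is suspended and idle"
```

If ACTIVE is still 1, a discovery job that started before the suspend is probably finishing. Rerun the check after a short wait.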
(ncn-mw#) Power off the chassis slot.
This example powers off slot 0, chassis 3, in cabinet 9000.
cray power transition off --xnames x9000c3s0 --include children
(ncn-mw#) Disable the chassis slot.
Disabling the slot prevents hms-discovery from automatically powering on the slot. This example disables slot 0, chassis 3, in cabinet 9000.
cray hsm state components enabled update --enabled false x9000c3s0
IMPORTANT: Record the NMN MAC and IP addresses for each node in the blade (labeled Node Maintenance Network). To prevent disruption in DVS when operating over the NMN, these addresses must be maintained in the HSM when the blade is swapped and discovered.
The NodeBMC MAC and IP addresses are assigned algorithmically and must not be deleted from the HSM.
(ncn-mw#) Skip this step if DVS is operating over the HSN, otherwise proceed with this step. Query HSM to determine the ComponentID, MAC addresses, and IP addresses for each node in the blade.
The prerequisites show an example of how to gather HSM values and store them to a file.
cray hsm inventory ethernetInterfaces list --component-id x9000c3s0b0n0 --format json
Example output:
[
{
"ID": "0040a6836339",
"Description": "Node Maintenance Network",
"MACAddress": "00:40:a6:83:63:39",
"LastUpdate": "2021-04-09T21:51:04.662063Z",
"ComponentID": "x9000c3s0b0n0",
"Type": "Node",
"IPAddresses": [
{
"IPAddress": "10.100.0.10"
}
]
}
]
Record the following values for the blade:
`ComponentID: "x9000c3s0b0n0"`
`MACAddress: "00:40:a6:83:63:39"`
`IPAddress: "10.100.0.10"`
Repeat the command to record the ComponentID, MAC addresses, and IP addresses for the Node Maintenance Network for the other nodes in the blade.
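Recording the values for every node can be scripted. The sketch below reduces the JSON shown above to one line per address using jq, which this procedure already relies on. The function name and the output file name are hypothetical, and the node list assumes the four-node blade used in the examples.

```shell
# Hypothetical helper: read ethernetInterfaces JSON on stdin and print one
# "ComponentID MACAddress IPAddress" line per Node Maintenance Network address.
summarize_nmn_interfaces() {
    jq -r '.[] | select(.Description == "Node Maintenance Network")
           | .ComponentID as $c | .MACAddress as $m
           | .IPAddresses[] | "\($c) \($m) \(.IPAddress)"'
}

# Usage (not executed here); the file name is an arbitrary choice:
#   for node in x9000c3s0b0n0 x9000c3s0b0n1 x9000c3s0b1n0 x9000c3s0b1n1; do
#       cray hsm inventory ethernetInterfaces list --component-id "$node" --format json |
#           summarize_nmn_interfaces
#   done > blade_nmn_addresses.txt
```

Keeping the raw JSON alongside the summary is a reasonable precaution, since it preserves fields the summary drops.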
(ncn-mw#) Set an environment variable that corresponds to the chassis slot of the blade.
CHASSIS_SLOT=x9000c3s0
(ncn-mw#) Delete the Redfish endpoints for each node.
for xname in $(cray hsm inventory redfishEndpoints list --format json |
jq -r --arg CHASSIS_SLOT "${CHASSIS_SLOT}" \
'.RedfishEndpoints[] | select(.ID | startswith($CHASSIS_SLOT)) | .ID')
do
echo "Removing $xname from HSM Inventory RedfishEndpoints"
cray hsm inventory redfishEndpoints delete "$xname"
done
(ncn-mw#) Remove entries from the state components.
for xname in $(cray hsm state components list --format json |
jq -r --arg CHASSIS_SLOT "${CHASSIS_SLOT}" \
'.Components[] | select((.ID | startswith($CHASSIS_SLOT)) and (.ID != $CHASSIS_SLOT)) | .ID' )
do
echo "Removing $xname from HSM State components"
cray hsm state components delete "$xname"
done
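The jq filter in the loop above keeps IDs that start with the slot xname but skips the slot itself, presumably so that the disabled slot component stays in HSM. The same selection rule is sketched here in plain shell for illustration only. The function name is hypothetical.

```shell
# Pure-shell equivalent of the jq selection used above, for illustration:
# succeed for IDs that start with the slot xname but are not the slot itself.
is_child_of_slot() {
    slot=$1
    id=$2
    case "$id" in
        "$slot")   return 1 ;;   # the slot component itself is not selected
        "$slot"*)  return 0 ;;   # node BMCs, nodes, and other children
        *)         return 1 ;;   # components elsewhere in the system
    esac
}

# Example: x9000c3s0b0n0 is selected, x9000c3s0 is not.
for id in x9000c3s0 x9000c3s0b0 x9000c3s0b0n0 x9000c3s1b0; do
    if is_child_of_slot x9000c3s0 "$id"; then
        echo "would remove $id"
    fi
done
```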
(ncn-mw#) Delete the NMN MAC and IP addresses for each node in the blade from the HSM.
Do not delete the MAC and IP addresses for the node BMC.
for mac in $(cray hsm inventory ethernetInterfaces list --type Node --format json |
jq -r --arg CHASSIS_SLOT "${CHASSIS_SLOT}" \
'.[] | select(.ComponentID | startswith($CHASSIS_SLOT)) | .ID')
do
echo "Removing $mac from HSM Inventory EthernetInterfaces"
cray hsm inventory ethernetInterfaces delete "$mac"
done
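The IDs printed by this loop are MAC addresses without separators, as in the earlier example output (ID 0040a6836339 for MAC 00:40:a6:83:63:39). A small helper, assuming that convention holds for every entry, can convert a recorded MAC address so it can be compared with the IDs being removed. The function name is hypothetical.

```shell
# Hypothetical helper: convert a colon-separated MAC address to the ID form
# seen in the example output (lowercase, no separators).
mac_to_hsm_id() {
    printf '%s\n' "$1" | tr -d ':' | tr 'A-Z' 'a-z'
}

mac_to_hsm_id "00:40:a6:83:63:39"
```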
(ncn-mw#) Restart Kea.
kubectl delete pods -n services -l app.kubernetes.io/name=cray-dhcp-kea
Remove the blade from the source location.
Drain the coolant from the blade and fill with fresh coolant to minimize cross-contamination of cooling systems.
Install the blade from the source system in a storage rack or leave it on the cart.
(ncn-mw#) Determine the name of the Chassis BMC.
CHASSIS_BMC="$(echo "$CHASSIS_SLOT" | grep -Eo 'x[0-9]+c[0-9]+')b0"
echo "$CHASSIS_BMC"
Example output:
x9000c3b0
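The command above extracts the cabinet and chassis part of the xname with a regular expression. An alternative sketch uses shell parameter expansion and rejects input that does not look like a slot xname. The function name is hypothetical, and the pattern check is deliberately loose.

```shell
# Hypothetical helper: derive the Chassis BMC xname from a slot xname by
# dropping the trailing s<slot> part and appending b0.
chassis_bmc_for_slot() {
    case "$1" in
        x[0-9]*c[0-9]*s[0-9]*)
            # ${1%s*} removes the shortest suffix that starts with "s".
            printf '%sb0\n' "${1%s*}" ;;
        *)
            echo "not a slot xname: $1" >&2
            return 1 ;;
    esac
}

chassis_bmc_for_slot x9000c3s0
```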
(ncn-mw#) Rediscover the Chassis BMC.
cray hsm inventory discover create --xnames $CHASSIS_BMC
hms-discovery cronjob
(ncn-mw#) Un-suspend the hms-discovery cron job if no more liquid-cooled blades are planned to be removed from the system.
kubectl -n services patch cronjobs hms-discovery -p '{"spec" : {"suspend" : false }}'
(ncn-mw#) Verify that the hms-discovery cron job is no longer suspended (SUSPEND = False). ACTIVE may show 0 or 1, depending on whether a discovery job is running at that moment.
kubectl get cronjobs -n services hms-discovery
Example output:
NAME SCHEDULE SUSPEND ACTIVE LAST SCHEDULE AGE
hms-discovery */3 * * * * False 1 46s 15d