Replace an HPE Cray EX liquid-cooled compute blade.
Temporarily disable the endpoint discovery service (MEDS) for the compute node(s) being replaced.
This example disables MEDS for the compute node in cabinet 1000, chassis 3, slot 0 (x1000c3s0b0). If the blade has more than one node card, specify each node card (x1000c3s0b0,x1000c3s0b1).
ncn-mw# cray hsm inventory redfishEndpoints update --enabled false x1000c3s0b0
Verify that the workload manager (WLM) is not using the affected nodes.
Use Boot Orchestration Services (BOS) to shut down the affected nodes. Specify the appropriate BOS template for the node type.
ncn-mw# cray bos session create --template-uuid BOS_TEMPLATE \
--operation shutdown --limit x1000c3s0b0n0,x1000c3s0b0n1,x1000c3s0b1n0,x1000c3s0b1n1
Specify all the nodes in the blade using a comma-separated list. This example shows the command to shut down an EX425 compute blade (Windom) in cabinet 1000, chassis 3, slot 0. This blade type includes two node cards, each with two logical nodes (4 processors).
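The comma-separated `--limit` list can be generated rather than typed by hand. The following is a minimal sketch, not part of the product tooling: `blade_node_xnames` is a hypothetical helper name, and it assumes node cards and nodes are numbered from 0, as in the example xnames above.

```shell
# Print the comma-separated node xnames for one blade slot.
# Usage: blade_node_xnames SLOT_XNAME NODE_CARDS NODES_PER_CARD
# Hypothetical helper; assumes node cards (bN) and nodes (nN) count from 0.
blade_node_xnames() {
    local slot="$1"
    local cards="$2"
    local nodes="$3"
    local list=""
    local b=0
    local n
    while [ "$b" -lt "$cards" ]; do
        n=0
        while [ "$n" -lt "$nodes" ]; do
            # Prefix a comma only when the list already has an entry.
            list="${list:+$list,}${slot}b${b}n${n}"
            n=$((n + 1))
        done
        b=$((b + 1))
    done
    printf '%s\n' "$list"
}
```

For the blade in this example, `blade_node_xnames x1000c3s0 2 2` prints the same four-node list used in the BOS command above.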
Disable the chassis slot in the Hardware State Manager (HSM).
This example shows cabinet 1000, chassis 3, slot 0 (x1000c3s0).
ncn-mw# cray hsm state components enabled update --enabled false x1000c3s0
Disabling the slot prevents hms-discovery from attempting to automatically power on slots. If the slot automatically powers on after using CAPMC to power the slot off, then temporarily suspend the hms-discovery cron job in Kubernetes:
Suspend the hms-discovery cron job to prevent slot power on.
ncn-mw# kubectl -n services patch cronjobs hms-discovery -p '{"spec" : {"suspend" : true }}'
Verify that the hms-discovery cron job has stopped (ACTIVE column = 0).
ncn-mw# kubectl get cronjobs -n services hms-discovery
Example output:
NAME            SCHEDULE      SUSPEND   ACTIVE   LAST SCHEDULE   AGE
hms-discovery   */3 * * * *   True      0        117s            15d
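Instead of reading the table by eye, the suspend flag can be checked directly. This is a sketch under the assumption that `kubectl` is configured on the node as in the commands above; `hms_discovery_suspended` is a hypothetical function name. It reads `.spec.suspend` through the standard `-o jsonpath` output option.

```shell
# Return 0 only when the hms-discovery cron job reports suspend=true.
# Returns 2 if kubectl itself fails, 1 if the job is not suspended.
hms_discovery_suspended() {
    local state
    state=$(kubectl -n services get cronjobs hms-discovery \
        -o jsonpath='{.spec.suspend}') || return 2
    [ "$state" = "true" ]
}
```

A possible use is `hms_discovery_suspended && echo "hms-discovery is suspended"` before powering off the slot.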
Use CAPMC to power off slot 0 in chassis 3.
ncn-mw# cray capmc xname_off create --xnames x1000c3s0 --recursive true --format json
Delete the node Ethernet interface MAC addresses and the Redfish endpoint from the Hardware State Manager (HSM).
IMPORTANT: The HSM stores the node’s BMC NIC MAC addresses for the hardware management network and the node’s Ethernet NIC MAC addresses for the node management network. The MAC addresses for the node NICs must be updated in the DHCP/DNS configuration when a liquid-cooled blade is replaced. Their entries must be deleted from the HSM Ethernet interfaces table and be rediscovered. The BMC NIC MAC addresses for liquid-cooled blades are assigned algorithmically and should not be deleted from the HSM.
For each node, delete the node’s NIC MAC addresses from the HSM Ethernet interfaces table.
Query HSM to determine the node’s NIC MAC addresses associated with the blade in cabinet 1000, chassis 3, slot 0, node card 0, node 0.
ncn-mw# cray hsm inventory ethernetInterfaces list --component-id x1000c3s0b0n0 --format json
Example output:
[
{
"ID": "b42e99be1a2b",
"Description": "Ethernet Interface Lan1",
"MACAddress": "b4:2e:99:be:1a:2b",
"LastUpdate": "2021-01-27T00:07:08.658927Z",
"ComponentID": "x1000c3s0b0n0",
"Type": "Node",
"IPAddresses": [
{
"IPAddress": "10.252.1.26"
}
]
},
{
"ID": "b42e99be1a2c",
"Description": "Ethernet Interface Lan2",
"MACAddress": "b4:2e:99:be:1a:2c",
"LastUpdate": "2021-01-26T22:43:10.593193Z",
"ComponentID": "x1000c3s0b0n0",
"Type": "Node",
"IPAddresses": []
}
]
Delete each node’s NIC MAC address in the Hardware State Manager (HSM) Ethernet interfaces table.
ncn-mw# cray hsm inventory ethernetInterfaces delete b42e99be1a2b
ncn-mw# cray hsm inventory ethernetInterfaces delete b42e99be1a2c
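When a blade has several nodes, the delete commands can be generated from the query output. The sketch below only prints commands for review and deletes nothing; `print_mac_delete_commands` is a hypothetical helper, and it assumes the JSON layout shown in the example output, where each interface has a top-level `"ID"` key. Feed it only the output of a query limited to a node xname, because the BMC MAC addresses must not be deleted.

```shell
# Read "ethernetInterfaces list --format json" output on stdin and print
# one delete command per interface ID. Nothing is deleted by this function.
print_mac_delete_commands() {
    # Match only the exact key "ID"; "ComponentID" is not matched because
    # the pattern requires a quote immediately before ID.
    grep -o '"ID": *"[^"]*"' |
        sed 's/.*"\([^"]*\)"$/\1/' |
        while IFS= read -r id; do
            printf 'cray hsm inventory ethernetInterfaces delete %s\n' "$id"
        done
}
```

For example, `cray hsm inventory ethernetInterfaces list --component-id x1000c3s0b0n0 --format json | print_mac_delete_commands` prints the two delete commands shown above.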
Delete the Redfish endpoint for the removed node.
Replace the blade hardware.
Review the Remove a Compute Blade Using the Lift procedure in HPE Cray EX Hardware Replacement Procedures H-6173 at HPE Support for detailed instructions.
CAUTION: Always power off the chassis slot or device before removal. The best practice is to unlatch and unseat the device while the coolant hoses are still connected, then disconnect the coolant hoses. If this is not possible, disconnect the coolant hoses, then quickly unlatch/unseat the device (within 10 seconds). Failure to do so may damage the equipment.
Un-suspend the hms-discovery cron job in Kubernetes.
ncn-mw# kubectl -n services patch cronjobs hms-discovery -p '{"spec" : {"suspend" : false }}'
ncn-mw# kubectl get cronjobs.batch -n services hms-discovery
Example output:
NAME            SCHEDULE      SUSPEND   ACTIVE   LAST SCHEDULE   AGE
hms-discovery   */3 * * * *   False     1        41s             33d
ncn-mw# kubectl -n services logs hms-discovery-1600117560-5w95d hms-discovery | grep "Mountain discovery finished" | jq '.discoveredXnames'
Example output:
[
"x1000c3s0b0"
]
Enable MEDS for the compute nodes in the blade.
ncn-mw# cray hsm inventory redfishEndpoints update --enabled true --rediscover-on-update true x1000c3s0b0 --format toml
The updated component names (xnames) will be returned. Example output:
x1000c3s0b0
Wait for 3-5 minutes for the blade to power on and the node BMCs to be discovered.
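Rather than waiting a fixed time, the wait can be expressed as a polling loop. This is a sketch only: `wait_for_discovery` is a hypothetical helper, the 300-second timeout and 15-second interval are assumed defaults, and it relies on the `LastDiscoveryStatus` field that the `redfishEndpoints describe` command reports in JSON format.

```shell
# Poll the Redfish endpoint until HSM reports DiscoverOK or the timeout expires.
# Usage: wait_for_discovery XNAME [TIMEOUT_SECONDS] [INTERVAL_SECONDS]
# Use an interval of at least 1 second so the loop can reach the timeout.
wait_for_discovery() {
    local xname="$1"
    local timeout="${2:-300}"
    local interval="${3:-15}"
    local elapsed=0
    while [ "$elapsed" -le "$timeout" ]; do
        if cray hsm inventory redfishEndpoints describe "$xname" --format json 2>/dev/null |
            grep -q '"LastDiscoveryStatus": *"DiscoverOK"'; then
            echo "$xname: DiscoverOK after ${elapsed}s"
            return 0
        fi
        sleep "$interval"
        elapsed=$((elapsed + interval))
    done
    echo "$xname: not discovered within ${timeout}s" >&2
    return 1
}
```

For example, `wait_for_discovery x1000c3s0b0` waits up to five minutes for the node BMC in this procedure.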
Verify that the affected nodes are enabled in the HSM.
ncn-mw# cray hsm state components describe x1000c3s0b0n0 --format toml
Beginning of example output:
Type = "Node"
Enabled = true
State = "Off"
Verify the BMCs have been discovered by the HSM.
ncn-mw# cray hsm inventory redfishEndpoints describe x1000c3s0b0 --format json
Example output:
{
"ID": "x1000c3s0b0",
"Type": "NodeBMC",
"Hostname": "x1000c3s0b0",
"Domain": "",
"FQDN": "x1000c3s0b0",
"Enabled": true,
"UUID": "e005dd6e-debf-0010-e803-b42e99be1a2d",
"User": "root",
"Password": "",
"MACAddr": "b42e99be1a2d",
"RediscoverOnUpdate": true,
"DiscoveryInfo": {
"LastDiscoveryAttempt": "2021-01-29T16:15:37.643327Z",
"LastDiscoveryStatus": "DiscoverOK",
"RedfishVersion": "1.7.0"
}
}
If LastDiscoveryStatus displays as DiscoverOK, the node BMC has been successfully discovered.
If the last discovery state is DiscoveryStarted, then the BMC is currently being inventoried by HSM.
If the last discovery state is HTTPsGetFailed or ChildVerificationFailed, then an error has occurred during the discovery process.
Enable each node individually in the HSM database (in this example, the nodes are x1000c3s0b0n0-n3).
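Enabling the nodes one at a time can be sketched as a loop. The example assumes that the `cray hsm state components enabled update` subcommand used earlier to disable the slot also accepts node xnames with `--enabled true`; `enable_nodes` is a hypothetical helper name.

```shell
# Enable each node xname passed as an argument in HSM.
# Stops at the first failure so that a mistyped xname is noticed.
enable_nodes() {
    local node
    for node in "$@"; do
        cray hsm state components enabled update --enabled true "$node" || return 1
    done
}
```

For the EX425 blade in this procedure, a possible call is `enable_nodes x1000c3s0b0n0 x1000c3s0b0n1 x1000c3s0b1n0 x1000c3s0b1n1`.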
Optional: Force rediscovery of the components in the chassis (the example shows cabinet 1000, chassis 3).
ncn-mw# cray hsm inventory discover create --xnames x1000c3
Optional: Verify that discovery has completed (LastDiscoveryStatus = "DiscoverOK").
ncn-mw# cray hsm inventory redfishEndpoints describe x1000c3 --format toml
Example output:
Type = "ChassisBMC"
Domain = ""
MACAddr = "02:13:88:03:00:00"
Enabled = true
Hostname = "x1000c3"
RediscoverOnUpdate = true
FQDN = "x1000c3"
User = "root"
Password = ""
IPAddress = "10.104.0.76"
ID = "x1000c3b0"
[DiscoveryInfo]
LastDiscoveryAttempt = "2020-09-03T19:03:47.989621Z"
RedfishVersion = "1.2.0"
LastDiscoveryStatus = "DiscoverOK"
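The status check can also be scripted against the TOML output. The sketch below uses two hypothetical helpers: one extracts the `LastDiscoveryStatus` value from `--format toml` output, and one maps the values named in this procedure to an outcome. Any other value is reported as unknown rather than guessed at.

```shell
# Extract the LastDiscoveryStatus value from "--format toml" output on stdin.
discovery_status_from_toml() {
    sed -n 's/^LastDiscoveryStatus = "\(.*\)"$/\1/p'
}

# Map a LastDiscoveryStatus value to the outcome described in this procedure.
classify_discovery_status() {
    case "$1" in
        DiscoverOK) echo "discovered" ;;
        DiscoveryStarted) echo "in progress" ;;
        HTTPsGetFailed | ChildVerificationFailed) echo "error" ;;
        *) echo "unknown" ;;
    esac
}
```

For example, `classify_discovery_status "$(cray hsm inventory redfishEndpoints describe x1000c3 --format toml | discovery_status_from_toml)"` prints `discovered` for the output shown above.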
Verify that the correct firmware versions are installed for node BIOS, node controller (nC), NIC mezzanine card (NMC), GPUs, and so on.
Optional: If necessary, update the firmware. Review the Firmware Action Service (FAS) documentation.
ncn-mw# cray fas actions create CUSTOM_DEVICE_PARAMETERS.json
Update the System Layout Service (SLS).
Dump the existing SLS configuration.
ncn-mw# cray sls networks describe HSN --format=json > existingHSN.json
Copy existingHSN.json to newHSN.json, edit newHSN.json with the changes, then run:
ncn-mw# curl -s -k -H "Authorization: Bearer ${TOKEN}" https://API_SYSTEM/apis/sls/v1/networks/HSN -X PUT -d @newHSN.json
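Before sending the PUT, it can help to confirm that the edited file actually differs from the dump and to review the differences. This is an optional sanity check, not part of the documented procedure; `sls_change_ready` is a hypothetical helper built on the standard `diff` exit codes (0 for identical, 1 for different, greater than 1 for trouble).

```shell
# Succeed only if NEW is non-empty and differs from OLD.
# The unified diff is printed so the change can be reviewed first.
# Usage: sls_change_ready OLD_FILE NEW_FILE
sls_change_ready() {
    local old="$1"
    local new="$2"
    if [ ! -s "$new" ]; then
        echo "$new is missing or empty" >&2
        return 2
    fi
    diff -u "$old" "$new"
    case $? in
        0) echo "no changes between $old and $new" >&2; return 1 ;;
        1) return 0 ;;
        *) return 2 ;;
    esac
}
```

For example, run `sls_change_ready existingHSN.json newHSN.json` and only issue the `curl` command when it succeeds.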
Reload DVS on NCNs.
Use boot orchestration to power on and boot the nodes.
Specify the appropriate BOS template for the node type.
ncn-mw# cray bos session create --template-uuid BOS_TEMPLATE --operation reboot \
--limit x1000c3s0b0n0,x1000c3s0b0n1,x1000c3s0b1n0,x1000c3s0b1n1