
Replace a Compute Blade

Replace an HPE Cray EX liquid-cooled compute blade.

Shut down software and power off the blade

Temporarily disable the endpoint discovery service (MEDS) for the compute node(s) being replaced. This example disables MEDS for node card 0 of the blade in cabinet 1000, chassis 3, slot 0 (x1000c3s0b0). If the blade has more than one node card, specify each node card (x1000c3s0b0,x1000c3s0b1).
```
ncn-mw# cray hsm inventory redfishEndpoints update --enabled false x1000c3s0b0
```
Verify that the workload manager (WLM) is not using the affected nodes.
Use Boot Orchestration Services (BOS) to shut down the affected nodes. Specify the appropriate BOS template for the node type.
```
ncn-mw# cray bos session create --template-uuid BOS_TEMPLATE \
             --operation shutdown --limit x1000c3s0b0n0,x1000c3s0b0n1,x1000c3s0b1n0,x1000c3s0b1n1
```
Specify all the nodes in the blade using a comma-separated list. This example shows the command to shut down an EX425 compute blade (Windom) in cabinet 1000, chassis 3, slot 0. This blade type includes two node cards, each with two logical nodes (4 processors).
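Typing the node list by hand is easy to get wrong, so the list can be generated instead. The following is a minimal sketch, not part of CSM: the function name `blade_node_xnames` is hypothetical, and the node card and node counts are parameters that must match the actual blade type.

```bash
# Hypothetical helper (not part of CSM): print a comma-separated list of
# node xnames for one blade slot.
#   $1 = slot xname (for example x1000c3s0)
#   $2 = number of node cards in the blade
#   $3 = number of nodes on each node card
blade_node_xnames() {
    local slot="$1"
    local cards="$2"
    local nodes="$3"
    local list=""
    local b=0
    local n
    while [ "$b" -lt "$cards" ]; do
        n=0
        while [ "$n" -lt "$nodes" ]; do
            # Prepend a comma only when the list already has an entry.
            list="${list:+$list,}${slot}b${b}n${n}"
            n=$((n + 1))
        done
        b=$((b + 1))
    done
    printf '%s\n' "$list"
}

# EX425 example from this procedure: two node cards, two nodes each.
blade_node_xnames x1000c3s0 2 2
# prints x1000c3s0b0n0,x1000c3s0b0n1,x1000c3s0b1n0,x1000c3s0b1n1
```

The output can be passed directly to the `--limit` option of the BOS command.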
Disable the chassis slot in the Hardware State Manager (HSM).

This example shows cabinet 1000, chassis 3, slot 0 (x1000c3s0).
```
ncn-mw# cray hsm state components enabled update --enabled false x1000c3s0
```
Disabling the slot prevents hms-discovery from attempting to automatically power on slots. If the slot automatically powers on after using CAPMC to power the slot off, then temporarily suspend the hms-discovery cron job in Kubernetes:
1. Suspend the hms-discovery cron job to prevent slot power on.
```
ncn-mw# kubectl -n services patch cronjobs hms-discovery -p '{"spec" : {"suspend" : true }}'
```
2. Verify that the hms-discovery cron job has stopped (ACTIVE column = 0).
```
ncn-mw# kubectl get cronjobs -n services hms-discovery
```
  Example output:
```
NAME SCHEDULE SUSPEND ACTIVE LAST SCHEDULE AGE
hms-discovery */3 * * * * True 0 117s 15d
```
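The check in step 2 can also be scripted. The sketch below is an illustration only: `cronjob_is_idle` is a hypothetical name, and it assumes the table layout shown in the example output, where the last two columns (LAST SCHEDULE and AGE) are single tokens.

```bash
# Hypothetical helper: read `kubectl get cronjobs` table output on stdin and
# succeed only if the named cron job is suspended and has no active jobs.
# The SCHEDULE column contains spaces, so SUSPEND and ACTIVE are located by
# counting fields from the right: ... SUSPEND ACTIVE LAST-SCHEDULE AGE
cronjob_is_idle() {
    awk -v name="$1" '
        $1 == name { found = 1; ok = ($(NF-3) == "True" && $(NF-2) == "0") }
        END { exit (found && ok) ? 0 : 1 }
    '
}

# Usage on a management node:
#   kubectl get cronjobs -n services hms-discovery | cronjob_is_idle hms-discovery \
#       && echo "hms-discovery is suspended and idle"
```

The function exits non-zero if the cron job is missing from the input, is not suspended, or still has active jobs.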

Use CAPMC to power off slot 0 in chassis 3.

```
ncn-mw# cray capmc xname_off create --xnames x1000c3s0 --recursive true --format json
```

Delete the HSM entries

Delete the node Ethernet interface MAC addresses and the Redfish endpoint from the Hardware State Manager (HSM).

IMPORTANT: The HSM stores the node’s BMC NIC MAC addresses for the hardware management network and the node’s Ethernet NIC MAC addresses for the node management network. The MAC addresses for the node NICs must be updated in the DHCP/DNS configuration when a liquid-cooled blade is replaced. Their entries must be deleted from the HSM Ethernet interfaces table and be rediscovered. The BMC NIC MAC addresses for liquid-cooled blades are assigned algorithmically and should not be deleted from the HSM.

For each node, delete the node’s NIC MAC addresses from the HSM Ethernet interfaces table.

Query HSM to determine the node’s NIC MAC addresses associated with the blade in cabinet 1000, chassis 3, slot 0, node card 0, node 0.

```
ncn-mw# cray hsm inventory ethernetInterfaces list --component-id x1000c3s0b0n0 --format json
```

Example output:

```
[
    {
        "ID": "b42e99be1a2b",
        "Description": "Ethernet Interface Lan1",
        "MACAddress": "b4:2e:99:be:1a:2b",
        "LastUpdate": "2021-01-27T00:07:08.658927Z",
        "ComponentID": "x1000c3s0b0n0",
        "Type": "Node",
        "IPAddresses": [
            {
                "IPAddress": "10.252.1.26"
            }
        ]
    },
    {
        "ID": "b42e99be1a2c",
        "Description": "Ethernet Interface Lan2",
        "MACAddress": "b4:2e:99:be:1a:2c",
        "LastUpdate": "2021-01-26T22:43:10.593193Z",
        "ComponentID": "x1000c3s0b0n0",
        "Type": "Node",
        "IPAddresses": []
    }
]
```

Delete each node’s NIC MAC address in the Hardware State Manager (HSM) Ethernet interfaces table.

```
ncn-mw# cray hsm inventory ethernetInterfaces delete b42e99be1a2b
ncn-mw# cray hsm inventory ethernetInterfaces delete b42e99be1a2c
```
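When a blade has several nodes, the interface IDs can be pulled out of the `list` output rather than copied by hand. The sketch below is illustrative: `hsm_interface_ids` is a hypothetical name, and it assumes the JSON is pretty-printed with one key per line, as in the example output. Where `jq` is available, `jq -r '.[].ID'` is a more robust alternative.

```bash
# Hypothetical helper: read the JSON printed by
# `cray hsm inventory ethernetInterfaces list ... --format json` on stdin and
# print one interface ID per line. The pattern requires the key to be exactly
# "ID", so "ComponentID" lines are not matched.
hsm_interface_ids() {
    sed -n 's/^[[:space:]]*"ID":[[:space:]]*"\([^"]*\)".*/\1/p'
}

# Usage on a management node (review the IDs before deleting anything):
#   for id in $(cray hsm inventory ethernetInterfaces list \
#                   --component-id x1000c3s0b0n0 --format json | hsm_interface_ids); do
#       cray hsm inventory ethernetInterfaces delete "$id"
#   done
```

Repeat the loop for each node xname in the blade.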

Delete the Redfish endpoint for the removed node.
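This step does not list a command. The sketch below prints candidate commands for review instead of running them. It assumes that the `redfishEndpoints` resource accepts a `delete` subcommand in the same way `ethernetInterfaces` does; confirm this with `cray hsm inventory redfishEndpoints --help` before use. The function name `redfish_delete_cmds` is hypothetical.

```bash
# Hypothetical helper: print (do not run) one delete command per node card.
#   $1 = slot xname (for example x1000c3s0)
#   $2 = number of node cards in the blade
# ASSUMPTION: `redfishEndpoints delete` mirrors `ethernetInterfaces delete`.
redfish_delete_cmds() {
    local b=0
    while [ "$b" -lt "$2" ]; do
        printf 'cray hsm inventory redfishEndpoints delete %sb%s\n' "$1" "$b"
        b=$((b + 1))
    done
}

# Two node cards, as in the EX425 example used in this procedure.
redfish_delete_cmds x1000c3s0 2
```

Printing the commands first makes it possible to check the xnames against the blade being replaced before anything is removed from HSM.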

Replace the blade hardware.

Review the Remove a Compute Blade Using the Lift procedure in HPE Cray EX Hardware Replacement Procedures H-6173 at HPE Support for detailed instructions.

CAUTION: Always power off the chassis slot or device before removal. The best practice is to unlatch and unseat the device while the coolant hoses are still connected, then disconnect the coolant hoses. If this is not possible, disconnect the coolant hoses, then quickly unlatch/unseat the device (within 10 seconds). Failure to do so may damage the equipment.

Power on and boot the compute nodes

Un-suspend the hms-discovery cronjob in Kubernetes.

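```
ncn-mw# kubectl -n services patch cronjobs hms-discovery -p '{"spec" : {"suspend" : false }}'
```

Verify that the cron job is no longer suspended (SUSPEND column = False).

```
ncn-mw# kubectl get cronjobs.batch -n services hms-discovery
```

Example output:

```
NAME SCHEDULE SUSPEND ACTIVE LAST SCHEDULE AGE
hms-discovery */3 * * * * False 1 41s 33d
```

Check the log of a recent hms-discovery pod for the discovered component names (xnames). The pod name in this command is an example; substitute the name of a pod on the system.

```
ncn-mw# kubectl -n services logs hms-discovery-1600117560-5w95d hms-discovery | grep "Mountain discovery finished" | jq '.discoveredXnames'
```

Example output:

```
[
  "x1000c3s0b0"
]
```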

Enable MEDS for the compute nodes in the blade.

```
ncn-mw# cray hsm inventory redfishEndpoints update --enabled true --rediscover-on-update true x1000c3s0b0 --format toml
```

The updated component names (xnames) will be returned. Example output:

```
x1000c3s0b0
```

Wait for 3-5 minutes for the blade to power on and the node BMCs to be discovered.

Verify that the affected nodes are enabled in the HSM.

```
ncn-mw# cray hsm state components describe x1000c3s0b0n0 --format toml
```

Beginning of example output:

```
Type = "Node"
Enabled = true
State = "Off"
```

Verify the BMCs have been discovered by the HSM.

```
ncn-mw# cray hsm inventory redfishEndpoints describe x1000c3s0b0 --format json
```

Example output:

    {
        "ID": "x1000c3s0b0",
        "Type": "NodeBMC",
        "Hostname": "x1000c3s0b0",
        "Domain": "",
        "FQDN": "x1000c3s0b0",
        "Enabled": true,
        "UUID": "e005dd6e-debf-0010-e803-b42e99be1a2d",
        "User": "root",
        "Password": "",
        "MACAddr": "b42e99be1a2d",
        "RediscoverOnUpdate": true,
        "DiscoveryInfo": {
            "LastDiscoveryAttempt": "2021-01-29T16:15:37.643327Z",
            "LastDiscoveryStatus": "DiscoverOK",
            "RedfishVersion": "1.7.0"
        }
    }

When LastDiscoveryStatus is DiscoverOK, the node BMC has been successfully discovered.
If LastDiscoveryStatus is DiscoveryStarted, then the BMC is currently being inventoried by HSM.
If LastDiscoveryStatus is HTTPsGetFailed or ChildVerificationFailed, then an error has occurred during the discovery process.
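The three cases above map naturally onto a small decision helper for scripts that poll the endpoint. This is a sketch only: `discovery_action` is a hypothetical name, and any status value not named in this procedure is reported as `unknown` rather than guessed at.

```bash
# Hypothetical helper: map a LastDiscoveryStatus value to the action implied
# by this procedure. Prints "ok", "wait", "error", or "unknown".
discovery_action() {
    case "$1" in
        DiscoverOK)                             echo ok ;;
        DiscoveryStarted)                       echo wait ;;
        HTTPsGetFailed|ChildVerificationFailed) echo error ;;
        *)                                      echo unknown ;;
    esac
}

# Usage on a management node (extracts the status from the JSON output):
#   status=$(cray hsm inventory redfishEndpoints describe x1000c3s0b0 --format json |
#       sed -n 's/.*"LastDiscoveryStatus":[[:space:]]*"\([^"]*\)".*/\1/p')
#   discovery_action "$status"
```

A polling loop would typically sleep and retry while the result is `wait`, and stop on `ok` or `error`.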

Enable each node individually in the HSM database (in this example, the nodes are x1000c3s0b0n0, x1000c3s0b0n1, x1000c3s0b1n0, and x1000c3s0b1n1).
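No command is listed for this step. One plausible approach, sketched below, reuses the `cray hsm state components enabled update` subcommand that disabled the slot earlier in this procedure, with `--enabled true` and a node xname. The function name `enable_node_cmds` is hypothetical, and the sketch prints the commands for review instead of running them. The xnames shown are the four nodes from the BOS commands in this procedure.

```bash
# Hypothetical helper: print (do not run) one enable command for each node
# xname passed as an argument.
enable_node_cmds() {
    local xname
    for xname in "$@"; do
        printf 'cray hsm state components enabled update --enabled true %s\n' "$xname"
    done
}

# The four nodes of the EX425 blade used in the BOS commands.
enable_node_cmds x1000c3s0b0n0 x1000c3s0b0n1 x1000c3s0b1n0 x1000c3s0b1n1
```

After checking the printed xnames, run each command on a management node.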
Optional: Force rediscovery of the components in the chassis (the example shows cabinet 1000, chassis 3).
```
ncn-mw# cray hsm inventory discover create --xnames x1000c3
```

Optional: Verify that discovery has completed (LastDiscoveryStatus = “DiscoverOK”).

```
ncn-mw# cray hsm inventory redfishEndpoints describe x1000c3b0 --format toml
```

Example output:

```
Type = "ChassisBMC"
Domain = ""
MACAddr = "02:13:88:03:00:00"
Enabled = true
Hostname = "x1000c3"
RediscoverOnUpdate = true
FQDN = "x1000c3"
User = "root"
Password = ""
IPAddress = "10.104.0.76"
ID = "x1000c3b0"
[DiscoveryInfo]
LastDiscoveryAttempt = "2020-09-03T19:03:47.989621Z"
RedfishVersion = "1.2.0"
LastDiscoveryStatus = "DiscoverOK"
```

Verify that the correct firmware versions are installed for the node BIOS, node controller (nC), NIC mezzanine card (NMC), GPUs, and so on.
Optional: If necessary, update the firmware. Review the Firmware Action Service (FAS) documentation.
```
ncn-mw# cray fas actions create CUSTOM_DEVICE_PARAMETERS.json
```

Update the System Layout Service (SLS).

Dump the existing SLS configuration.

```
ncn-mw# cray sls networks describe HSN --format=json > existingHSN.json
```

Copy existingHSN.json to a newHSN.json, edit newHSN.json with the changes, then run:

```
ncn-mw# curl -s -k -H "Authorization: Bearer ${TOKEN}" https://API_SYSTEM/apis/sls/v1/networks/HSN -X PUT -d @newHSN.json
```
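Because the PUT replaces the HSN network definition, a quick preflight check before uploading can catch simple mistakes. The sketch below is an illustration, not part of CSM: `sls_put_preflight` is a hypothetical name, and it only checks that `TOKEN` is set, that the edited file is non-empty, and that it differs from the dump. It does not validate the JSON itself.

```bash
# Hypothetical helper: refuse to continue unless a token is set, the edited
# file is non-empty, and it actually differs from the original dump.
#   $1 = original dump (existingHSN.json)
#   $2 = edited copy   (newHSN.json)
sls_put_preflight() {
    [ -n "${TOKEN:-}" ] || { echo "TOKEN is not set" >&2; return 1; }
    [ -s "$2" ] || { echo "$2 is missing or empty" >&2; return 1; }
    if cmp -s "$1" "$2"; then
        echo "$2 is identical to $1; nothing to upload" >&2
        return 1
    fi
    # Show the pending change for review; diff exits 1 when files differ,
    # which is expected here.
    diff -u "$1" "$2" || true
    return 0
}

# Usage on a management node:
#   sls_put_preflight existingHSN.json newHSN.json && echo "ready to upload"
```

Keeping `existingHSN.json` unchanged also leaves a copy that can be uploaded again if the new configuration needs to be backed out.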

Reload DVS on NCNs.

Use boot orchestration to power on and boot the nodes.

Specify the appropriate BOS template for the node type.

```
ncn-mw# cray bos session create --template-uuid BOS_TEMPLATE --operation reboot \
            --limit x1000c3s0b0n0,x1000c3s0b0n1,x1000c3s0b1n0,x1000c3s0b1n1
```