Removing a Liquid-cooled blade from a System

This procedure removes a liquid-cooled blade from an HPE Cray EX system.

Prerequisites

The Cray command line interface (CLI) tool is initialized and configured on the system. See Configure the Cray CLI.
Knowledge of whether Data Virtualization Service (DVS) is operating over the Node Management Network (NMN) or the High Speed Network (HSN).
The Slingshot fabric must be configured with the desired topology for the desired state of the blades in the system.
The System Layout Service (SLS) must have the desired HSN configuration.
Check the status of the HSN and record link status before the procedure.
The blades must have the coolant drained and filled during the swap to minimize cross-contamination of cooling systems.
- Review procedures in HPE Cray EX Coolant Service Procedures H-6199
- Review the HPE Cray EX Hand Pump User Guide H-6200

Procedure

1. Prepare the source system blade for removal

Use the workload manager (WLM) to drain running jobs from the affected nodes on the blade.

Refer to the vendor documentation for the WLM for more information.
(ncn-mw#) Use Boot Orchestration Services (BOS) to shut down the affected nodes in the source blade.

In this example, x9000c3s0 is the source blade. Specify the appropriate component name (xname) and BOS template for the node type in the following command.
```
BOS_TEMPLATE=cos-2.0.30-slurm-healthy-compute
cray bos v1 session create --template-name $BOS_TEMPLATE --operation shutdown --limit x9000c3s0b0n0,x9000c3s0b0n1,x9000c3s0b1n0,x9000c3s0b1n1
```
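The comma-separated `--limit` list can be generated from the slot xname instead of being typed by hand. The following is a minimal sketch, assuming the blade has two node BMCs (b0, b1) with two nodes each (n0, n1), as in the example above. `blade_node_xnames` is a hypothetical helper and is not part of the Cray CLI; other blade types may need different counts.

```shell
# Hypothetical helper: build the comma-separated node xname list for one blade.
# Defaults assume 2 node BMCs per slot and 2 nodes per BMC, matching the
# example blade x9000c3s0; pass other counts for other blade types.
# The body runs in a subshell so its variables do not leak into the caller.
blade_node_xnames() (
    slot="$1"
    bmcs="${2:-2}"
    nodes="${3:-2}"
    list=""
    b=0
    while [ "$b" -lt "$bmcs" ]; do
        n=0
        while [ "$n" -lt "$nodes" ]; do
            # Prepend a comma only when the list already has an entry.
            list="${list:+${list},}${slot}b${b}n${n}"
            n=$((n + 1))
        done
        b=$((b + 1))
    done
    printf '%s\n' "$list"
)

blade_node_xnames x9000c3s0
# prints x9000c3s0b0n0,x9000c3s0b0n1,x9000c3s0b1n0,x9000c3s0b1n1
```

The result can then be passed to the BOS command as `--limit "$(blade_node_xnames x9000c3s0)"`.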

2. Disable the Redfish endpoints for the nodes

(ncn-mw#) Temporarily disable the Redfish endpoints for NodeBMCs present in the blade.

cray hsm inventory redfishEndpoints update --enabled false x9000c3s0b0 --id x9000c3s0b0
cray hsm inventory redfishEndpoints update --enabled false x9000c3s0b1 --id x9000c3s0b1
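The two commands above can be wrapped in a loop so that the same slot variable drives every BMC on the blade. The following is a sketch only, assuming two node BMCs (b0, b1) per slot as in the example. The `run` wrapper, the `DRY_RUN` variable, and `disable_blade_endpoints` are hypothetical names introduced for illustration; with `DRY_RUN=1` (the default here) the commands are printed rather than executed, so they can be reviewed first.

```shell
# Hypothetical dry-run switch: 1 prints commands, 0 executes them.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical wrapper: print the command in dry-run mode, otherwise run it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Disable the Redfish endpoint of each node BMC in the given slot.
# Assumes the blade has node BMCs b0 and b1, as in the example blade.
disable_blade_endpoints() {
    for bmc in "${1}b0" "${1}b1"; do
        run cray hsm inventory redfishEndpoints update --enabled false "$bmc" --id "$bmc"
    done
}

disable_blade_endpoints x9000c3s0
```

Set `DRY_RUN=0` in the environment before calling `disable_blade_endpoints` to execute the commands for real.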

3. Clear Redfish event subscriptions from BMCs on the blade

(ncn-mw#) Set the environment variable SLOT to the blade’s location.
```
SLOT="x9000c3s0"
```

(ncn-mw#) Clear the Redfish event subscriptions.

export TOKEN=$(curl -s -S -d grant_type=client_credentials \
        -d client_id=admin-client \
        -d client_secret=`kubectl get secrets admin-client-auth -o jsonpath='{.data.client-secret}' | base64 -d` \
        https://api-gw-service-nmn.local/keycloak/realms/shasta/protocol/openid-connect/token | jq -r '.access_token')

for BMC in $(cray hsm inventory redfishEndpoints list --type NodeBMC --format json | jq -r '.RedfishEndpoints[].ID' | grep "${SLOT}"); do
    /usr/share/doc/csm/scripts/operations/node_management/delete_bmc_subscriptions.py "$BMC"
done

Each BMC on the blade will have output like the following:

Clearing subscriptions from NodeBMC x3000c0s9b0
Retrieving BMC credentials from SCSD
Retrieving Redfish Event subscriptions from the BMC: https://x3000c0s9b0/redfish/v1/EventService/Subscriptions
Deleting event subscription: https://x3000c0s9b0/redfish/v1/EventService/Subscriptions/1
Successfully deleted https://x3000c0s9b0/redfish/v1/EventService/Subscriptions/1

4. Clear the node controller settings

(ncn-mw#) Remove the system-specific settings from each node controller on the blade.

curl -k -u root:PASSWORD -X POST -H \
  'Content-Type: application/json' -d '{"ResetType":"StatefulReset"}' \
  https://x9000c3s0b0/redfish/v1/Managers/BMC/Actions/Manager.Reset

curl -k -u root:PASSWORD -X POST -H \
  'Content-Type: application/json' -d '{"ResetType":"StatefulReset"}' \
  https://x9000c3s0b1/redfish/v1/Managers/BMC/Actions/Manager.Reset

Use Ctrl-C to return to the prompt if the command does not return.

5. Power off the chassis slot

(ncn-mw#) Suspend the hms-discovery cron job.

kubectl -n services patch cronjobs hms-discovery -p '{"spec" : {"suspend" : true }}'

(ncn-mw#) Verify that the hms-discovery cron job has stopped (ACTIVE = 0 and SUSPEND = True).

kubectl get cronjobs -n services hms-discovery

Example output:

NAME             SCHEDULE        SUSPEND     ACTIVE   LAST SCHEDULE   AGE
hms-discovery    */3 * * * *     True        0        117s            15d
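The same SUSPEND and ACTIVE check can be scripted instead of read from the table. The following is a minimal sketch, assuming `jq` is available on the node and reading the standard Kubernetes CronJob fields `.spec.suspend` and `.status.active`. The function names `cronjob_is_quiesced` and `check_hms_discovery` are hypothetical.

```shell
# Hypothetical helper: succeed only when the cron job is suspended and
# no jobs started by it are still active.
#   $1 = value of .spec.suspend ("true" or "false")
#   $2 = number of entries in .status.active
cronjob_is_quiesced() {
    [ "$1" = "true" ] && [ "${2:-0}" -eq 0 ]
}

# Hypothetical wrapper: query the live cron job and apply the check.
check_hms_discovery() {
    json=$(kubectl get cronjobs -n services hms-discovery -o json) || return 1
    suspend=$(printf '%s' "$json" | jq -r '.spec.suspend')
    # .status.active is absent when nothing is running; jq reports length 0.
    active=$(printf '%s' "$json" | jq '.status.active | length')
    if cronjob_is_quiesced "$suspend" "$active"; then
        echo "hms-discovery is suspended with no active jobs"
    else
        echo "hms-discovery is not quiesced (suspend=$suspend active=$active)" >&2
        return 1
    fi
}
```

Run `check_hms_discovery` after the patch command; a non-zero exit status means a job was still running or the suspend flag did not take effect.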

(ncn-mw#) Power off the chassis slot.

This example powers off slot 0, chassis 3, in cabinet 9000.
```
cray capmc xname_off create --xnames x9000c3s0 --recursive true
```

6. Disable the chassis slot

(ncn-mw#) Disable the chassis slot.

Disabling the slot prevents hms-discovery from automatically powering on the slot. This example disables slot 0, chassis 3, in cabinet 9000.
```
cray hsm state components enabled update --enabled false x9000c3s0 --id x9000c3s0
```

7. Record MAC and IP addresses for nodes

IMPORTANT: Record the NMN MAC and IP addresses for each node in the blade (labeled Node Maintenance Network). To prevent disruption in DVS when operating over the NMN, these addresses must be maintained in the HSM when the blade is swapped and discovered.

The NodeBMC MAC and IP addresses are assigned algorithmically and must not be deleted from the HSM.

(ncn-mw#) Skip this step if DVS is operating over the HSN, otherwise proceed with this step. Query HSM to determine the ComponentID, MAC addresses, and IP addresses for each node in the blade.

The prerequisites show an example of how to gather HSM values and store them to a file.

cray hsm inventory ethernetInterfaces list --component-id x9000c3s0b0n0 --format json

Example output:

[
  {
    "ID": "0040a6836339",
    "Description": "Node Maintenance Network",
    "MACAddress": "00:40:a6:83:63:39",
    "LastUpdate": "2021-04-09T21:51:04.662063Z",
    "ComponentID": "x9000c3s0b0n0",
    "Type": "Node",
    "IPAddresses": [
      {
        "IPAddress": "10.100.0.10"
      }
    ]
  }
]

Record the following values for the blade:

`ComponentID: "x9000c3s0b0n0"`
`MACAddress: "00:40:a6:83:63:39"`
`IPAddress: "10.100.0.10"`

Repeat the command to record the ComponentID, MAC addresses, and IP addresses for the Node Maintenance Network for the other nodes in the blade.
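The values for every node on the blade can be collected into a file in one pass rather than copied by hand. The following is a sketch, assuming `jq` is available and that the blade has the example layout (b0/b1, n0/n1). `nmn_interface_rows` and `record_blade_nmn` are hypothetical helper names, and the output file path is chosen by the caller.

```shell
# Hypothetical helper: read ethernetInterfaces JSON on stdin and print one
# tab-separated row (ComponentID, MACAddress, first IPAddress) for each
# interface whose description is "Node Maintenance Network".
nmn_interface_rows() {
    jq -r '.[]
           | select(.Description == "Node Maintenance Network")
           | [.ComponentID, .MACAddress, (.IPAddresses[0].IPAddress // "")]
           | @tsv'
}

# Hypothetical wrapper: record the NMN rows for every node on the blade.
#   $1 = slot xname, $2 = output file (truncated before writing)
# The node list assumes the example blade layout (b0/b1, n0/n1).
record_blade_nmn() {
    slot="$1"
    outfile="$2"
    : > "$outfile"
    for node in "${slot}b0n0" "${slot}b0n1" "${slot}b1n0" "${slot}b1n1"; do
        cray hsm inventory ethernetInterfaces list --component-id "$node" --format json |
            nmn_interface_rows >> "$outfile"
    done
}
```

For example, `record_blade_nmn x9000c3s0 /tmp/x9000c3s0-nmn.tsv` would write one row per NMN interface; keep the file until the replacement blade has been discovered.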

8. Clean up Hardware State Manager

(ncn-mw#) Set an environment variable that corresponds to the chassis slot of the blade.
```
CHASSIS_SLOT=x9000c3s0
```

(ncn-mw#) Delete the Redfish endpoints for each node.

for xname in $(cray hsm inventory redfishEndpoints list --format json |
                 jq -r --arg CHASSIS_SLOT "${CHASSIS_SLOT}" \
                   '.RedfishEndpoints[] | select(.ID | startswith($CHASSIS_SLOT)) | .ID')
do
    echo "Removing $xname from HSM Inventory RedfishEndpoints"
    cray hsm inventory redfishEndpoints delete "$xname"
done

(ncn-mw#) Remove entries from the state components.

for xname in $(cray hsm state components list --format json |
                 jq -r --arg CHASSIS_SLOT "${CHASSIS_SLOT}" \
                   '.Components[] | select((.ID | startswith($CHASSIS_SLOT)) and (.ID != $CHASSIS_SLOT)) | .ID' )
do
    echo "Removing $xname from HSM State components"
    cray hsm state components delete "$xname"
done

(ncn-mw#) Delete the NMN MAC and IP addresses for each node in the blade from the HSM.

Do not delete the MAC and IP addresses for the node BMC.

for mac in $(cray hsm inventory ethernetInterfaces list --type Node --format json |
               jq -r --arg CHASSIS_SLOT "${CHASSIS_SLOT}" \
                 '.[] | select(.ComponentID | startswith($CHASSIS_SLOT)) | .ID')
do
    echo "Removing $mac from HSM Inventory EthernetInterfaces"
    cray hsm inventory ethernetInterfaces delete "$mac"
done

(ncn-mw#) Restart Kea.

kubectl delete pods -n services -l app.kubernetes.io/name=cray-dhcp-kea

9. Remove the blade

Remove the blade from the source location.
- Review the Remove a Compute Blade Using the Lift procedure in HPE Cray EX Hardware Replacement Procedures H-6173 for detailed instructions for replacing liquid-cooled blades. These procedures can be found on the HPE Support Center.
Drain the coolant from the blade and fill with fresh coolant to minimize cross-contamination of cooling systems.
- Review HPE Cray EX Coolant Service Procedures H-6199. If using the hand pump, then review procedures in the HPE Cray EX Hand Pump User Guide H-6200. These procedures can be found on the HPE Support Center.
Install the blade from the source system in a storage rack or leave it on the cart.

10. Rediscover the Chassis BMC of the chassis the blade was removed from

(ncn-mw#) Determine the name of the Chassis BMC.

CHASSIS_BMC="$(echo "$CHASSIS_SLOT" | grep -E -o 'x[0-9]+c[0-9]+')b0"
echo "$CHASSIS_BMC"

Example output:

x9000c3b0

(ncn-mw#) Rediscover the Chassis BMC.

cray hsm inventory discover create --xnames $CHASSIS_BMC

11. Re-enable the `hms-discovery` cronjob

(ncn-mw#) Un-suspend the hms-discovery cron job if no more liquid-cooled blades are planned to be removed from the system.
```
kubectl -n services patch cronjobs hms-discovery -p '{"spec" : {"suspend" : false }}'
```

(ncn-mw#) Verify that the hms-discovery cron job is no longer suspended (SUSPEND = False).

kubectl get cronjobs -n services hms-discovery

Example output:

NAME            SCHEDULE      SUSPEND   ACTIVE   LAST SCHEDULE   AGE
hms-discovery   */3 * * * *   False     1        46s             15d
