NCN Hardware Swaps

This page describes how to handle various hardware changes on non-compute nodes (NCNs).

Node Components
- Rebooting Nodes
  - CEPH
  - Kubernetes
Nodes
Safely Removing Nodes from Runtime
- Rebuilding CEPH NCNs
  - Quorum
- Rebuilding Kubernetes NCNs
  - Master Nodes
  - Worker Nodes
Enable Kdump
Cleaning up Out-Of-Memory Pods

Node Components

Malfunctioning or disabled hardware may need to be removed, and upgraded hardware may need to be installed.

In either case, certain hardware requires that the node be shut down before work begins.

Component	Server Off	Rebuild Required
CPU	Yes	No
RAM	Yes	No
OS disks	No	No ¹
Ephemeral disks	No	Yes
GPU	Yes	No
NIC	Yes	Yes

NOTE: These instructions apply only before booting off the LiveCD. Once that step is complete, refer to the “Rebuild NCNs” section in the HPE Cray EX Hardware Management Administration Guide S-8015.

Rebooting Nodes

For operations that do not require a rebuild, a power off and cold boot will suffice.

CEPH

STUB This is a stub that requires code snippets to step-wise replace a storage node (short of rebuilding everything).

Kubernetes

If the node can be powered off cleanly by issuing a poweroff command on the CLI, it will evict its containers and unmount etcd. On power-up it will rejoin the cluster.

If the node is unresponsive, alert the cluster that it will be rebooted by draining the node:

linux# kubectl drain ncn-w002

Then reboot the node, and once it is back up, tell the cluster that pods may be scheduled on it again:

linux# kubectl uncordon ncn-w002
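
The drain and uncordon steps above can be wrapped in two small helpers. This is only a sketch under stated assumptions: the function names `drain_for_reboot` and `return_to_service` are hypothetical, kubectl is assumed to be authenticated already, and the `--ignore-daemonsets` flag is an addition to the command shown above (drain refuses to proceed when DaemonSet-managed pods are present unless that flag is given).

```shell
# Hypothetical helper: drain a node before rebooting it.
# Assumes kubectl is already authenticated against the cluster.
drain_for_reboot() {
    node="$1"
    # DaemonSet-managed pods cannot be evicted; without this flag
    # drain stops with an error when any are present on the node.
    kubectl drain "$node" --ignore-daemonsets
}

# Hypothetical helper: allow scheduling on the node again,
# then print the node's current state.
return_to_service() {
    node="$1"
    kubectl uncordon "$node" && kubectl get node "$node"
}
```

For example, run `drain_for_reboot ncn-w002`, reboot the node, and then run `return_to_service ncn-w002`.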

Nodes

Swapping a node for an entirely new node mandates a “rebuild” (or a “build” if this is the first use).

Safely Removing Nodes from Runtime

Rebuilding CEPH NCNs

STUB This is a stub that requires code snippets to step-wise replace a storage node (short of rebuilding everything).

Quorum

STUB This is a stub that requires definition of constraints put on by the Ceph cluster when rebuilding nodes.

Rebuilding Kubernetes NCNs

Master Nodes

If etcd has quorum, that is, if 3 master nodes are active, then the node being rebuilt must first be removed from the etcd cluster.

Evict etcd:

STUB This is a stub that requires code snippets to search-and-destroy OOM.
Follow the procedure for worker nodes.

Worker Nodes

It is dangerous to run with 2 or fewer worker nodes. Work carefully and promptly, or pod clean-up will be necessary: Kubernetes pods will begin to throw Out-Of-Memory errors after some time.
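
Before draining a worker, it may help to check how many workers would remain. The following is a minimal sketch, not part of the documented procedure: the function names are hypothetical, worker node names are assumed to start with `ncn-w` as in the examples on this page, and the threshold simply encodes the "2 or fewer" warning above.

```shell
# Hypothetical helper: count worker NCNs whose STATUS is exactly "Ready".
# Assumes worker node names start with "ncn-w".
count_ready_workers() {
    # Columns of `kubectl get nodes --no-headers`:
    # NAME STATUS ROLES AGE VERSION
    kubectl get nodes --no-headers |
        awk '$1 ~ /^ncn-w/ && $2 == "Ready" { n++ } END { print n + 0 }'
}

# Hypothetical guard: succeed only if more than 2 Ready workers
# would remain after taking one out of service.
safe_to_drain_worker() {
    [ "$(count_ready_workers)" -gt 3 ]
}
```

A cordoned node reports `Ready,SchedulingDisabled`, so it is deliberately not counted. On a system with only a few workers the guard will fail, in which case treat it as a reminder to keep the maintenance window short rather than as a hard stop.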

Drain the target worker node and delete it from the cluster. Issue the commands from your laptop (if authenticated) or from an ingress node (such as ncn-m001):
```
linux# kubectl drain ncn-w002
linux# kubectl delete node ncn-w002
```

Power down the node either at the rack or with ipmitool:

export IPMI_PASSWORD=changeme
export username=root
ipmitool -I lanplus -U $username -E -H ncn-w002-mgmt power off
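
The power command returns before the chassis has actually changed state, so it can be useful to poll the BMC before touching hardware. The helper below is a sketch only: the function name `wait_for_power_state` and its retry defaults are hypothetical, and it assumes `IPMI_PASSWORD` and `username` are exported as shown above.

```shell
# Hypothetical helper: poll the BMC until the chassis reports the wanted
# power state ("on" or "off"), or give up after a number of attempts.
# Arguments: BMC hostname, wanted state, attempts (default 30),
# seconds between attempts (default 10).
wait_for_power_state() {
    bmc="$1"; want="$2"; tries="${3:-30}"; delay="${4:-10}"
    while [ "$tries" -gt 0 ]; do
        # `power status` prints a line such as "Chassis Power is off".
        state="$(ipmitool -I lanplus -U "$username" -E -H "$bmc" power status)"
        case "$state" in
            *"is $want") return 0 ;;
        esac
        tries=$((tries - 1))
        sleep "$delay"
    done
    return 1
}
```

For example, `wait_for_power_state ncn-w002-mgmt off` returns success once the node reports that it is powered off. The same helper can be called with `on` after powering the node back up.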

Now commence the operations on the node.

Once ready, power the node on at the rack or with ipmitool:

export IPMI_PASSWORD=changeme
export username=root
ipmitool -I lanplus -U $username -E -H ncn-w002-mgmt power on

The node will netboot from sysmgmt services (kea/unbound/s3/bss).
The node will run cloud-init and will auto-join the cluster again. Monitor for status with:
```
linux# kubectl get nodes -w
```
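
As an alternative to watching the output by hand, the wait can be scripted. This is a sketch under stated assumptions: the function name `wait_for_node_ready` and its retry defaults are hypothetical, and it relies only on the STATUS column of `kubectl get node`.

```shell
# Hypothetical helper: poll until the named node reports STATUS "Ready".
# Arguments: node name, attempts (default 60),
# seconds between attempts (default 30).
wait_for_node_ready() {
    node="$1"; tries="${2:-60}"; delay="${3:-30}"
    while [ "$tries" -gt 0 ]; do
        # Errors are discarded because the node object does not exist
        # until the rebuilt node has registered itself with the cluster.
        status="$(kubectl get node "$node" --no-headers 2>/dev/null | awk '{ print $2 }')"
        [ "$status" = "Ready" ] && return 0
        tries=$((tries - 1))
        sleep "$delay"
    done
    return 1
}
```

For example, `wait_for_node_ready ncn-w002` returns success once the node is listed as `Ready`, and returns failure if it has not appeared after the allotted attempts.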

Once your node returns to the cluster, the procedure is done.

Enable Kdump

Enable kdump on the NCNs after they are rebuilt.

Cleaning up Out-Of-Memory Pods

STUB This is a stub that requires code snippets to search-and-destroy OOM.

¹ If all OS disks are replaced, then a rebuild is required.