This page details procedures for various hardware changes.
Malfunctioning or disabled hardware may need to be removed, or upgrades may need to be installed.
In either case, certain hardware requires that the node be shut down before work begins.
Component | Server Off | Rebuild Required |
---|---|---|
CPU | Yes | No |
RAM | Yes | No |
OS disks | No | No [1] |
Ephemeral disks | No | Yes |
GPU | Yes | No |
NIC | Yes | Yes |
NOTE: These instructions only apply prior to booting off the LiveCD. Once that step is complete, refer to the “Rebuild NCNs” section in the HPE Cray EX Hardware Management Administration Guide S-8015.
For operations that do not require a rebuild, a power off and cold boot will suffice.
STUB
This is a stub that requires code snippets to step-wise replace a storage node (short of rebuilding everything).
If the node can be powered off gracefully by issuing the poweroff command on the CLI, then it will evict its containers and unmount etcd. On power-up it will rejoin the cluster.
If the node is unresponsive, you can alert the cluster that you will be rebooting it by evicting the node:
linux# kubectl drain ncn-w002
Then reboot the node and, once it is back up, tell the cluster to schedule work on it again:
linux# kubectl uncordon ncn-w002
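The drain and uncordon steps above can be wrapped in two small helpers. This is a minimal sketch, not the documented procedure: the function names are hypothetical, and the two extra `kubectl drain` flags are an assumption about what a typical cluster needs (a plain drain stops when it meets DaemonSet-managed pods or pods using emptyDir volumes).

```bash
# Sketch only. drain_node and restore_node are hypothetical helper names.

# Cordon the node and evict its pods before maintenance.
drain_node() {
    local node="$1"
    # Assumption: DaemonSet pods and emptyDir data may be skipped or discarded.
    # --ignore-daemonsets: DaemonSet-managed pods cannot be evicted, so skip them.
    # --delete-emptydir-data: allow eviction of pods that use emptyDir volumes
    #   (kubectl releases before 1.20 name this flag --delete-local-data).
    kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data
}

# Mark the node schedulable again after it has rebooted.
restore_node() {
    local node="$1"
    kubectl uncordon "$node"
}
```

For example, `drain_node ncn-w002` before the reboot and `restore_node ncn-w002` afterwards.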
Swapping a node for an entirely new node mandates a “rebuild” (or a “build” if this is the first use).
STUB
This is a stub that requires code snippets to step-wise replace a storage node (short of rebuilding everything).
STUB
This is a stub that requires definition of constraints put on by the Ceph cluster when rebuilding nodes.
If etcd has met quorum (that is, if there are 3 master nodes active), then etcd must expunge the member for the node being rebuilt.
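One way to expunge a member is with `etcdctl member list` and `etcdctl member remove`. The sketch below is illustrative only and rests on several assumptions: that `etcdctl` uses the v3 API and is already configured with the endpoints and certificates for this cluster, that the etcd member name matches the node's hostname, and that `member list` prints its default comma-separated format with the name in the third field. The function names are hypothetical.

```bash
# Sketch only. Assumes etcdctl (v3 API) is already configured for this cluster.

# Read `etcdctl member list` output on stdin and print the ID of the member
# whose name (third comma-separated field) equals $1.
etcd_member_id() {
    awk -F', ' -v name="$1" '$3 == name { print $1 }'
}

# Remove the etcd member with the given name.
# Assumption: the member name is the node hostname, for example ncn-m002.
remove_etcd_member() {
    local name="$1" id
    id="$(etcdctl member list | etcd_member_id "$name")"
    if [ -z "$id" ]; then
        echo "no etcd member named $name" >&2
        return 1
    fi
    etcdctl member remove "$id"
}
```

Running `etcdctl member list` by hand first and checking the name column is a sensible precaution before calling `remove_etcd_member`.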
STUB
This is a stub that requires code snippets to search-and-destroy OOM.
It is dangerous to run with 2 or fewer worker nodes. Work must be done with diligence, or pod clean-up will be necessary: Kubernetes pods will begin to throw Out-Of-Memory errors after some time.
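If pod clean-up does become necessary, something like the following could locate and remove the affected pods. It is a rough sketch under stated assumptions, not a supported tool: it only matches pods whose STATUS column in `kubectl get pods` currently reads `OOMKilled` (a pod that was OOM-killed earlier and now shows another status, such as `CrashLoopBackOff`, is not matched), and it assumes deleted pods are owned by a controller that recreates them. The function names are hypothetical.

```bash
# Sketch only. Function names are hypothetical.

# Read `kubectl get pods --all-namespaces --no-headers` rows on stdin and
# print "namespace name" for every pod whose STATUS column is OOMKilled.
filter_oom_pods() {
    awk '$4 == "OOMKilled" { print $1, $2 }'
}

# List OOMKilled pods across all namespaces.
find_oom_pods() {
    kubectl get pods --all-namespaces --no-headers | filter_oom_pods
}

# Delete each OOMKilled pod.
# Assumption: a Deployment, StatefulSet, or similar controller recreates it.
delete_oom_pods() {
    find_oom_pods | while read -r ns pod; do
        kubectl delete pod --namespace "$ns" "$pod"
    done
}
```

Running `find_oom_pods` on its own first shows what `delete_oom_pods` would remove.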
Drain the target worker node. Issue the commands from your laptop (if authenticated) or from an ingress node (such as ncn-m001):
linux# kubectl drain ncn-w002
linux# kubectl delete node ncn-w002
Power down the node either at the rack or with ipmitool:
export IPMI_PASSWORD=changeme
export username=root
ipmitool -I lanplus -U $username -E -H ncn-w002-mgmt power off
Now commence the operations on the node.
Once ready, power the node on at the rack or with ipmitool:
export IPMI_PASSWORD=changeme
export username=root
ipmitool -I lanplus -U $username -E -H ncn-w002-mgmt power on
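The power-off and power-on commands return before the chassis has changed state, so it can help to poll `power status` afterwards. The sketch below wraps the same `ipmitool` invocation used above. The function names, the `POLL_INTERVAL` variable, and the retry count are hypothetical, and it assumes `IPMI_PASSWORD` is exported and that `power status` prints `Chassis Power is on` or `Chassis Power is off`.

```bash
# Sketch only. Assumes IPMI_PASSWORD is exported (ipmitool -E reads it).

# ipmi BMC SUBCOMMAND...: run ipmitool against one BMC over lanplus.
ipmi() {
    local bmc="$1"
    shift
    ipmitool -I lanplus -U "${username:-root}" -E -H "$bmc" "$@"
}

# set_power BMC on|off [TRIES]: request a power state, then poll until the
# chassis reports it or the retries run out.
set_power() {
    local bmc="$1" state="$2" tries="${3:-30}"
    ipmi "$bmc" power "$state" || return 1
    while [ "$tries" -gt 0 ]; do
        # Assumption: status output ends in "is on" or "is off".
        if ipmi "$bmc" power status | grep -q "is $state\$"; then
            return 0
        fi
        tries=$((tries - 1))
        # POLL_INTERVAL is a hypothetical knob; it defaults to 5 seconds.
        sleep "${POLL_INTERVAL:-5}"
    done
    echo "timed out waiting for $bmc to power $state" >&2
    return 1
}
```

For example, `set_power ncn-w002-mgmt off` before the hardware work and `set_power ncn-w002-mgmt on` afterwards.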
The node will netboot from sysmgmt services (kea/unbound/s3/bss).
The node will run cloud-init and will auto-join the cluster again. Monitor for status with:
linux# kubectl get nodes -w
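For unattended use, the watch above can be replaced with a loop around `kubectl wait`. This is a minimal sketch: the function name, retry count, and `POLL_INTERVAL` variable are hypothetical. The loop is needed because `kubectl wait` fails immediately while the node object does not exist yet, which is the case after the node has been deleted from the cluster.

```bash
# Sketch only. wait_for_node is a hypothetical helper name.

# wait_for_node NODE [TRIES]: block until NODE exists and reports Ready.
wait_for_node() {
    local node="$1" tries="${2:-60}"
    while [ "$tries" -gt 0 ]; do
        # Succeeds only once the node object exists and its Ready condition is true.
        if kubectl wait --for=condition=Ready "node/$node" --timeout=30s >/dev/null 2>&1; then
            echo "$node is Ready"
            return 0
        fi
        tries=$((tries - 1))
        # POLL_INTERVAL is a hypothetical knob; it defaults to 10 seconds.
        sleep "${POLL_INTERVAL:-10}"
    done
    echo "timed out waiting for $node" >&2
    return 1
}
```

For example, `wait_for_node ncn-w002` returns 0 once the rebuilt worker has rejoined and is Ready.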
Once your node returns to the cluster, the procedure is done.
Enable kdump on the NCNs after they are rebuilt.
STUB
This is a stub that requires code snippets to search-and-destroy OOM.
[1] If replacing all OS disks, a rebuild is required.