Rebuild a master, worker, or storage non-compute node (NCN). Use this procedure when a node has had a hardware failure or another problem that warrants rebuilding it.
The system is fully installed and has transitioned off of the LiveCD.
(ncn#) Variables set with the name of the node being rebuilt and its component name (xname) are required.
- Set NODE to the hostname of the node being rebuilt (e.g. ncn-w001, ncn-w002, etc).
- Set XNAME to the component name (xname) of that node.
NODE=ncn-w00n
XNAME=$(ssh "$NODE" cat /etc/cray/xname)
echo "$XNAME"
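Because the rest of the procedure depends on the node type, it can help to check NODE before continuing. The following is a minimal sketch, not part of CSM: `ncn_type` is a hypothetical helper, and it assumes NCN hostnames follow the pattern `ncn-m`, `ncn-w`, or `ncn-s` followed by three digits, as in the examples above.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not shipped with CSM): classify an NCN hostname.
# Assumes the naming pattern ncn-<m|w|s><three digits>.
ncn_type() {
    local node="$1"
    case "$node" in
        ncn-m[0-9][0-9][0-9]) echo master ;;
        ncn-w[0-9][0-9][0-9]) echo worker ;;
        ncn-s[0-9][0-9][0-9]) echo storage ;;
        *) echo "unrecognized NCN hostname: $node" >&2; return 1 ;;
    esac
}

# A real hostname is classified; the unedited placeholder is rejected.
ncn_type ncn-w002
ncn_type ncn-w00n || echo "NODE still holds the placeholder value"
```

The second call fails because `ncn-w00n` does not end in three digits, which catches the case where the placeholder was pasted without being edited.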
Only follow the steps in the section for the node type that is being rebuilt.
NOTE: (ncn#) Restart the goss-servers service on the node after it has been rebuilt. This is necessary because of a timing issue that is fixed in CSM 1.6.1.
ssh "${NODE}" 'systemctl restart goss-servers'
(ncn-m001#) Run ncn-upgrade-worker-storage-nodes.sh for ncn-w001.
Follow the output of the script carefully. The script will pause for manual interaction.
/usr/share/doc/csm/upgrade/scripts/upgrade/ncn-upgrade-worker-storage-nodes.sh ncn-w001
NOTES:
- The root user password for the node may need to be reset after it is rebooted.
- See Starting a new workflow after a failed workflow if this command fails and needs to be restarted.
Multiple workers can be upgraded simultaneously by passing them as a comma-separated list into the rebuild script.
In some cases, it is not possible to upgrade all workers in one request. It is the system administrator’s responsibility to make sure that the following conditions are met:
- If the system has more than five workers, then they cannot all be rebuilt with a single request. In this case, split the rebuild into multiple requests, with each request specifying no more than five workers.
- No single rebuild request should include all of the worker nodes that have DVS running on them. For High Availability, DVS requires at least two workers running DVS and CPS at all times.
When rebuilding worker nodes which are running DVS, it is not recommended to simultaneously reboot compute nodes. This is to avoid restarting DVS clients and servers at the same time.
(ncn-m001#) An example of a single request to rebuild multiple worker nodes simultaneously:
/usr/share/doc/csm/upgrade/scripts/upgrade/ncn-upgrade-worker-storage-nodes.sh ncn-w002,ncn-w003,ncn-w004
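The five-worker limit described above can be applied mechanically when building the comma-separated argument. The following is a minimal sketch, not part of CSM: `batch_workers` is a hypothetical helper that only groups hostnames by count and prints one comma-separated list per request.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not shipped with CSM): print the given hostnames as
# comma-separated batches of at most $1 nodes, one batch per line.
batch_workers() {
    local max="$1"; shift
    local batch="" count=0 node
    for node in "$@"; do
        # Emit the current batch once it is full, then start a new one.
        if [ "$count" -eq "$max" ]; then
            echo "$batch"
            batch=""
            count=0
        fi
        batch="${batch:+$batch,}$node"
        count=$((count + 1))
    done
    # Emit the final, possibly partial, batch.
    [ -n "$batch" ] && echo "$batch"
    return 0
}

# Seven workers split into one request of five and one request of two.
batch_workers 5 ncn-w001 ncn-w002 ncn-w003 ncn-w004 ncn-w005 ncn-w006 ncn-w007
```

Each printed line could then be passed as the argument to the rebuild script in a separate request. Note that this sketch does not know which workers run DVS, so the administrator still has to confirm that no single batch contains all of the DVS workers.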
Master node rebuilds require that the environment variables CSM_RELEASE and CSM_ARTI_DIR be set on the node where the rebuild script is executed.
(ncn-m#) Set the CSM_RELEASE and CSM_ARTI_DIR environment variables. Replace 1.4.0 with the correct CSM release version:
export CSM_RELEASE=1.4.0
export CSM_ARTI_DIR="/etc/cray/upgrade/csm/csm-${CSM_RELEASE}/tarball/csm-${CSM_RELEASE}"
NOTES:
- If the /etc/cray/upgrade/csm/ directory is empty, create an empty directory at the same path. Download and extract the CSM tarball to that directory.
- Update the value of CSM_ARTI_DIR with the newly created directory above.
- Ensure the /etc/cray/upgrade/csm/ directory is a ceph mount using the command below (its output should show ceph as the type):
mount | grep /etc/cray/upgrade/csm
- Steps to download CSM tarball are at Update Product Stream.
- If Kubernetes encryption has been enabled via the Kubernetes Encryption Documentation, then back up the /etc/cray/kubernetes/encryption directory on the master node before upgrading. The directory needs to be restored after the node has been rebuilt, and the kube-apiserver on the node should be restarted. See Kubernetes kube-apiserver Failing for details on how to restart the kube-apiserver.
- This script should be run from ncn-m001 when rebuilding ncn-m002 or ncn-m003.
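The ceph mount check in the notes above can also be scripted rather than read by eye. The following is a minimal sketch, not part of CSM: `mount_fstype` is a hypothetical helper, and it assumes the usual Linux `mount` output layout of `<source> on <mount point> type <fstype> (<options>)` with no spaces in the source field.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not shipped with CSM): read `mount` output on stdin
# and print the filesystem type of the mount point given as $1.
# Assumes lines of the form: <source> on <mount point> type <fstype> (<options>)
mount_fstype() {
    awk -v mp="$1" '$2 == "on" && $3 == mp && $4 == "type" { print $5; exit }'
}

# Usage on a live system (mount prints the path without a trailing slash):
#   [ "$(mount | mount_fstype /etc/cray/upgrade/csm)" = "ceph" ] \
#       || echo "/etc/cray/upgrade/csm is not a ceph mount" >&2
```

If the directory is not a mount point at all, the helper prints nothing, so the comparison against `ceph` fails in that case as well.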
(ncn-m#) Rebuild the desired master node. Replace ncn-m002 with the desired node to rebuild:
/usr/share/doc/csm/upgrade/scripts/rebuild/ncn-rebuild-master-nodes.sh ncn-m002
NOTES:
- This script should be run from ncn-m002 when rebuilding ncn-m001.
- This script should be run from ncn-m001 when rebuilding ncn-m002 or ncn-m003.
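The rule in the notes above for where to run the script can be written as a small lookup. The following is a minimal sketch, not part of CSM: `rebuild_run_host` is a hypothetical helper that covers only the three master nodes named in the notes.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not shipped with CSM): print the master node from
# which the rebuild script should be run for the target master node $1.
rebuild_run_host() {
    case "$1" in
        ncn-m001) echo ncn-m002 ;;
        ncn-m002|ncn-m003) echo ncn-m001 ;;
        *) echo "not covered by the notes: $1" >&2; return 1 ;;
    esac
}

rebuild_run_host ncn-m001   # prints ncn-m002
rebuild_run_host ncn-m003   # prints ncn-m001
```

Comparing the result with the output of `hostname` before starting the rebuild is one way to avoid launching the script on the node that is about to be rebuilt.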
Follow each step below:
Restore any configuration for the node that is not applied automatically by CFS live node personalization, such as SSH configuration files.
After completing all of the steps, run the Final Validation steps.