Rebuild a master, worker, or storage non-compute node (NCN). Use this procedure when a node has a hardware failure or another issue that warrants rebuilding it.
The system is fully installed and has transitioned off of the LiveCD.
(ncn#) Variables set with the name of the node being rebuilt and its component name (xname) are required.
- Set NODE to the hostname of the node being rebuilt (e.g. ncn-w001, ncn-w002, etc.).
- Set XNAME to the component name (xname) of that node.
NODE=ncn-w00n
XNAME=$(ssh "${NODE}" cat /etc/cray/xname)
echo "${XNAME}"
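Before continuing, it can help to confirm that both variables were actually populated, since the ncn-w00n value above is only a placeholder. The following is a minimal sketch, not part of CSM: check_rebuild_vars is a hypothetical helper, and the hostname pattern (ncn-m, ncn-s, or ncn-w followed by three digits) is an assumption based on the hostnames used in this procedure.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not shipped with CSM).
# Returns 0 only when NODE looks like an NCN hostname (assumed pattern:
# ncn-m###, ncn-s###, or ncn-w###) and XNAME is non-empty.
check_rebuild_vars() {
    local node="$1" xname="$2"
    if [[ ! "$node" =~ ^ncn-[msw][0-9]{3}$ ]]; then
        echo "unexpected NODE value: '$node'" >&2
        return 1
    fi
    if [[ -z "$xname" ]]; then
        echo "XNAME is empty; the ssh lookup may have failed" >&2
        return 1
    fi
    echo "NODE=$node XNAME=$xname"
}
```

A possible usage is `check_rebuild_vars "${NODE}" "${XNAME}"`, which fails if the placeholder hostname was left unchanged or the xname lookup returned nothing.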
Only follow the steps in the section for the node type that is being rebuilt.
NOTE:
(ncn#) Restart the goss-servers service on the rebuilt node after it has been rebuilt. This is necessary because of a timing issue that is fixed in CSM 1.6.1.
ssh "${NODE}" 'systemctl restart goss-servers'
(ncn-m001#) Run ncn-upgrade-worker-storage-nodes.sh for ncn-w001.
Follow the output of the script carefully. The script will pause for manual interaction.
/usr/share/doc/csm/upgrade/scripts/upgrade/ncn-upgrade-worker-storage-nodes.sh ncn-w001
NOTES:
- The root user password for the node may need to be reset after it is rebooted.
- See Starting a new workflow after a failed workflow if this command fails and needs to be restarted.
Multiple workers can be upgraded simultaneously by passing them as a comma-separated list into the rebuild script.
In some cases, it is not possible to upgrade all workers in one request. It is the system administrator’s responsibility to make sure that the following conditions are met:
- If the system has more than five workers, then they cannot all be rebuilt with a single request. In this case, the rebuild should be split into multiple requests, with each request specifying no more than five workers.
- No single rebuild request should include all of the worker nodes that have DVS running on them. For High Availability, DVS requires at least two workers running DVS and CPS at all times.
- When rebuilding worker nodes which are running DVS, it is not recommended to simultaneously reboot compute nodes. This is to avoid restarting DVS clients and servers at the same time.
(ncn-m001#) An example of a single request to rebuild multiple worker nodes simultaneously:
/usr/share/doc/csm/upgrade/scripts/upgrade/ncn-upgrade-worker-storage-nodes.sh ncn-w002,ncn-w003,ncn-w004
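The five-worker limit per request can be applied mechanically. The following is a minimal sketch under that assumption: batch_workers is a hypothetical helper, not part of CSM, that prints comma-separated batches of at most five hostnames. It does not know which workers run DVS, so the DVS conditions still have to be checked by hand for each batch.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not shipped with CSM).
# Prints one comma-separated batch per line, each with at most five workers.
batch_workers() {
    local max=5 batch="" count=0 w
    for w in "$@"; do
        if (( count == max )); then
            # Current batch is full: emit it and start a new one.
            echo "$batch"
            batch=""
            count=0
        fi
        batch="${batch:+$batch,}$w"
        count=$((count + 1))
    done
    # Emit the final, possibly partial, batch.
    if [[ -n "$batch" ]]; then
        echo "$batch"
    fi
    return 0
}
```

For example, `batch_workers ncn-w001 ncn-w002 ncn-w003 ncn-w004 ncn-w005 ncn-w006 ncn-w007` prints two lines, and each line could be passed as the argument of a separate rebuild request after confirming that it does not contain every DVS worker.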
Master node rebuilds require that the environment variables CSM_RELEASE and CSM_ARTI_DIR be set on the node where the rebuild script is executed.
(ncn-m#) Set the CSM_RELEASE and CSM_ARTI_DIR environment variables. Replace 1.4.0 with the correct CSM release version:
export CSM_RELEASE=1.4.0
export CSM_ARTI_DIR="/etc/cray/upgrade/csm/csm-${CSM_RELEASE}/tarball/csm-${CSM_RELEASE}"
NOTES:
- If the /etc/cray/upgrade/csm/ directory is empty, create an empty directory at the same path. Download and extract the CSM tarball to that directory.
- Update the value of CSM_ARTI_DIR with the newly created directory above.
- Ensure the /etc/cray/upgrade/csm/ directory is a ceph mount using the command below (its output should show ceph as the type):
mount | grep /etc/cray/upgrade/csm
- Steps to download the CSM tarball are at Update Product Stream.
- If Kubernetes encryption has been enabled via the Kubernetes Encryption Documentation, then back up the /etc/cray/kubernetes/encryption directory on the master node before upgrading. The directory needs to be restored after the node has been rebuilt, and the kube-apiserver on the node should be restarted. See Kubernetes kube-apiserver Failing for details on how to restart the kube-apiserver.
- This script should be run from ncn-m001 when rebuilding ncn-m002 or ncn-m003.
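The ceph mount check in the notes above can also be scripted rather than read by eye. The following is a minimal sketch: mount_type_for is a hypothetical helper, not part of CSM, and it assumes the usual Linux `mount` output layout of `<source> on <target> type <fstype> (<options>)`.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not shipped with CSM).
# Reads `mount` output on stdin and prints the filesystem type of the
# given mount point. Assumes lines of the form:
#   <source> on <target> type <fstype> (<options>)
mount_type_for() {
    local path="${1%/}"   # drop a trailing slash so it matches mount's output
    awk -v p="$path" '$2 == "on" && $3 == p && $4 == "type" { print $5 }'
}
```

A possible usage is `mount | mount_type_for /etc/cray/upgrade/csm/`, which should print ceph when the directory is mounted as expected and print nothing when it is not a mount point.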
(ncn-m#) Rebuild the desired master node. Replace ncn-m002 with the desired node to rebuild:
/usr/share/doc/csm/upgrade/scripts/rebuild/ncn-rebuild-master-nodes.sh ncn-m002
NOTES:
- This script should be run from ncn-m002 when rebuilding ncn-m001.
- This script should be run from ncn-m001 when rebuilding ncn-m002 or ncn-m003.
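The rule in the notes above for choosing where to run the master rebuild script can be written down as a small lookup. The following is a minimal sketch: rebuild_run_host is a hypothetical helper, not part of CSM, and it covers only the three master nodes named in the notes.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not shipped with CSM).
# Prints the node the master rebuild script should be run from,
# following the notes above: ncn-m001 is rebuilt from ncn-m002,
# and ncn-m002 or ncn-m003 are rebuilt from ncn-m001.
rebuild_run_host() {
    case "$1" in
        ncn-m001)
            echo "ncn-m002"
            ;;
        ncn-m002|ncn-m003)
            echo "ncn-m001"
            ;;
        *)
            echo "not a master node covered by the notes: '$1'" >&2
            return 1
            ;;
    esac
}
```

For example, `rebuild_run_host ncn-m003` prints ncn-m001, while a worker hostname produces an error instead of a guess.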
Follow each step below:
Restore any configurations for the node that are not automatically performed by CFS live node personalization. For example, SSH configuration files.
After completing all of the steps, run the Final Validation steps.