When an etcd cluster is unhealthy, it must be rebuilt. During the rebuild, the pods that rely on that etcd cluster lose their data. That data must be repopulated before the cluster can return to a healthy state.
The following services need their data repopulated in the etcd cluster:
An etcd cluster was rebuilt. See Rebuild Unhealthy etcd Clusters.
Reconstruct boot session templates for impacted product streams to repopulate data.
Boot preparation information for other product streams can be found in the following locations:
Restore BSS from the etcd backup. See Restore an ETCD Cluster from a Backup.
Reload the firmware images from Nexus. Refer to the Load Firmware from Nexus section in FAS Admin Procedures for more information.
When the etcd cluster is rebuilt, all historic data for firmware actions and all recorded snapshots will be lost.
Image data will be reloaded from Nexus.
Any images that were loaded into FAS outside of Nexus will need to be reloaded using the Load Firmware from RPM or ZIP file section in FAS Admin Procedures.
After the images are reloaded, any actions that were running at the time of the failure will need to be recreated.
Resubscribe the compute nodes and any NCNs that use the ORCA daemon for their State Change Notifications (SCN).
(ncn-m#) Resubscribe all compute nodes.
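# Create a temporary file to hold the list of target hosts.
TMPFILE=$(mktemp)
# List the compute nodes that sat reports as Ready, one nidNNNNNN-nmn hostname per line.
sat status --no-borders --no-headings | grep Ready | grep Compute | awk '{printf("nid%06d-nmn\n",$4);}' > "${TMPFILE}"
# Restart the ORCA daemon on every host in the list.
pdsh -w ^"${TMPFILE}" "systemctl restart cray-orca"
# Remove the temporary host list.
rm -rf "${TMPFILE}"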
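The awk expression in the commands above takes the fourth whitespace-separated field of each matching line and formats it as a zero-padded hostname. The following is a minimal sketch of that formatting only. The helper name `nid_to_host` is hypothetical and is not part of the procedure; it is shown so the expected hostname pattern can be checked before running `pdsh`.

```bash
#!/usr/bin/env bash
# Hypothetical helper (an illustration, not part of the original procedure):
# format a numeric NID with the same awk printf pattern used above.
nid_to_host() {
    echo "$1" | awk '{printf("nid%06d-nmn\n",$1);}'
}

# Example invocations with illustrative NID values.
nid_to_host 5
nid_to_host 1234
```

If the hostnames written to the temporary file do not follow this pattern, inspect the file before restarting the daemon on the listed hosts.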
(ncn-m#) Resubscribe all worker nodes.
NOTE: Modify the -w argument in the following command to reflect the number of worker nodes in the system.
pdsh -w ncn-w00[1-4]-can.local "systemctl restart cray-orca"
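As the note says, the host range has to match the system. One way to avoid editing the range by hand is sketched below. It assumes that worker nodes are numbered contiguously starting at ncn-w001 and relies on the zero-padded range syntax accepted by pdsh host lists. The function name `worker_hosts` is hypothetical and is not part of the procedure.

```bash
#!/usr/bin/env bash
# Hypothetical helper (an illustration, not part of the original procedure):
# build a pdsh host range for the first N worker nodes.
# Assumes workers are numbered contiguously from ncn-w001.
worker_hosts() {
    printf 'ncn-w[001-%03d]-can.local\n' "$1"
}

# Print the range for a system with 6 worker nodes.
worker_hosts 6

# The result could then be passed to pdsh, for example:
# pdsh -w "$(worker_hosts 6)" "systemctl restart cray-orca"
```

For a system with four worker nodes, `worker_hosts 4` yields a range that covers the same hosts as the command above.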