Use an existing backup of a healthy etcd cluster to restore an unhealthy cluster to a healthy state.
The commands in this procedure can be run on any master node (ncn-mXXX) or worker node (ncn-wXXX) on the system.
NOTE: etcd clusters can be restored using the automation script or the manual procedure below. The automation script follows the same steps as the manual procedure. If the automation script fails to determine the dates of the backups, follow the manual procedure.
This procedure requires that a backup of a healthy etcd cluster has already been created.
The automation script restores the cluster from the most recent backup if it finds a backup created within the last seven days. If it does not find a backup within the last seven days, it asks the user whether they would like to rebuild the cluster.
ncn-w001# cd /opt/cray/platform-utils/etcd_restore_rebuild_util
# rebuild/restore a single cluster
ncn-w001:/opt/cray/platform-utils/etcd_restore_rebuild_util # ./etcd_restore_rebuild.sh -s cray-bos-etcd
# rebuild/restore multiple clusters
ncn-w001:/opt/cray/platform-utils/etcd_restore_rebuild_util # ./etcd_restore_rebuild.sh -m cray-bos-etcd,cray-uas-mgr-etcd
# rebuild/restore all clusters
ncn-w001:/opt/cray/platform-utils/etcd_restore_rebuild_util # ./etcd_restore_rebuild.sh -a
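To see which cluster names can be passed to the -s or -m options, the EtcdCluster custom resources in the services namespace can be listed. This is an optional convenience check, assuming the clusters are managed by the coreos etcd-operator (the same operator that owns the etcdrestore.etcd.database.coreos.com resources used later in this procedure).
# list the etcd clusters known to the etcd-operator; the NAME column holds the cluster names
ncn-w001# kubectl get etcdclusters.etcd.database.coreos.com -n services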
An example using the automation script is below.
ncn-m001:/opt/cray/platform-utils/etcd_restore_rebuild_util # ./etcd_restore_rebuild.sh -s cray-externaldns-etcd
The following etcd clusters will be restored/rebuilt:
cray-externaldns-etcd
You will be accepting responsibility for any missing data if there is a restore/rebuild over a running etcd k/v. HPE assumes no responsibility.
Proceed restoring/rebuilding? (yes/no)
yes
Proceeding: restoring/rebuilding etcd clusters.
----- Restoring from cray-externaldns/etcd.backup_v8362_2021-08-18-20:00:09
etcdrestore.etcd.database.coreos.com/cray-externaldns-etcd created
- 3/3 Running
Successfully restored cray-externaldns-etcd
etcdrestore.etcd.database.coreos.com "cray-externaldns-etcd" deleted
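After the script reports a successful restore, the member pods of the restored cluster can be checked directly. This optional follow-up uses the same pod-listing pattern shown later in the manual procedure.
# confirm that all members of the restored cluster are Running
ncn-m001# kubectl -n services get pod | grep cray-externaldns-etcd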
List the backups for the desired etcd cluster.
The example below uses the Boot Orchestration Service (BOS).
ncn-w001# kubectl exec -it -n operators \
$(kubectl get pod -n operators | grep etcd-backup-restore | head -1 | awk '{print $1}') \
-c boto3 -- list_backups cray-bos
Example output:
cray-bos/etcd.backup_v108497_2020-03-20-23:42:37
cray-bos/etcd.backup_v125815_2020-03-21-23:42:37
cray-bos/etcd.backup_v143095_2020-03-22-23:42:38
cray-bos/etcd.backup_v160489_2020-03-23-23:42:37
cray-bos/etcd.backup_v176621_2020-03-24-23:42:37
cray-bos/etcd.backup_v277935_2020-03-30-23:52:54
cray-bos/etcd.backup_v86767_2020-03-19-18:00:05
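To select the most recent backup without scanning the list by eye, the output can be sorted on the timestamp embedded in the backup name. The command below is a convenience sketch that assumes the etcd.backup_v<revision>_<YYYY-MM-DD-HH:MM:SS> naming pattern shown above; the -it flags are dropped because no interactive TTY is needed when piping the output.
# print only the newest cray-bos backup by sorting on the date field (third underscore-separated field)
ncn-w001# kubectl exec -n operators \
$(kubectl get pod -n operators | grep etcd-backup-restore | head -1 | awk '{print $1}') \
-c boto3 -- list_backups cray-bos | sort -t_ -k3 | tail -1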
Restore the cluster using a backup.
Replace etcd.backup_v277935_2020-03-30-23:52:54 in the command below with the name of the backup being used.
ncn-w001# kubectl exec -it -n operators \
$(kubectl get pod -n operators | grep etcd-backup-restore | head -1 | awk '{print $1}') \
-c util -- restore_from_backup cray-bos etcd.backup_v277935_2020-03-30-23:52:54
Example output:
etcdrestore.etcd.database.coreos.com/cray-bos-etcd created
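The restore is driven by the EtcdRestore custom resource that was just created. Its status can be inspected while the operator rebuilds the members; this is optional and assumes the coreos etcd-operator custom resources shown elsewhere in this procedure.
# inspect the EtcdRestore resource created by the restore command
ncn-w001# kubectl -n services describe etcdrestores.etcd.database.coreos.com cray-bos-etcd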
Restart the pods for the etcd cluster.
Watch the pods come back online; this may take a couple of minutes. Replace SERVICE_NAME in the command below with the name of the service whose cluster was restored (cray-bos in this example).
ncn-w001# kubectl -n services get pod | grep SERVICE_NAME
Example output:
cray-bos-etcd-498jn7th6p 1/1 Running 0 4h1m
cray-bos-etcd-dj7d894227 1/1 Running 0 3h59m
cray-bos-etcd-tk4pr4kgqk 1/1 Running 0 4
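To follow the pods continuously instead of re-running the command, kubectl's watch flag can be used. This is an optional variation; --line-buffered keeps grep from delaying the streamed output.
# stream pod status changes until all members report Running, then interrupt with Ctrl-C
ncn-w001# kubectl -n services get pod -w | grep --line-buffered cray-bos-etcd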
Delete the EtcdRestore custom resource.
This step makes it possible for future restores to occur. Replace the etcdrestore.etcd.database.coreos.com/cray-bos-etcd value with the name returned when the cluster was restored from the backup earlier in this procedure.
ncn-w001# kubectl -n services delete etcdrestore.etcd.database.coreos.com/cray-bos-etcd
Example output:
etcdrestore.etcd.database.coreos.com "cray-bos-etcd" deleted
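As a quick confirmation that the cleanup succeeded, the EtcdRestore resources in the namespace can be listed again. An empty listing means nothing is left to block future restores.
# verify that no EtcdRestore resources remain
ncn-w001# kubectl -n services get etcdrestores.etcd.database.coreos.com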
Verify that the cray-bos-etcd-client service was created.
ncn# kubectl get service -n services cray-bos-etcd-client
Example of output showing that the service was created:
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
cray-bos-etcd-client ClusterIP 10.28.248.232 <none> 2379/TCP 2m
If the etcd-client service was not created, then repeat the procedure to restore the cluster again.
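Optionally, the health of the restored cluster can be checked from inside one of its member pods. This is a sketch under the assumption that etcdctl is available in the etcd member image and that the members answer on the default client port.
# run an endpoint health check inside the first cray-bos-etcd member pod
ncn-w001# kubectl -n services exec \
$(kubectl -n services get pod | grep cray-bos-etcd | head -1 | awk '{print $1}') \
-- /bin/sh -c "ETCDCTL_API=3 etcdctl endpoint health"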