The etcd cluster that serves Kubernetes on master nodes is backed up every 10 minutes. These backups are pushed to Ceph Rados Gateway (S3).
Restoring the etcd cluster from backup is only meant to be used in a catastrophic scenario in which the Kubernetes cluster and master nodes are being rebuilt. This procedure shows how to restore the bare-metal etcd cluster from a Simple Storage Service (S3) snapshot.
The etcd cluster needs to be restored from a backup when the Kubernetes cluster and master nodes are being rebuilt.
The Kubernetes cluster on master nodes is being rebuilt.
Retrieve the S3 credentials for the Etcd-Backup user.
The following command must be run from a storage node.
ncn# radosgw-admin user info --uid Etcd-Backup | jq -r '.keys'
[
{
"user": "Etcd-Backup",
"access_key": "<value>",
"secret_key": "<value>"
}
]
Note the returned access_key and secret_key values.
Configure the S3 client credentials.
The first NCN master node in a deployment contains helper scripts for backup and restore. Use the values from the previous step to update the credentials.json file in the /opt/cray/platform-utils/s3 directory.
ncn# cd /opt/cray/platform-utils/s3
ncn# vim credentials.json
{
"access_key": "<value>",
"secret_key": "<value>",
"endpoint_url": "http://rgw-vip"
}
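As an alternative to editing the file by hand, the credentials file can be written non-interactively. The following is a minimal sketch, not part of the shipped helper scripts: the function name write_s3_credentials is hypothetical, and it assumes the key values contain no characters that would need JSON escaping (such as a double quote or backslash).

```shell
#!/usr/bin/env bash
# write_s3_credentials is a hypothetical helper, not one of the shipped scripts.
# Usage: write_s3_credentials ACCESS_KEY SECRET_KEY [FILE]
write_s3_credentials() {
    local access_key="$1" secret_key="$2" file="${3:-credentials.json}"
    # Run in a subshell so the umask change does not leak to the caller.
    # With umask 077 a newly created file is readable by its owner only;
    # the permissions of an already existing file are left as they are.
    (
        umask 077
        printf '{\n  "access_key": "%s",\n  "secret_key": "%s",\n  "endpoint_url": "http://rgw-vip"\n}\n' \
            "$access_key" "$secret_key" > "$file"
    )
}

# Example, run from /opt/cray/platform-utils/s3:
#   write_s3_credentials "<access_key value>" "<secret_key value>"
```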
Select a snapshot to restore.
The following command lists the available backups. It must be run from the /opt/cray/platform-utils/s3 directory.
ncn# ./list-objects.py --bucket-name etcd-backup
bare-metal/etcd-backup-2020-02-04-18-00-10.tar.gz
bare-metal/etcd-backup-2020-02-04-18-10-06.tar.gz
bare-metal/etcd-backup-2020-02-04-18-20-02.tar.gz
bare-metal/etcd-backup-2020-02-04-18-30-10.tar.gz
bare-metal/etcd-backup-2020-02-04-18-40-06.tar.gz
bare-metal/etcd-backup-2020-02-04-18-50-03.tar.gz
Note the file name for the desired snapshot/backup.
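When the most recent backup is the one wanted, it can be picked out of the listing mechanically. The sketch below assumes the listing prints one object key per line, as in the example output above; latest_snapshot is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# latest_snapshot is a hypothetical helper. It reads object keys on stdin
# and prints the key with the newest timestamp.
latest_snapshot() {
    # The timestamp in each key is zero-padded YYYY-MM-DD-HH-MM-SS, so a
    # plain byte-wise sort puts the keys in chronological order.
    grep 'etcd-backup-.*\.tar\.gz$' | LC_ALL=C sort | tail -n 1
}

# Example, run from /opt/cray/platform-utils/s3:
#   ./list-objects.py --bucket-name etcd-backup | latest_snapshot
```

The newest snapshot is not always the right choice; if the cluster was already damaged when that backup was taken, an earlier key from the listing may be the better one to restore.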
Download the snapshot and copy it to all NCN master nodes.
Retrieve the backup from S3 and uncompress it.
ncn# mkdir /tmp/etcd_restore
ncn# cd /opt/cray/platform-utils/s3
ncn# ./download-file.py --bucket-name etcd-backup \
--key-name bare-metal/etcd-backup-2020-02-04-18-50-03.tar.gz \
--file-name /tmp/etcd_restore/etcd-backup-2020-02-04-18-50-03.tar.gz
ncn# cd /tmp/etcd_restore
ncn# gunzip etcd-backup-2020-02-04-18-50-03.tar.gz
ncn# tar -xvf etcd-backup-2020-02-04-18-50-03.tar
ncn# mv etcd-backup-2020-02-04-18-50-03/etcd-dump.bin /tmp
Push the file to the other NCN master nodes.
ncn# scp /tmp/etcd-dump.bin ncn-m002:/tmp
ncn# scp /tmp/etcd-dump.bin ncn-m003:/tmp
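The file and directory names used in the download commands above all follow from the chosen S3 key. The sketch below shows that derivation so that a different snapshot can be substituted without retyping each name; archive_name and extracted_dir are hypothetical helper names, and the extracted directory name is assumed to match the archive name, as it does in the example above.

```shell
#!/usr/bin/env bash
# Hypothetical helpers that derive local names from an S3 key such as
# bare-metal/etcd-backup-2020-02-04-18-50-03.tar.gz

archive_name() {
    # Drop the "bare-metal/" prefix, leaving the archive file name.
    basename "$1"
}

extracted_dir() {
    # Drop the prefix and the ".tar.gz" suffix, leaving the name of the
    # directory that is assumed to appear after extraction.
    basename "$1" .tar.gz
}

# Example:
#   key=bare-metal/etcd-backup-2020-02-04-18-50-03.tar.gz
#   mv "/tmp/etcd_restore/$(extracted_dir "$key")/etcd-dump.bin" /tmp
```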
Prepare to restore the member directory for ncn-m001.
Log in as root to ncn-m001.
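Create the temporary /tmp/etcd_restore directory. The -p option avoids an error if the directory already exists, which is the case on the node where the snapshot was downloaded.
ncn-m001# mkdir -p /tmp/etcd_restore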
Change to the /tmp/etcd_restore directory.
ncn-m001# cd /tmp/etcd_restore
Retrieve the initial-cluster and initial-advertise-peer-urls values from the kubeadmcfg.yaml file.
The returned values will be used in the next step.
ncn-m001# grep -e initial-cluster: -e initial-advertise-peer-urls: \
/etc/kubernetes/kubeadmcfg.yaml
initial-cluster: ncn-m001=https://10.252.1.7:2380,ncn-m002=https://10.252.1.8:2380,ncn-m003=https://10.252.1.9:2380
initial-advertise-peer-urls: https://10.252.1.7:2380
Restore the member directory.
ncn-m001# ETCDCTL_API=3 etcdctl --cacert /etc/kubernetes/pki/etcd/ca.crt \
--cert /etc/kubernetes/pki/etcd/server.crt \
--key /etc/kubernetes/pki/etcd/server.key \
--name ncn-m001 \
--initial-cluster ncn-m001=https://10.252.1.7:2380,ncn-m002=https://10.252.1.8:2380,ncn-m003=https://10.252.1.9:2380 \
--initial-cluster-token tkn \
--initial-advertise-peer-urls https://10.252.1.7:2380 \
snapshot restore /tmp/etcd-dump.bin
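Copying the two values from the grep output by hand invites typos in a long URL list. The sketch below reads them into shell variables instead. It assumes each key sits on a single line in the form shown in the grep output above, with an unquoted value; kubeadm_value is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# kubeadm_value is a hypothetical helper.
# Usage: kubeadm_value KEY [FILE]
kubeadm_value() {
    local key="$1" file="${2:-/etc/kubernetes/kubeadmcfg.yaml}"
    # Match "KEY:" after optional indentation and print the rest of the line.
    # Requiring the colon right after KEY keeps "initial-cluster" from
    # matching longer keys such as "initial-cluster-state".
    sed -n "s/^[[:space:]]*${key}:[[:space:]]*//p" "$file" | head -n 1
}

# Example:
#   cluster="$(kubeadm_value initial-cluster)"
#   peer_url="$(kubeadm_value initial-advertise-peer-urls)"
```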
Prepare to restore the member directory for ncn-m002.
Log in as root to ncn-m002.
Create a new temporary /tmp/etcd_restore directory.
ncn-m002# mkdir /tmp/etcd_restore
Change to the /tmp/etcd_restore directory.
ncn-m002# cd /tmp/etcd_restore
Retrieve the initial-cluster and initial-advertise-peer-urls values from the kubeadmcfg.yaml file.
The returned values will be used in the next step.
ncn-m002# grep -e initial-cluster: -e initial-advertise-peer-urls: \
/etc/kubernetes/kubeadmcfg.yaml
initial-cluster: ncn-m001=https://10.252.1.7:2380,ncn-m002=https://10.252.1.8:2380,ncn-m003=https://10.252.1.9:2380
initial-advertise-peer-urls: https://10.252.1.8:2380
Restore the member directory.
ncn-m002# ETCDCTL_API=3 etcdctl --cacert /etc/kubernetes/pki/etcd/ca.crt \
--cert /etc/kubernetes/pki/etcd/server.crt \
--key /etc/kubernetes/pki/etcd/server.key \
--name ncn-m002 \
--initial-cluster ncn-m001=https://10.252.1.7:2380,ncn-m002=https://10.252.1.8:2380,ncn-m003=https://10.252.1.9:2380 \
--initial-cluster-token tkn \
--initial-advertise-peer-urls https://10.252.1.8:2380 \
snapshot restore /tmp/etcd-dump.bin
Prepare to restore the member directory for ncn-m003.
Log in as root to ncn-m003.
Create a new temporary /tmp/etcd_restore directory.
ncn-m003# mkdir /tmp/etcd_restore
Change to the /tmp/etcd_restore directory.
ncn-m003# cd /tmp/etcd_restore
Retrieve the initial-cluster and initial-advertise-peer-urls values from the kubeadmcfg.yaml file.
The returned values will be used in the next step.
ncn-m003# grep -e initial-cluster: -e initial-advertise-peer-urls: \
/etc/kubernetes/kubeadmcfg.yaml
initial-cluster: ncn-m001=https://10.252.1.7:2380,ncn-m002=https://10.252.1.8:2380,ncn-m003=https://10.252.1.9:2380
initial-advertise-peer-urls: https://10.252.1.9:2380
Restore the member directory.
ncn-m003# ETCDCTL_API=3 etcdctl --cacert /etc/kubernetes/pki/etcd/ca.crt \
--cert /etc/kubernetes/pki/etcd/server.crt \
--key /etc/kubernetes/pki/etcd/server.key \
--name ncn-m003 \
--initial-cluster ncn-m001=https://10.252.1.7:2380,ncn-m002=https://10.252.1.8:2380,ncn-m003=https://10.252.1.9:2380 \
--initial-cluster-token tkn \
--initial-advertise-peer-urls https://10.252.1.9:2380 \
snapshot restore /tmp/etcd-dump.bin
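The three restore commands above differ only in the --name and --initial-advertise-peer-urls values. The sketch below wraps the same etcdctl invocation in a function so those per-node values are the only things typed on each node. The function name restore_member is hypothetical; the flags and certificate paths are the ones used in the steps above.

```shell
#!/usr/bin/env bash
# restore_member is a hypothetical wrapper around the etcdctl call above.
# Usage: restore_member NAME INITIAL_CLUSTER PEER_URL [SNAPSHOT]
restore_member() {
    local name="$1" cluster="$2" peer_url="$3" snapshot="${4:-/tmp/etcd-dump.bin}"
    ETCDCTL_API=3 etcdctl \
        --cacert /etc/kubernetes/pki/etcd/ca.crt \
        --cert /etc/kubernetes/pki/etcd/server.crt \
        --key /etc/kubernetes/pki/etcd/server.key \
        --name "$name" \
        --initial-cluster "$cluster" \
        --initial-cluster-token tkn \
        --initial-advertise-peer-urls "$peer_url" \
        snapshot restore "$snapshot"
}

# Example, run on ncn-m002 from /tmp/etcd_restore:
#   restore_member ncn-m002 \
#       ncn-m001=https://10.252.1.7:2380,ncn-m002=https://10.252.1.8:2380,ncn-m003=https://10.252.1.9:2380 \
#       https://10.252.1.8:2380
```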
Stop the currently running cluster.
If etcd is running, run the following command on all three master nodes (ncn-m001, ncn-m002, ncn-m003).
Stop the cluster on ncn-m001.
ncn-m001# systemctl stop etcd
Stop the cluster on ncn-m002.
ncn-m002# systemctl stop etcd
Stop the cluster on ncn-m003.
ncn-m003# systemctl stop etcd
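Because the stop is only needed when etcd is running, the check and the stop can be combined. This is a minimal sketch; stop_etcd_if_running is a hypothetical helper name, and the unit name etcd is taken from the commands above.

```shell
#!/usr/bin/env bash
# stop_etcd_if_running is a hypothetical helper.
stop_etcd_if_running() {
    # "systemctl is-active --quiet" exits 0 only when the unit is active.
    if systemctl is-active --quiet etcd; then
        # Report success only if the stop command itself succeeded.
        systemctl stop etcd && echo "etcd stopped"
    else
        echo "etcd not running"
    fi
}

# Example, run on each master node:
#   stop_etcd_if_running
```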
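Start the restored cluster on each master node.
Run the following commands on all three master nodes (ncn-m001, ncn-m002, ncn-m003) to start the restored cluster. Run them from the /tmp/etcd_restore directory, because the mv command uses a path relative to the directory in which the restore was run.
Start the cluster on ncn-m001.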
ncn-m001# rm -rf /var/lib/etcd/member
ncn-m001# mv ncn-m001.etcd/member/ /var/lib/etcd/
ncn-m001# systemctl start etcd
Start the cluster on ncn-m002.
ncn-m002# rm -rf /var/lib/etcd/member
ncn-m002# mv ncn-m002.etcd/member/ /var/lib/etcd/
ncn-m002# systemctl start etcd
Start the cluster on ncn-m003.
ncn-m003# rm -rf /var/lib/etcd/member
ncn-m003# mv ncn-m003.etcd/member/ /var/lib/etcd/
ncn-m003# systemctl start etcd
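The rm -rf in the commands above runs before the mv, so a missing or misnamed restore directory would leave the node with no member directory at all. The sketch below performs the same swap but refuses to delete anything unless the restored directory is present. install_member is a hypothetical helper name; the default paths are the ones used in this procedure.

```shell
#!/usr/bin/env bash
# install_member is a hypothetical helper.
# Usage: install_member NAME [RESTORE_DIR] [DATA_DIR]
install_member() {
    local name="$1" restore_dir="${2:-/tmp/etcd_restore}" data_dir="${3:-/var/lib/etcd}"
    local restored="$restore_dir/$name.etcd/member"
    # Refuse to delete anything unless the restored directory is present.
    if [ ! -d "$restored" ]; then
        echo "missing $restored; leaving $data_dir untouched" >&2
        return 1
    fi
    rm -rf "$data_dir/member"
    mv "$restored" "$data_dir/"
}

# Example, run on ncn-m001 with etcd stopped:
#   install_member ncn-m001 && systemctl start etcd
```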
Confirm the membership of the cluster.
ncn-m001# ETCDCTL_API=3 etcdctl --cacert /etc/kubernetes/pki/etcd/ca.crt \
--cert /etc/kubernetes/pki/etcd/server.crt \
--key /etc/kubernetes/pki/etcd/server.key member list
448a8d056377359a, started, ncn-m001, https://10.252.1.7:2380, https://10.252.1.7:2379,https://127.0.0.1:2379
986f6ff2a30b01cb, started, ncn-m002, https://10.252.1.8:2380, https://10.252.1.8:2379,https://127.0.0.1:2379
d5a8e497e2788510, started, ncn-m003, https://10.252.1.9:2380, https://10.252.1.9:2379,https://127.0.0.1:2379
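The member list above can also be checked mechanically rather than by eye. The sketch below counts members whose status column is anything other than started. It assumes the default comma-separated output format shown above; count_not_started is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# count_not_started is a hypothetical helper. It reads "member list" output
# on stdin and prints the number of members whose status is not "started".
count_not_started() {
    # Fields are separated by a comma and a space; the status is field 2.
    # Blank lines are skipped.
    awk -F', ' 'NF && $2 != "started" { n++ } END { print n + 0 }'
}

# Example: pipe the output of the member list command above into
# count_not_started. A result of 0 means every listed member has started.
```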