Restore Bare-Metal etcd Clusters from an S3 Snapshot

The etcd cluster that serves Kubernetes on master nodes is backed up every 10 minutes. These backups are pushed to Ceph Rados Gateway (S3).

Restoring the etcd cluster from a backup is intended only for a catastrophic scenario in which the Kubernetes cluster and master nodes are being rebuilt. This procedure shows how to restore the bare-metal etcd cluster from a Simple Storage Service (S3) snapshot.

Prerequisites

The Kubernetes cluster on master nodes is being rebuilt.

Procedure

Retrieve the S3 credentials for the Etcd-Backup user.

The following command must be run from a storage node.

ncn# radosgw-admin user info --uid Etcd-Backup | jq -r '.keys'
[
  {
    "user": "Etcd-Backup",
    "access_key": "<value>",
    "secret_key": "<value>"
  }
]

Note the returned access_key and secret_key values.

Configure the S3 client credentials.

The first NCN master node in a deployment contains helper scripts for backup and restore. Use the values from the previous step to update the credentials.json file in the /opt/cray/platform-utils/s3 directory.
```
ncn# cd /opt/cray/platform-utils/s3
ncn# vim credentials.json
{
    "access_key": "<value>",
    "secret_key": "<value>",
    "endpoint_url": "http://rgw-vip"
}
```
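
The credentials file can also be generated rather than edited by hand. The following is a minimal sketch, assuming jq is available and that the key listing has the shape shown in the first step; the function name make_credentials is hypothetical and is not one of the helper scripts.

```shell
# make_credentials ENDPOINT_URL
# Read the JSON array printed by `radosgw-admin user info ... | jq '.keys'`
# on stdin and print a credentials.json document built from its first entry.
make_credentials() {
    jq --arg url "$1" \
        '.[0] | {access_key: .access_key, secret_key: .secret_key, endpoint_url: $url}'
}
```

On a storage node, the output of the radosgw-admin command could be piped into `make_credentials http://rgw-vip` and the result copied to credentials.json on the master node.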

Select a snapshot to restore.

The following command lists the available backups. It must be run from the /opt/cray/platform-utils/s3 directory.

ncn# ./list-objects.py --bucket-name etcd-backup
bare-metal/etcd-backup-2020-02-04-18-00-10.tar.gz
bare-metal/etcd-backup-2020-02-04-18-10-06.tar.gz
bare-metal/etcd-backup-2020-02-04-18-20-02.tar.gz
bare-metal/etcd-backup-2020-02-04-18-30-10.tar.gz
bare-metal/etcd-backup-2020-02-04-18-40-06.tar.gz
bare-metal/etcd-backup-2020-02-04-18-50-03.tar.gz

Note the file name for the desired snapshot/backup.
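
When the most recent backup is the one wanted, the listing can be filtered mechanically. This is a minimal sketch, assuming the object keys keep the zero-padded timestamp pattern shown above, so that a plain lexical sort is also a chronological sort. The function name latest_snapshot is hypothetical.

```shell
# latest_snapshot: read object keys on stdin (one per line) and print the
# newest bare-metal snapshot key. Keys that do not match the expected
# pattern are ignored.
latest_snapshot() {
    grep '^bare-metal/etcd-backup-.*\.tar\.gz$' | LC_ALL=C sort | tail -n 1
}
```

For example, `./list-objects.py --bucket-name etcd-backup | latest_snapshot` would print the last key of the listing above.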

Download the snapshot and copy it to all NCN master nodes.

Retrieve the backup from S3 and uncompress it.

ncn# mkdir /tmp/etcd_restore
ncn# cd /opt/cray/platform-utils/s3
ncn# ./download-file.py --bucket-name etcd-backup \
--key-name bare-metal/etcd-backup-2020-02-04-18-50-03.tar.gz \
--file-name /tmp/etcd_restore/etcd-backup-2020-02-04-18-50-03.tar.gz
ncn# cd /tmp/etcd_restore
ncn# gunzip etcd-backup-2020-02-04-18-50-03.tar.gz
ncn# tar -xvf etcd-backup-2020-02-04-18-50-03.tar
ncn# mv etcd-backup-2020-02-04-18-50-03/etcd-dump.bin /tmp
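
The unpacking commands above can be combined with a check that the archive really contains a dump file. The sketch below assumes the archive layout implied by the mv command (a top-level directory named after the archive, holding etcd-dump.bin); extract_dump is a hypothetical helper name.

```shell
# extract_dump ARCHIVE DEST_DIR
# Unpack a .tar.gz backup next to the archive and move its etcd-dump.bin
# into DEST_DIR. Assumed layout: <archive-name>/etcd-dump.bin.
extract_dump() {
    archive=$1
    dest=$2
    name=$(basename "$archive" .tar.gz)
    workdir=$(dirname "$archive")

    # Fail early if the expected dump file is not in the archive.
    if ! tar -tzf "$archive" | grep -qxF "$name/etcd-dump.bin"; then
        echo "etcd-dump.bin not found in $archive" >&2
        return 1
    fi

    # -z decompresses on the fly, so no separate gunzip step is needed.
    tar -xzf "$archive" -C "$workdir" || return 1
    mv "$workdir/$name/etcd-dump.bin" "$dest/"
}
```

With the file names used above, the call would be `extract_dump /tmp/etcd_restore/etcd-backup-2020-02-04-18-50-03.tar.gz /tmp`.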

Push the file to the other NCN master nodes.

ncn# scp /tmp/etcd-dump.bin ncn-m002:/tmp
ncn# scp /tmp/etcd-dump.bin ncn-m003:/tmp

Prepare to restore the member directory for ncn-m001.

Log in as root to ncn-m001.
Create a new temporary /tmp/etcd_restore directory.
```
ncn-m001# mkdir -p /tmp/etcd_restore
```
Change to the /tmp/etcd_restore directory.
```
ncn-m001# cd /tmp/etcd_restore
```

Retrieve the ‘initial-cluster’ and ‘initial-advertise-peer-urls’ values from the kubeadmcfg.yaml file.

The returned values will be used in the next step.

ncn-m001# grep -e initial-cluster: -e initial-advertise-peer-urls: \
/etc/kubernetes/kubeadmcfg.yaml
initial-cluster: ncn-m001=https://10.252.1.7:2380,ncn-m002=https://10.252.1.8:2380,ncn-m003=https://10.252.1.9:2380
initial-advertise-peer-urls: https://10.252.1.7:2380
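
To avoid copying these long values by hand, they can be captured in shell variables. The following sketch assumes each key appears on its own line as `key: value` with no quoting, as in the output above; cfg_value and peer_url_for are hypothetical helper names.

```shell
# cfg_value KEY FILE
# Print the value of the first "KEY: value" line in FILE.
cfg_value() {
    awk -v key="$1:" '$1 == key { print $2; exit }' "$2"
}

# peer_url_for NAME INITIAL_CLUSTER
# Print the peer URL that the comma-separated INITIAL_CLUSTER string
# assigns to member NAME. Useful as a cross-check against the
# initial-advertise-peer-urls value.
peer_url_for() {
    printf '%s\n' "$2" | tr ',' '\n' |
        awk -F= -v name="$1" '$1 == name { print $2; exit }'
}
```

For example, `INITIAL_CLUSTER=$(cfg_value initial-cluster /etc/kubernetes/kubeadmcfg.yaml)` stores the cluster string, and `peer_url_for ncn-m001 "$INITIAL_CLUSTER"` should then print the same URL as the initial-advertise-peer-urls line on ncn-m001.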

Restore the member directory.

ncn-m001# ETCDCTL_API=3 etcdctl --cacert /etc/kubernetes/pki/etcd/ca.crt \
  --cert /etc/kubernetes/pki/etcd/server.crt \
  --key /etc/kubernetes/pki/etcd/server.key \
  --name ncn-m001 \
  --initial-cluster ncn-m001=https://10.252.1.7:2380,ncn-m002=https://10.252.1.8:2380,ncn-m003=https://10.252.1.9:2380 \
  --initial-cluster-token tkn \
  --initial-advertise-peer-urls https://10.252.1.7:2380 \
  snapshot restore /tmp/etcd-dump.bin
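
The restore command is the same on every master node except for the member name and its peer URL. A minimal sketch of that parameterization, using only the flags from the command above, is shown here; restore_member is a hypothetical wrapper name.

```shell
# restore_member NAME INITIAL_CLUSTER PEER_URL [SNAPSHOT]
# Run the snapshot restore for one member. Only NAME and PEER_URL differ
# between master nodes; SNAPSHOT defaults to /tmp/etcd-dump.bin.
restore_member() {
    name=$1
    initial_cluster=$2
    peer_url=$3
    snapshot=${4:-/tmp/etcd-dump.bin}

    ETCDCTL_API=3 etcdctl --cacert /etc/kubernetes/pki/etcd/ca.crt \
        --cert /etc/kubernetes/pki/etcd/server.crt \
        --key /etc/kubernetes/pki/etcd/server.key \
        --name "$name" \
        --initial-cluster "$initial_cluster" \
        --initial-cluster-token tkn \
        --initial-advertise-peer-urls "$peer_url" \
        snapshot restore "$snapshot"
}
```

As with the command above, the function should be run from the /tmp/etcd_restore directory, because the restored data is written to a directory named after the member (for example, ncn-m001.etcd) under the current working directory.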

Prepare to restore the member directory for ncn-m002.

Log in as root to ncn-m002.
Create a new temporary /tmp/etcd_restore directory.
```
ncn-m002# mkdir /tmp/etcd_restore
```
Change to the /tmp/etcd_restore directory.
```
ncn-m002# cd /tmp/etcd_restore
```

Retrieve the ‘initial-cluster’ and ‘initial-advertise-peer-urls’ values from the kubeadmcfg.yaml file.

The returned values will be used in the next step.

ncn-m002# grep -e initial-cluster: -e initial-advertise-peer-urls: \
/etc/kubernetes/kubeadmcfg.yaml
initial-cluster: ncn-m001=https://10.252.1.7:2380,ncn-m002=https://10.252.1.8:2380,ncn-m003=https://10.252.1.9:2380
initial-advertise-peer-urls: https://10.252.1.8:2380

Restore the member directory.

ncn-m002# ETCDCTL_API=3 etcdctl --cacert /etc/kubernetes/pki/etcd/ca.crt \
  --cert /etc/kubernetes/pki/etcd/server.crt \
  --key /etc/kubernetes/pki/etcd/server.key \
  --name ncn-m002 \
  --initial-cluster ncn-m001=https://10.252.1.7:2380,ncn-m002=https://10.252.1.8:2380,ncn-m003=https://10.252.1.9:2380 \
  --initial-cluster-token tkn \
  --initial-advertise-peer-urls https://10.252.1.8:2380 \
  snapshot restore /tmp/etcd-dump.bin

Prepare to restore the member directory for ncn-m003.

Log in as root to ncn-m003.
Create a new temporary /tmp/etcd_restore directory.
```
ncn-m003# mkdir /tmp/etcd_restore
```
Change to the /tmp/etcd_restore directory.
```
ncn-m003# cd /tmp/etcd_restore
```

Retrieve the ‘initial-cluster’ and ‘initial-advertise-peer-urls’ values from the kubeadmcfg.yaml file.

The returned values will be used in the next step.

ncn-m003# grep -e initial-cluster: -e initial-advertise-peer-urls: \
/etc/kubernetes/kubeadmcfg.yaml
initial-cluster: ncn-m001=https://10.252.1.7:2380,ncn-m002=https://10.252.1.8:2380,ncn-m003=https://10.252.1.9:2380
initial-advertise-peer-urls: https://10.252.1.9:2380

Restore the member directory.

ncn-m003# ETCDCTL_API=3 etcdctl --cacert /etc/kubernetes/pki/etcd/ca.crt \
  --cert /etc/kubernetes/pki/etcd/server.crt \
  --key /etc/kubernetes/pki/etcd/server.key \
  --name ncn-m003 \
  --initial-cluster ncn-m001=https://10.252.1.7:2380,ncn-m002=https://10.252.1.8:2380,ncn-m003=https://10.252.1.9:2380 \
  --initial-cluster-token tkn \
  --initial-advertise-peer-urls https://10.252.1.9:2380 \
  snapshot restore /tmp/etcd-dump.bin

Stop the currently running cluster.

If the cluster is currently running, run the following command on all three master nodes (ncn-m001, ncn-m002, ncn-m003).
1. Stop the cluster on ncn-m001.
```
ncn-m001# systemctl stop etcd
```
2. Stop the cluster on ncn-m002.
```
ncn-m002# systemctl stop etcd
```
3. Stop the cluster on ncn-m003.
```
ncn-m003# systemctl stop etcd
```
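
Because the stop is only needed when etcd is actually running, the check can be made explicit. This is a small sketch, assuming etcd runs as the systemd unit named etcd as in the commands above; stop_etcd_if_running is a hypothetical helper name.

```shell
# stop_etcd_if_running: stop the etcd unit only when systemd reports it
# as active; otherwise print a note and do nothing.
stop_etcd_if_running() {
    if systemctl is-active --quiet etcd; then
        systemctl stop etcd
    else
        echo "etcd is not running; nothing to stop"
    fi
}
```

The function would be run once on each of the three master nodes.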

Start the restored cluster on each master node.

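Run the following commands on all three master nodes (ncn-m001, ncn-m002, ncn-m003) to start the restored cluster. The mv commands use a relative path, so run them from the /tmp/etcd_restore directory where the snapshot was restored.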

Start the cluster on ncn-m001.

ncn-m001# rm -rf /var/lib/etcd/member
ncn-m001# mv ncn-m001.etcd/member/ /var/lib/etcd/
ncn-m001# systemctl start etcd

Start the cluster on ncn-m002.

ncn-m002# rm -rf /var/lib/etcd/member
ncn-m002# mv ncn-m002.etcd/member/ /var/lib/etcd/
ncn-m002# systemctl start etcd

Start the cluster on ncn-m003.

ncn-m003# rm -rf /var/lib/etcd/member
ncn-m003# mv ncn-m003.etcd/member/ /var/lib/etcd/
ncn-m003# systemctl start etcd
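
The commands above delete the old member directory outright. Where disk space allows, a more cautious variant moves it aside first so that it can still be inspected afterwards. The sketch below is an optional alternative and not part of the documented procedure; swap_member_dir and the backup location are hypothetical.

```shell
# swap_member_dir DATA_DIR RESTORED_DIR BACKUP_DIR
# Move DATA_DIR/member into BACKUP_DIR (instead of deleting it), then move
# RESTORED_DIR/member into DATA_DIR. etcd must already be stopped.
swap_member_dir() {
    data_dir=$1
    restored=$2
    backup=$3

    if [ ! -d "$restored/member" ]; then
        echo "no member directory under $restored" >&2
        return 1
    fi
    # Refuse to overwrite an earlier backup.
    if [ -e "$backup/member" ]; then
        echo "$backup/member already exists" >&2
        return 1
    fi
    mkdir -p "$backup" || return 1

    if [ -d "$data_dir/member" ]; then
        mv "$data_dir/member" "$backup/" || return 1
    fi
    mv "$restored/member" "$data_dir/"
}
```

On ncn-m001 this might be called as `swap_member_dir /var/lib/etcd /tmp/etcd_restore/ncn-m001.etcd /tmp/etcd_restore/old`, followed by `systemctl start etcd`.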

Confirm the membership of the cluster.

ncn-m001# ETCDCTL_API=3 etcdctl --cacert /etc/kubernetes/pki/etcd/ca.crt \
--cert /etc/kubernetes/pki/etcd/server.crt \
--key /etc/kubernetes/pki/etcd/server.key member list
448a8d056377359a, started, ncn-m001, https://10.252.1.7:2380, https://10.252.1.7:2379,https://127.0.0.1:2379
986f6ff2a30b01cb, started, ncn-m002, https://10.252.1.8:2380, https://10.252.1.8:2379,https://127.0.0.1:2379
d5a8e497e2788510, started, ncn-m003, https://10.252.1.9:2380, https://10.252.1.9:2379,https://127.0.0.1:2379
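
The membership check can be reduced to a single number. This is a minimal sketch, assuming the default comma-separated output format shown above, in which the second field is the member status; count_started is a hypothetical helper name.

```shell
# count_started: read `etcdctl member list` output on stdin and print how
# many members report the status "started".
count_started() {
    awk -F', ' '$2 == "started" { n++ } END { print n + 0 }'
}
```

Piping the member list command above into count_started should print 3 once all three master nodes have rejoined the cluster.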