There is a known issue where an Etcd cluster backup will fail if it takes longer than 1 minute to complete.
When this happens, the etcdbackup status contains MultipartUpload: upload multipart failed.
Check for recent backups in the etcd-backup S3 bucket.
ncn-mw# /opt/cray/platform-utils/s3/list-objects.py --bucket-name etcd-backup | grep -v bare-metal
Example output:
cray-bos/etcd.backup_v273436_2022-09-12-13:55:20
cray-bos/etcd.backup_v273436_2022-09-13-13:55:20
cray-bos/etcd.backup_v276316_2022-09-14-13:55:20
cray-bos/etcd.backup_v279196_2022-09-15-13:55:20
cray-bos/etcd.backup_v282076_2022-09-16-13:55:20
cray-bos/etcd.backup_v545936_2022-09-17-13:55:20
cray-bos/etcd.backup_v562933_2022-09-18-13:55:20 <---- Missing backups for 2022-09-19 and 2022-09-20
cray-bss/etcd.backup_v452224_2022-09-20-13:54:09
cray-bss/etcd.backup_v458007_2022-09-19-13:54:09
.
.
.
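When the bucket holds backups for many clusters, picking out the newest backup per cluster by eye is error-prone. The following is a minimal sketch of one way to summarize the listing. The helper names latest_backups and backup_age_hours are hypothetical and are not part of the platform utilities. The sketch assumes that object names follow the cluster/etcd.backup_vREVISION_YYYY-MM-DD-HH:MM:SS pattern shown in the example output, that the timestamps are in UTC, and that GNU date is available.

```shell
#!/bin/sh
# Hypothetical helpers (not part of the platform utilities).

# Read a bucket listing on stdin and print "<cluster> <newest timestamp>"
# for each cluster. Lines that do not match the expected name pattern
# (blank lines, "." filler lines) are ignored.
latest_backups() {
    awk '
        {
            # $1 is the object name; any trailing annotation is dropped.
            n = split($1, parts, "/")
            if (n != 2) next
            m = split(parts[2], f, "_")
            if (m != 3 || f[1] != "etcd.backup") next
            # Fixed-width timestamps sort correctly as plain strings.
            if (f[3] > newest[parts[1]]) newest[parts[1]] = f[3]
        }
        END { for (c in newest) print c, newest[c] }
    ' | sort
}

# Print the age in whole hours of a timestamp such as 2022-09-18-13:55:20.
# Assumption: the timestamp is UTC. An optional second argument overrides
# "now" (epoch seconds). Requires GNU date for the -d option.
backup_age_hours() (
    ts=$1
    now=${2:-$(date -u +%s)}
    epoch=$(date -u -d "${ts%-*} ${ts##*-}" +%s) || exit 1
    echo $(( (now - epoch) / 3600 ))
)

# Possible usage:
#   /opt/cray/platform-utils/s3/list-objects.py --bucket-name etcd-backup |
#       grep -v bare-metal | latest_backups
#   backup_age_hours 2022-09-18-13:55:20
```

A cluster whose newest backup reports an age above 24 hours is a candidate for the status check that follows.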
If the most recent backup in the etcd-backup S3 bucket for a given Etcd cluster is older than 24 hours, then check the status of the etcdbackup resource. This example checks the cray-bos etcdbackup resource:
ncn-mw# kubectl describe etcdbackup cray-bos-etcd-cluster-periodic-backup -n services | grep -A8 "Status:"
Example output:
Status:
Reason: failed to save snapshot (failed to write snapshot (MultipartUpload: upload multipart failed
upload id: 2~V6e_CehW2ULDNNmAgL01mkt2zObm4pg
caused by: RequestCanceled: request context canceled
caused by: context deadline exceeded))
Last Execution Date: 2022-09-20T13:58:04Z
Last Success Date: 2022-09-18-13:55:20Z
Succeeded: false
Events: <none>
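The status above shows the two markers of this known issue: the MultipartUpload: upload multipart failed reason and the context deadline exceeded cause. As a rough sketch, a check like the following could confirm that both are present before applying the fix. The function name is_backup_timeout is hypothetical, and matching on these two strings is an assumption based only on the example output shown here.

```shell
#!/bin/sh
# Hypothetical helper (not part of kubectl or the platform utilities).
# Reads status text on stdin and succeeds only if it contains both
# markers seen in the example output for this known issue.
is_backup_timeout() {
    status_text=$(cat)
    printf '%s\n' "$status_text" |
        grep -q 'MultipartUpload: upload multipart failed' &&
    printf '%s\n' "$status_text" |
        grep -q 'context deadline exceeded'
}

# Possible usage:
#   kubectl describe etcdbackup cray-bos-etcd-cluster-periodic-backup -n services |
#       is_backup_timeout && echo "backup timed out; raise timeoutInSecond"
```

If only one of the markers is present, the failure may have a different cause and the timeout change may not help.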
Add a backupPolicy.timeoutInSecond of 600 to the etcdbackup resource to allow the backup to take up to 10 minutes to complete.
Patch the etcdbackup resource. This example patches the cray-bos etcdbackup resource:
ncn-mw# kubectl patch etcdbackup cray-bos-etcd-cluster-periodic-backup -n services --type=merge -p '{"spec":{"backupPolicy":{"timeoutInSecond": 600}}}'
Example output:
etcdbackup.etcd.database.coreos.com/cray-bos-etcd-cluster-periodic-backup patched
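If a different timeout is needed, or the same patch must be applied to several etcdbackup resources, the patch body can be generated instead of typed by hand. This is a small sketch only. The function name timeout_patch is hypothetical, and it produces the same merge-patch document used in the command above with the number of seconds substituted.

```shell
#!/bin/sh
# Hypothetical helper: print the merge patch that sets
# spec.backupPolicy.timeoutInSecond to the given number of seconds.
# Rejects empty input, non-digits, and values with a leading zero
# (which would not be valid JSON numbers).
timeout_patch() {
    case $1 in
        ''|0*|*[!0-9]*)
            echo "timeout must be a positive whole number of seconds" >&2
            return 1
            ;;
    esac
    printf '{"spec":{"backupPolicy":{"timeoutInSecond": %s}}}\n' "$1"
}

# Possible usage:
#   kubectl patch etcdbackup cray-bos-etcd-cluster-periodic-backup -n services \
#       --type=merge -p "$(timeout_patch 600)"
```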
Temporarily set the backupIntervalInSecond to force a backup every minute. This example patches the cray-bos etcdbackup resource:
ncn-mw# INTERVAL=$(kubectl get etcdbackups cray-bos-etcd-cluster-periodic-backup -n services -o json |
jq -r '.spec.backupPolicy.backupIntervalInSecond')
ncn-mw# TMPINTERVAL=60
ncn-mw# kubectl patch etcdbackup cray-bos-etcd-cluster-periodic-backup -n services --type=json \
-p="[{'op' : 'replace', 'path':'/spec/backupPolicy/backupIntervalInSecond', 'value' : \"$TMPINTERVAL\" }]"
Example output:
etcdbackup.etcd.database.coreos.com/cray-bos-etcd-cluster-periodic-backup patched
Verify that a new backup appears in the etcd-backup S3 bucket. It will take a few minutes for the new backup to show in the list.
ncn-mw# /opt/cray/platform-utils/s3/list-objects.py --bucket-name etcd-backup | grep -v bare-metal
Example output:
cray-bos/etcd.backup_v276316_2022-09-13-13:55:20
cray-bos/etcd.backup_v276316_2022-09-14-13:55:20
cray-bos/etcd.backup_v279196_2022-09-15-13:55:20
cray-bos/etcd.backup_v282076_2022-09-16-13:55:20
cray-bos/etcd.backup_v545936_2022-09-17-13:55:20
cray-bos/etcd.backup_v562933_2022-09-18-13:55:20
cray-bos/etcd.backup_v569459_2022-09-21-07:36:15 <---- A new backup exists for cray-bos Etcd cluster
cray-bss/etcd.backup_v452224_2022-09-20-13:54:09
cray-bss/etcd.backup_v458007_2022-09-19-13:54:09
.
.
.
Restore the backupIntervalInSecond to the original value so backups are not running every minute.
ncn-mw# kubectl patch etcdbackup cray-bos-etcd-cluster-periodic-backup -n services --type=json \
-p="[{'op' : 'replace', 'path':'/spec/backupPolicy/backupIntervalInSecond', 'value' : \"$INTERVAL\" }]"
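The restore command relies on the INTERVAL variable still holding the value captured earlier. If the shell session was restarted, the variable will be empty, and if the field was missing when it was read, jq -r will have printed the literal string null. A guard along the following lines could be run before the restore. The function name is_valid_interval is hypothetical, and treating only positive whole numbers as valid is an assumption.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if the argument looks like a
# positive whole number of seconds. Empty values, the string "null"
# (what jq -r prints for a missing field), and anything containing
# non-digits are rejected.
is_valid_interval() {
    case $1 in
        ''|null|0*|*[!0-9]*) return 1 ;;
        *) return 0 ;;
    esac
}

# Possible usage before running the restore patch:
#   is_valid_interval "$INTERVAL" ||
#       echo "INTERVAL was not captured; determine the intended value first" >&2
```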