When an etcd cluster is not healthy, it needs to be rebuilt. During that process, the pods that rely on etcd clusters lose data. That data needs to be repopulated in order for the cluster to go back to a healthy state.
The procedures below repopulate data for each of the services that store their data in the etcd cluster.
Prerequisite: An etcd cluster was rebuilt. See Rebuild Unhealthy etcd Clusters.
Reconstruct boot session templates for impacted product streams to repopulate data.
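If copies of the session template JSON files are available, the templates can be re-created with the BOS CLI. The following is a minimal sketch, assuming the cray CLI is initialized and that sessiontemplate.json and my-template are placeholder names for the saved file and the desired template name:
ncn# cray bos sessiontemplate list --format toml
ncn# cray bos sessiontemplate create --file sessiontemplate.json --name my-template
The list command shows which templates, if any, survived the rebuild; the create command re-adds a missing template from its saved definition.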
Boot preparation information for other product streams can be found in the following locations:
Data is repopulated in BSS when the REDS init job is run.
Get the current REDS job.
ncn-mw# kubectl get -o json -n services job/cray-reds-init |
jq 'del(.spec.template.metadata.labels["controller-uid"], .spec.selector)' > cray-reds-init.json
Delete the cray-reds-init job.
ncn-mw# kubectl delete -n services -f cray-reds-init.json
Restart the cray-reds-init job.
ncn-mw# kubectl apply -n services -f cray-reds-init.json
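To confirm that the job ran to completion and BSS was repopulated, optionally wait on the re-created job; the 5-minute timeout below is an arbitrary example value:
ncn-mw# kubectl wait -n services --for=condition=complete --timeout=5m job/cray-reds-init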
Repopulate the etcd cluster for the Content Projection Service (CPS).
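As an optional check, verify that the CPS pods have all returned to the Running state after the etcd cluster was rebuilt (exact pod names vary by system):
ncn-mw# kubectl get pods -n services | grep cray-cps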
Note: CRUS is deprecated in CSM 1.2.0 and will be removed in CSM 1.5.0. It will be replaced by BOS V2, which will provide similar functionality.
View the progress of existing CRUS sessions.
List the existing CRUS sessions to find the upgrade_id for the desired session.
ncn# cray crus session list --format toml
Example output:
[[results]]
api_version = "1.0.0"
completed = false
failed_label = "failed-nodes"
kind = "ComputeUpgradeSession"
messages = [ "Quiesce requested in step 0: moving to QUIESCING", "All nodes quiesced in step 0: moving to QUIESCED", "Began the boot session for step 0: moving to BOOTING",]
starting_label = "slurm-nodes"
state = "UPDATING"
upgrade_id = "e0131663-dbee-47c2-aa5c-13fe9b110242" <<-- Note this value
upgrade_step_size = 50
upgrade_template_id = "boot-template"
upgrading_label = "upgrading-nodes"
workload_manager_type = "slurm"
Describe the CRUS session to see if the session failed or is stuck.
If the session continued and appears to be in a healthy state, proceed to the BSS section.
ncn# cray crus session describe CRUS_UPGRADE_ID --format toml
Example output:
api_version = "1.0.0"
completed = false
failed_label = "failed-nodes"
kind = "ComputeUpgradeSession"
messages = [ "Quiesce requested in step 0: moving to QUIESCING", "All nodes quiesced in step 0: moving to QUIESCED", "Began the boot session for step 0: moving to BOOTING",]
starting_label = "slurm-nodes"
state = "UPDATING"
upgrade_id = "e0131663-dbee-47c2-aa5c-13fe9b110242"
upgrade_step_size = 50
upgrade_template_id = "boot-template"
upgrading_label = "upgrading-nodes"
workload_manager_type = "slurm"
Find the name of the running CRUS pod.
ncn# kubectl get pods -n services | grep cray-crus
Example output:
cray-crus-549cb9cb5d-jtpqg 3/4 Running 528 25h
Restart the CRUS pod.
Deleting the pod will restart CRUS and start the discovery process for any data recovered in etcd.
ncn# kubectl delete pods -n services POD_NAME
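As an optional check, confirm that Kubernetes has scheduled a replacement pod and that it reaches the Running state (the pod name suffix will differ from the deleted pod):
ncn# kubectl get pods -n services | grep cray-crus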
The etcd cluster for external DNS maintains an ephemeral cache for CoreDNS. There is no reason to back it up. If it is having any issues, then delete it and recreate it.
Save the external DNS configuration.
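The configuration can be captured from the existing EtcdCluster object. The following is a minimal sketch, assuming the object name and namespace shown in the example below:
ncn-mw# kubectl -n services get etcd cray-externaldns-etcd -o yaml > cray-externaldns-etcd.yaml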
Edit the end of each .yaml file to remove the .status, .metadata.uid, .metadata.selfLink, .metadata.resourceVersion, .metadata.generation, and .metadata.creationTimestamp fields.
For example:
apiVersion: etcd.database.coreos.com/v1beta2
kind: EtcdCluster
metadata:
  annotations:
    etcd.database.coreos.com/scope: clusterwide
  labels:
    app.kubernetes.io/name: cray-externaldns-etcd
  name: cray-externaldns-etcd
  namespace: services
spec:
  pod:
    ClusterDomain: ""
    annotations:
      sidecar.istio.io/inject: "false"
    busyboxImage: registry.local/library/busybox:1.28.0-glibc
    persistentVolumeClaimSpec:
      accessModes:
      - ReadWriteOnce
      dataSource: null
      resources:
        requests:
          storage: 1Gi
    resources: {}
  repository: registry.local/coreos/etcd
  size: 3
  version: 3.3.8
Delete the current cluster.
ncn-mw# kubectl -n services delete etcd cray-externaldns-etcd
Recreate the cluster.
ncn-mw# kubectl apply -f cray-externaldns-etcd.yaml
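Optionally, verify that the new cluster is created and that its member pods come up (three members, given size: 3 in the example above):
ncn-mw# kubectl -n services get etcd cray-externaldns-etcd
ncn-mw# kubectl -n services get pods | grep cray-externaldns-etcd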
Run the cray-fas-loader Kubernetes job.
Refer to the “Use the cray-fas-loader Kubernetes Job” section in FAS Admin Procedures for more information.
When the etcd cluster is rebuilt, all historic data for firmware actions and all recorded snapshots will be lost. Image data will need to be reloaded by following the cray-fas-loader Kubernetes job procedure. After images are reloaded, any actions that were running at the time of the failure will need to be recreated.
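After the loader job completes, one optional way to confirm that image records were repopulated is to list them with the FAS CLI (this assumes the cray CLI is initialized; the exact output format will vary by system):
ncn# cray fas images list --format json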
Resubscribe the compute nodes and any NCNs that use the ORCA daemon for their State Change Notifications (SCN).
Resubscribe all compute nodes.
ncn-m# TMPFILE=$(mktemp)
ncn-m# sat status --no-borders --no-headings | grep Ready | grep Compute | awk '{printf("nid%06d-nmn\n",$3);}' > $TMPFILE
ncn-m# pdsh -w ^${TMPFILE} "systemctl restart cray-dvs-orca"
ncn-m# rm -rf $TMPFILE
Resubscribe the NCNs.
NOTE: Modify the -w arguments in the following commands to reflect the number of worker and storage nodes in the system.
ncn-m# pdsh -w ncn-w00[0-4]-can.local "systemctl restart cray-dvs-orca"
ncn-m# pdsh -w ncn-s00[0-4]-can.local "systemctl restart cray-dvs-orca"
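As an optional follow-up check (adjust the -w ranges the same way as above), verify that the daemon is active again on each node:
ncn-m# pdsh -w ncn-w00[0-4]-can.local "systemctl is-active cray-dvs-orca"
ncn-m# pdsh -w ncn-s00[0-4]-can.local "systemctl is-active cray-dvs-orca"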
Restart MEDS.
ncn-mw# kubectl -n services delete pods --selector='app.kubernetes.io/name=cray-meds'
Restart REDS.
ncn-mw# kubectl -n services delete pods --selector='app.kubernetes.io/name=cray-reds'
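Optionally, confirm that replacement MEDS and REDS pods reach the Running state after being deleted:
ncn-mw# kubectl -n services get pods | grep -E 'cray-meds|cray-reds'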