Failed to start etcd on Master NCN

When deploying the final NCN, etcd may occasionally fail to rejoin the etcd cluster. This procedure provides steps to recover from this issue.
Identify the unhealthy member.
Run etcdctl member list on each master node:
etcdctl --endpoints https://127.0.0.1:2379 --cert /etc/kubernetes/pki/etcd/peer.crt --key /etc/kubernetes/pki/etcd/peer.key --cacert /etc/kubernetes/pki/etcd/ca.crt member list
Example output from a healthy master node (run from ncn-m002):
60a0d077eb0db20f, started, ncn-m001, https://10.252.1.4:2380, https://10.252.1.4:2379,https://127.0.0.1:2379, false
b0dd65d7036d6932, started, ncn-m003, https://10.252.1.6:2380, https://10.252.1.6:2379,https://127.0.0.1:2379, false
c0d7b0944e709721, started, ncn-m002, https://10.252.1.5:2380, https://10.252.1.5:2379,https://127.0.0.1:2379, false
Example output from an unhealthy master node (run from ncn-m001):
{"level":"warn","ts":"2023-03-06T17:44:25.725Z","logger":"etcd-client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc00022e000/#initially=[https://127.0.0.1:2379]","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused\""}
Error: context deadline exceeded
In the example above, ncn-m001 is unhealthy. The remainder of these steps show how to remove ncn-m001 from the etcd cluster and re-add it.
Stop etcd on the unhealthy NCN (ncn-m001 is used as an example):
systemctl stop etcd
Remove and re-add the unhealthy member from the cluster on a healthy NCN (ncn-m002 in this example):
Determine the member id, name (same as the NCN name), and peer-urls from the member list output above:
60a0d077eb0db20f, started, ncn-m001, https://10.252.1.4:2380, https://10.252.1.4:2379,https://127.0.0.1:2379, false
^^^^^^^^^^^^^^^^           ^^^^^^^^  ^^^^^^^^^^^^^^^^^^^^^^^
   member id                 name               peer-urls
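The fields above can also be pulled out programmatically. This is a minimal sketch that splits a member list line on its ", " separators; it runs against a copy of the example line rather than a live cluster (the field layout assumed here is: ID, status, name, peer URLs, client URLs, learner flag):

```shell
# Illustrative only: extract member id, name, and peer-urls from one
# line of `etcdctl member list` output by splitting on ", ".
line='60a0d077eb0db20f, started, ncn-m001, https://10.252.1.4:2380, https://10.252.1.4:2379,https://127.0.0.1:2379, false'
echo "$line" | awk -F', ' '{print "id="$1, "name="$3, "peer-urls="$4}'
```

This prints `id=60a0d077eb0db20f name=ncn-m001 peer-urls=https://10.252.1.4:2380`, the three values needed for the remove and add commands below.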
Using the values above, remove the member:
etcdctl --endpoints https://127.0.0.1:2379 --cert /etc/kubernetes/pki/etcd/peer.crt --key /etc/kubernetes/pki/etcd/peer.key --cacert /etc/kubernetes/pki/etcd/ca.crt member remove 60a0d077eb0db20f
Example output:
Member 60a0d077eb0db20f removed from cluster f1c6e6ee71e931c3
Then re-add the member:
etcdctl --endpoints https://127.0.0.1:2379 --cert /etc/kubernetes/pki/etcd/peer.crt --key /etc/kubernetes/pki/etcd/peer.key --cacert /etc/kubernetes/pki/etcd/ca.crt member add ncn-m001 --peer-urls=https://10.252.1.4:2380
Example output:
Member be55f20f284cbc1b added to cluster f1c6e6ee71e931c3
ETCD_NAME="ncn-m001"
ETCD_INITIAL_CLUSTER="ncn-m003=https://10.252.1.6:2380,ncn-m001=https://10.252.1.4:2380,ncn-m002=https://10.252.1.5:2380"
ETCD_INITIAL_ADVERTISE_PEER_URLS="https://10.252.1.4:2380"
ETCD_INITIAL_CLUSTER_STATE="existing"
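The ETCD_* lines above are informational output from member add: they show the configuration a member needs in order to join as part of an existing cluster. In particular, ETCD_INITIAL_CLUSTER is a comma-separated list of name=peer-url pairs, which can be split for inspection. A minimal sketch using the values from the example output:

```shell
# Illustrative only: split ETCD_INITIAL_CLUSTER into one name=peer-url
# pair per line, using the values from the example output above.
ETCD_INITIAL_CLUSTER="ncn-m003=https://10.252.1.6:2380,ncn-m001=https://10.252.1.4:2380,ncn-m002=https://10.252.1.5:2380"
echo "$ETCD_INITIAL_CLUSTER" | tr ',' '\n'
```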
The member list should now show the re-added member as unstarted, under its new member id and with an empty name field:
etcdctl --endpoints https://127.0.0.1:2379 --cert /etc/kubernetes/pki/etcd/peer.crt --key /etc/kubernetes/pki/etcd/peer.key --cacert /etc/kubernetes/pki/etcd/ca.crt member list
Example output:
c0d7b0944e709721, started, ncn-m002, https://10.252.1.5:2380, https://10.252.1.5:2379,https://127.0.0.1:2379, false
b0dd65d7036d6932, started, ncn-m003, https://10.252.1.6:2380, https://10.252.1.6:2379,https://127.0.0.1:2379, false
be55f20f284cbc1b, unstarted, , https://10.252.1.4:2380, , false
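An unstarted member can also be spotted programmatically by checking the status field (the second comma-separated column). A small sketch that runs against a copy of the example output above rather than a live cluster:

```shell
# Illustrative only: print the id of any member whose status column is
# not "started". The here-doc stands in for live `member list` output.
awk -F', ' '$2 != "started" {print "unstarted member:", $1}' <<'EOF'
c0d7b0944e709721, started, ncn-m002, https://10.252.1.5:2380, https://10.252.1.5:2379,https://127.0.0.1:2379, false
b0dd65d7036d6932, started, ncn-m003, https://10.252.1.6:2380, https://10.252.1.6:2379,https://127.0.0.1:2379, false
be55f20f284cbc1b, unstarted, , https://10.252.1.4:2380, , false
EOF
```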
Remove the etcd member data on the unhealthy NCN (ncn-m001):
rm -rf /var/lib/etcd/member
On ncn-m001, set the etcd service to start with the initial cluster state set to existing:
sed -i 's/new/existing/' /etc/systemd/system/etcd.service /srv/cray/resources/common/etcd/etcd.service
systemctl daemon-reload
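The sed command above flips the value of etcd's initial-cluster-state setting in the unit files from new to existing, so that on restart the member joins the existing cluster instead of trying to bootstrap a new one. A minimal illustration of the substitution; the sample flag line is an assumption for demonstration, not the literal contents of the unit file:

```shell
# Illustrative only: the substitution the sed command applies.
# This flag line is a simplified assumption, not the real unit file.
echo '--initial-cluster-state new \' | sed 's/new/existing/'
```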
Start etcd on the unhealthy NCN (ncn-m001):
systemctl start etcd
The member list on all three master nodes should now show ncn-m001 back in the cluster with its new member id:
etcdctl --endpoints https://127.0.0.1:2379 --cert /etc/kubernetes/pki/etcd/peer.crt --key /etc/kubernetes/pki/etcd/peer.key --cacert /etc/kubernetes/pki/etcd/ca.crt member list
Example output:
b0dd65d7036d6932, started, ncn-m003, https://10.252.1.6:2380, https://10.252.1.6:2379,https://127.0.0.1:2379, false
be55f20f284cbc1b, started, ncn-m001, https://10.252.1.4:2380, https://10.252.1.4:2379,https://127.0.0.1:2379, false
c0d7b0944e709721, started, ncn-m002, https://10.252.1.5:2380, https://10.252.1.5:2379,https://127.0.0.1:2379, false
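As a quick sanity check, the number of members reporting "started" can be counted; in a healthy three-master cluster it should be 3. A sketch that runs against a copy of the example output above rather than a live cluster:

```shell
# Illustrative only: count members whose status column is "started",
# using a copy of the example member list output.
members='b0dd65d7036d6932, started, ncn-m003, https://10.252.1.6:2380, https://10.252.1.6:2379,https://127.0.0.1:2379, false
be55f20f284cbc1b, started, ncn-m001, https://10.252.1.4:2380, https://10.252.1.4:2379,https://127.0.0.1:2379, false
c0d7b0944e709721, started, ncn-m002, https://10.252.1.5:2380, https://10.252.1.5:2379,https://127.0.0.1:2379, false'
started=$(echo "$members" | grep -c ', started,')
echo "started members: $started"
```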
At this point, etcd is healthy. If the NCN has not yet joined the Kubernetes cluster, run the following script to join it now that etcd is healthy:
/srv/cray/scripts/common/kubernetes-cloudinit.sh