Verify that every etcd cluster has the correct number of healthy pods and a healthy cluster database. Any cluster that does not have healthy pods must be either restored from backup or rebuilt.
This procedure requires root privileges.
(ncn-mw#) Check the health of the clusters.
To check the health of the etcd clusters in the services namespace without TLS authentication:
/opt/cray/platform-utils/ncnHealthChecks.sh -s etcd_health_status
Example output:
**************************************************************************
=== Check the Health of the Etcd Clusters in all Namespaces. ===
=== Verify a "healthy" Report for Each Etcd Pod. ===
Fri 10 Mar 2023 07:52:09 PM UTC
### cray-bos-bitnami-etcd-0 ###
127.0.0.1:2379 is healthy: successfully committed proposal: took = 4.166761ms
### cray-bos-bitnami-etcd-1 ###
127.0.0.1:2379 is healthy: successfully committed proposal: took = 4.697124ms
### cray-bos-bitnami-etcd-2 ###
127.0.0.1:2379 is healthy: successfully committed proposal: took = 4.119712ms
[...]
--- PASSED ---
If any of the etcd clusters are not healthy, refer to Restore an etcd Cluster from a Backup.
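On systems with many etcd clusters, the report can be long enough that an unhealthy pod is easy to miss. The following is a minimal sketch, not part of the platform utilities, that filters the report down to the pods that did not print an "is healthy" line. The function name list_unhealthy_etcd_pods is hypothetical, and the sketch assumes the output keeps the "### <pod name> ###" layout shown in the example above.

```shell
# Hypothetical helper: reads the etcd_health_status report on stdin and prints
# the name of every pod that is not followed by an "is healthy" line.
# Assumes each pod section starts with a "### <pod name> ###" header line.
list_unhealthy_etcd_pods() {
    awk '
        /^### / {
            # A new pod section begins; report the previous pod if it never
            # printed a healthy line.
            if (pod != "" && !ok) print pod
            pod = $2
            ok = 0
            next
        }
        / is healthy/ { ok = 1 }
        END { if (pod != "" && !ok) print pod }
    '
}
```

To use it, pipe the check into the function, for example: /opt/cray/platform-utils/ncnHealthChecks.sh -s etcd_health_status | list_unhealthy_etcd_pods. Empty output means every listed pod reported healthy.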
(ncn-mw#) Check the number of pods in each cluster.
Each cluster should contain at least three pods.
/opt/cray/platform-utils/ncnHealthChecks.sh -s etcd_cluster_balance
Example output:
**************************************************************************
=== Check the Number of Pods in Each Cluster. Verify they are Balanced. ===
=== Each cluster should contain at least three pods, but may contain more. ===
=== Ensure that no two pods in a given cluster exist on the same worker node. ===
Fri 10 Mar 2023 07:54:22 PM UTC
cray-bos-bitnami-etcd-0 2/2 Running 0 22h 10.32.0.76 ncn-w002 <none> <none>
cray-bos-bitnami-etcd-1 2/2 Running 0 22h 10.40.0.8 ncn-w003 <none> <none>
cray-bos-bitnami-etcd-2 2/2 Running 0 22h 10.44.0.58 ncn-w001 <none> <none>
[...]
--- PASSED ---
If any etcd cluster has fewer than three pods in a ‘Running’ state, see Restore an etcd Cluster from a Backup.
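The two conditions in the report header (at least three running pods per cluster, and no two pods of a cluster on the same worker node) can also be checked mechanically. The sketch below is illustrative only. The function name check_etcd_cluster_balance is hypothetical, and it assumes that pod names end in -etcd-<number> and that the columns appear in the order shown in the example output (status in column 3, node in column 7).

```shell
# Hypothetical helper: reads the etcd_cluster_balance report on stdin.
# Prints a line for each cluster with fewer than three Running pods, and for
# each worker node that hosts more than one pod of the same cluster.
# Assumes the column layout of the example output; lines that do not start
# with a pod name ending in -etcd-<number> are ignored.
check_etcd_cluster_balance() {
    awk '
        $1 ~ /-etcd-[0-9]+$/ {
            # Derive the cluster name by stripping the pod ordinal.
            cluster = $1
            sub(/-[0-9]+$/, "", cluster)
            seen[cluster] = 1
            if ($3 == "Running") running[cluster]++
            # Report a node the first time a second pod lands on it.
            if (++on_node[cluster SUBSEP $7] == 2)
                printf "%s: multiple pods on %s\n", cluster, $7
        }
        END {
            for (c in seen)
                if (running[c] < 3)
                    printf "%s: only %d Running pods\n", c, running[c]
        }
    '
}
```

For example: /opt/cray/platform-utils/ncnHealthChecks.sh -s etcd_cluster_balance | check_etcd_cluster_balance. No output means that every cluster found in the listing satisfied both conditions.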
(ncn-mw#) Check the health of all etcd clusters’ databases:
/opt/cray/platform-utils/ncnHealthChecks.sh -s etcd_database_health
Example output:
**************************************************************************
=== Check the health of Etcd Cluster's database in the Services Namespace. ===
=== PASS or FAIL status returned. ===
### cray-bos-bitnami-etcd-0 Etcd Database Check: ###
PASS: OK foo fooCheck 1
### cray-bos-bitnami-etcd-1 Etcd Database Check: ###
PASS: OK foo fooCheck 1
### cray-bos-bitnami-etcd-2 Etcd Database Check: ###
PASS: OK foo fooCheck 1
[...]
--- PASSED ---
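As with the earlier checks, the per-pod results can be filtered so that only problem pods remain. The following sketch is an illustration rather than a supported tool. The function name list_failed_etcd_databases is hypothetical, and it assumes that each "### <pod name> Etcd Database Check: ###" header is followed by a single result line, as in the example output above. A pod is reported if its result line does not begin with PASS or if no result line is present.

```shell
# Hypothetical helper: reads the etcd_database_health report on stdin and
# prints every pod whose result line does not start with "PASS".
# A pod header with no result line after it is also reported.
list_failed_etcd_databases() {
    awk '
        /^### .* Etcd Database Check: ###/ {
            # Previous pod never produced a result line.
            if (pod != "") print pod
            pod = $2
            next
        }
        pod != "" && NF > 0 {
            # First non-empty line after a header is treated as the result.
            if ($0 !~ /^PASS/) print pod
            pod = ""
        }
        END { if (pod != "") print pod }
    '
}
```

For example: /opt/cray/platform-utils/ncnHealthChecks.sh -s etcd_database_health | list_failed_etcd_databases. Empty output means every listed pod returned a PASS result.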
If any of the etcd cluster databases are not healthy, then refer to the following procedures: