Validate that the system is healthy.
The following procedures can be run from any master or worker node.
Collect data about the system management platform health.
ncn-mw# /opt/cray/platform-utils/ncnHealthChecks.sh
ncn-mw# /opt/cray/platform-utils/ncnPostgresHealthChecks.sh
NOTE: If workers have been removed and the worker count is currently at two, the following failures can be ignored. A re-check will be needed once workers are added and the count returns to three or above.
ncnPostgresHealthChecks may report Unable to determine a leader, and one of the three Postgres pods may be in Pending state.
ncnHealthChecks may report Error from server...FAILED - Pod Not Healthy, FAILED DATABASE CHECK, and one of the three Etcd pods may be in Pending state.
NOTE: If ncn-s001, ncn-s002, or ncn-s003 has been temporarily removed, HEALTH_WARN may be seen until the storage node is added back to the cluster.
ncnHealthChecks may report FAIL: Ceph's health status is not "HEALTH_OK". If Ceph health is HEALTH_WARN, this failure can be ignored.
Restart the Goss server on all the management nodes.
ncn-mw# pdsh -w $(grep -oP 'ncn-\w\d+' /etc/hosts | sort -u | tr -t '\n' ',') \
systemctl restart goss-servers
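The command substitution above builds the pdsh target list from /etc/hosts. The following is a minimal sketch of the same pipeline, wrapped in a hypothetical helper (ncn_hostlist is not part of the platform utilities) and run against a sample hosts file so the result can be inspected before anything is restarted. It assumes GNU grep and GNU tr, as the original command does. The addresses in the sample are made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the comma-separated NCN list that the
# pdsh command above derives from a hosts file (default: /etc/hosts).
ncn_hostlist() {
    local hosts_file=${1:-/etc/hosts}
    # -o prints only the matched text; -P enables Perl-style \w and \d.
    grep -oP 'ncn-\w\d+' "$hosts_file" | sort -u | tr -t '\n' ','
}

# Sample data for illustration only; real entries will differ.
sample=$(mktemp)
cat > "$sample" <<'EOF'
10.252.1.4  ncn-m001.nmn ncn-m001
10.252.1.7  ncn-w001.nmn ncn-w001
10.252.1.10 ncn-s001.nmn ncn-s001
EOF

ncn_hostlist "$sample"; echo   # prints: ncn-m001,ncn-s001,ncn-w001,
rm -f "$sample"
```

The list ends with a comma because the final newline is translated as well. Running ncn_hostlist with no argument shows the list that would be built from the real /etc/hosts.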
Specify the admin user password for the management switches in the system.
read -s is used in order to prevent the password from being echoed to the screen or saved in the shell history.
ncn-mw# read -s SW_ADMIN_PASSWORD
ncn-mw# export SW_ADMIN_PASSWORD
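As a sketch of the same idea, the hypothetical function below (not part of the documented procedure) wraps read -s with a prompt and refuses to export an empty value. It assumes bash, since read -s is not available in every POSIX shell.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper around the two commands above.
set_switch_password() {
    local pw
    # -r keeps backslashes literal; -s suppresses echo on a terminal.
    # The prompt is only shown when input comes from a terminal.
    IFS= read -r -s -p "Switch admin password: " pw || return 1
    # Move to a new line, because the newline typed by the user is not
    # echoed in silent mode either.
    echo >&2
    if [ -z "$pw" ]; then
        echo "Password is empty; SW_ADMIN_PASSWORD not set" >&2
        return 1
    fi
    export SW_ADMIN_PASSWORD=$pw
}
```

Calling set_switch_password once in the current shell, before the tests in the next step, has the same effect as the read and export commands, with the added check for an empty entry.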
Collect data about the various subsystems.
ncn-mw# /opt/cray/tests/install/ncn/automated/ncn-healthcheck-master
ncn-mw# /opt/cray/tests/install/ncn/automated/ncn-healthcheck-worker
ncn-mw# /opt/cray/tests/install/ncn/automated/ncn-healthcheck-storage
ncn-mw# /opt/cray/tests/install/ncn/automated/ncn-kubernetes-checks
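When the output of these suites needs to be kept for comparison against the notes that follow, it can be captured per suite. The sketch below uses a hypothetical helper, run_suites_with_logs, and a log directory chosen only for illustration; neither is part of the platform tooling. It makes no assumption about what exit codes the suites return and simply records them.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run each named suite from a directory, show its
# output on the terminal, and keep a copy in <log_dir>/<suite>.log.
run_suites_with_logs() {
    local suite_dir=$1 log_dir=$2 suite rc
    shift 2
    mkdir -p "$log_dir" || return 1
    for suite in "$@"; do
        # tee shows the output live while saving it; PIPESTATUS[0] is the
        # exit code of the suite itself rather than of tee.
        "$suite_dir/$suite" 2>&1 | tee "$log_dir/$suite.log"
        rc=${PIPESTATUS[0]}
        echo "$suite exit code: $rc" >> "$log_dir/summary.txt"
    done
}

# Example invocation with the four suites from this step (the log path
# is an arbitrary choice):
# run_suites_with_logs /opt/cray/tests/install/ncn/automated /tmp/ncn-health \
#     ncn-healthcheck-master ncn-healthcheck-worker \
#     ncn-healthcheck-storage ncn-kubernetes-checks
```

The saved logs can then be searched for the specific error strings and test names listed in the notes below.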
NOTE: The following errors can be ignored if <NODE> has been removed and it is one of the first three worker, master, or storage nodes:
Server URL: http://<NODE> ... ERROR: Server endpoint could not be reached
NOTE: If workers have been removed and the worker count is currently at two, failures for the following tests can be ignored:
Kubernetes Postgres Clusters have the Correct Number of Pods 'Running'
Kubernetes Postgres Clusters Have Leaders
Kubernetes Postgres Check for Replication Lag Across Pods in a Cluster
Verify cray etcd is healthy
A re-check will be needed once workers are added and the count returns to three or above.
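To decide when that re-check is due, the current worker count can be compared with three. The sketch below assumes worker node names follow the ncn-wNNN pattern; count_workers is a hypothetical helper that reads node names on standard input, for example from kubectl get nodes -o name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count worker nodes in a list of node names read
# from standard input. Accepts plain names or the node/<name> form.
count_workers() {
    # -c prints the number of matching lines; -E enables the (...)? group.
    # grep exits non-zero when the count is 0, so "|| true" keeps callers
    # that use "set -e" from aborting.
    grep -cE '^(node/)?ncn-w[0-9]+$' || true
}

# Example (requires cluster access):
# workers=$(kubectl get nodes -o name | count_workers)
# if [ "$workers" -ge 3 ]; then echo "Re-run the health checks"; fi
```

This counts every node whose name matches the pattern, regardless of its Ready state, so a node that has joined but is not yet Ready is still included.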
NOTE: If a storage node has been added, then ncn-healthcheck-storage failures for the following test may need to be remediated based on the test description information.
After that is done, the ncn-healthcheck-storage tests should then be re-run to verify that all tests pass.
Spire Health Check
The procedure is complete. Return to Main Page.