Validate Health

Description

Validate that the system is healthy.

Procedure

The following steps can be run from any master NCN.

  1. (ncn-m#) Collect data about the system management platform health.

    /opt/cray/platform-utils/ncnHealthChecks.sh
    /opt/cray/platform-utils/ncnPostgresHealthChecks.sh
    

    NOTE If workers have been removed and the worker count is currently at two, the following failures can be ignored. A recheck will be needed once workers are added and the count returns to three or above.

    • ncnPostgresHealthChecks may report Unable to determine a leader, and one of the three Postgres pods may be in the Pending state.
    • ncnHealthChecks may report Error from server...FAILED - Pod Not Healthy and FAILED DATABASE CHECK, and one of the three Etcd pods may be in the Pending state.

    NOTE If ncn-s001, ncn-s002, or ncn-s003 has been temporarily removed, Ceph may report HEALTH_WARN until the storage node is added back to the cluster.

    • ncnHealthChecks may report FAIL: Ceph's health status is not "HEALTH_OK". If the Ceph health status is HEALTH_WARN, this failure can be ignored.
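When triaging a Ceph result, it can help to classify the status word reported by ceph health against the current situation. A minimal sketch, assuming a hypothetical helper name (check_ceph_status is not part of the shipped utilities):

```shell
# Hypothetical helper: decide whether a Ceph health status is acceptable
# given whether one of the first three storage nodes is known to be
# temporarily removed from the cluster.
check_ceph_status() {
  status="$1"                       # HEALTH_OK, HEALTH_WARN, or HEALTH_ERR
  storage_node_removed="${2:-no}"   # "yes" while ncn-s00[1-3] is removed
  case "$status" in
    HEALTH_OK)   return 0 ;;                            # always healthy
    HEALTH_WARN) [ "$storage_node_removed" = yes ] ;;   # ignorable only then
    *)           return 1 ;;                            # investigate
  esac
}
```

For example, check_ceph_status "$(ceph health | awk '{print $1}')" yes would treat HEALTH_WARN as expected while a storage node is out of the cluster, while HEALTH_ERR would still be flagged for investigation.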
  2. (ncn-m#) Restart the Goss server on all the NCNs. Adjust the command based on the number of master, worker, and storage nodes in the system.

    pdsh -w ncn-m00[1-3],ncn-w00[1-3],ncn-s00[1-3] systemctl restart goss-servers
    
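The pdsh target list must match the actual node counts. As a small sketch, the list can be built from the three counts rather than edited by hand (the helper name is hypothetical, and it assumes single-digit node counts):

```shell
# Hypothetical helper: build the pdsh -w target list from the current
# number of master, worker, and storage NCNs. Assumes 1-9 of each, since
# two-digit hostnames such as ncn-w0010 would need a different range.
ncn_list() {
  printf 'ncn-m00[1-%d],ncn-w00[1-%d],ncn-s00[1-%d]' "$1" "$2" "$3"
}
```

For example, pdsh -w "$(ncn_list 3 5 3)" systemctl restart goss-servers restarts the service on three masters, five workers, and three storage nodes; following up with systemctl is-active goss-servers through the same pdsh invocation confirms the service came back up everywhere.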
  3. If it has not been done previously, record in Vault the admin user password for the management switches in the system.

    See Adding switch admin password to Vault.

  4. (ncn-m#) Validate the health of the various subsystems.

    /opt/cray/tests/install/ncn/automated/ncn-healthcheck-master
    /opt/cray/tests/install/ncn/automated/ncn-healthcheck-worker
    /opt/cray/tests/install/ncn/automated/ncn-healthcheck-storage
    /opt/cray/tests/install/ncn/automated/ncn-kubernetes-checks 
    

    NOTE The following errors can be ignored if <NODE> has been removed and it is one of the first three worker, master, or storage nodes:

    • Server URL: http://<NODE> ... ERROR: Server endpoint could not be reached.

    NOTE If workers have been removed and the worker count is currently at two, then failures for the following tests can be ignored:

    • Kubernetes Postgres Clusters have the Correct Number of Pods 'Running'
    • Kubernetes Postgres Clusters Have Leaders
    • Kubernetes Postgres Check for Replication Lag Across Pods in a Cluster
    • Verify cray etcd is healthy

    A re-check will be needed once workers are added and the count returns to three or above.

    NOTE If a storage node has been added, then ncn-healthcheck-storage failures for the following test may need to be remediated using the information in the test description. After remediation, re-run the ncn-healthcheck-storage tests to verify that all tests pass.

    • Spire Health Check
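When only two workers are present, the known-ignorable failures listed above can be filtered out of the check output before review. A minimal sketch, assuming one failure message per line (the function name is hypothetical):

```shell
# Hypothetical filter: drop output lines that mention the tests whose
# failures are expected while the worker count is two.
filter_ignorable() {
  grep -v \
    -e "Kubernetes Postgres Clusters have the Correct Number of Pods" \
    -e "Kubernetes Postgres Clusters Have Leaders" \
    -e "Kubernetes Postgres Check for Replication Lag" \
    -e "Verify cray etcd is healthy"
}
```

For example, /opt/cray/tests/install/ncn/automated/ncn-kubernetes-checks | filter_ignorable leaves only the failures that still need attention. These filtered failures still require a recheck once the worker count returns to three or above.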

The procedure is complete. Return to Main Page.