This document describes how to interpret the results of the HMS Health Check scripts and techniques for troubleshooting when failures occur.
The HMS health checks will not fail if the Cray CLI is not configured. However, some of the troubleshooting suggestions for investigating test failures involve using the CLI. For information on configuring the Cray CLI, see Cray command line interface.
The HMS CT tests are API tests intended to verify that HMS services are installed, operational, and behave as expected. There are two types of CT tests for HMS services: smoke and functional. Both are executed via Helm test jobs that are defined in the Helm chart for the service that they test. The CT smoke and functional tests are invoked using the helm test command on worker or master NCNs (if applicable). Administrators execute the HMS CT tests using a script called run_hms_ct_tests.sh as part of the CSM health validation procedures.
The CT smoke tests are basic API tests that make calls to HMS service APIs and verify that the expected status codes are returned. They run first for each service and are useful for verifying that HMS services are installed and responsive. These tests will fail if the service being tested is not installed, unhealthy, or unresponsive.
The CT functional tests are more rigorous API tests that inspect the response bodies and verify that the fields, values, and form of the data returned are as expected. They run after the smoke tests and verify that HMS service APIs behave correctly and in accordance with their API specification. They also detect issues that prevent the proper management or expected use of hardware in the system.
The run_hms_ct_tests.sh script executes the HMS CT tests in parallel. It waits for each Helm test job to complete, logs the results in a file for the test run, and prints a summary of the results. The script returns a status code of zero if all tests pass and non-zero if there are one or more failures.
Example output:
Log file for run is: /opt/cray/tests/hms_ct_test-<datetime>.log
Running all tests...
DONE.
SUCCESS: All 9 service tests passed: bss, capmc, fas, hbtd, hmnfd, hsm, reds, scsd, sls
The following is example output reporting a single service failure:
Log file for run is: /opt/cray/tests/hms_ct_test-<datetime>.log
Running all tests...
DONE.
FAILURE: 1 service test FAILED (hsm), 8 passed (bss, capmc, fas, hbtd, hmnfd, reds, scsd, sls)
For troubleshooting and manual steps, see: https://github.com/Cray-HPE/docs-csm/blob/main/troubleshooting/hms_ct_manual_run.md
The following is example output reporting multiple service failures:
Log file for run is: /opt/cray/tests/hms_ct_test-<datetime>.log
Running all tests...
DONE.
FAILURE: All 9 service tests FAILED: bss, capmc, fas, hbtd, hmnfd, hsm, reds, scsd, sls
For troubleshooting and manual steps, see: https://github.com/Cray-HPE/docs-csm/blob/main/troubleshooting/hms_ct_manual_run.md
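When the script is run unattended, the summary lines shown above can be parsed rather than read by eye. The following is a minimal sketch, assuming the summary keeps the wording of the examples above. The function names run_log_path and failed_services are hypothetical and are not part of the HMS tooling, and the form used when several (but not all) services fail is inferred from the single-failure example.

```shell
# Hypothetical helpers (not part of the HMS tooling): read the output of
# run_hms_ct_tests.sh on stdin and extract details from its summary lines.

# Print the path of the log file for the run.
run_log_path() {
    sed -n 's/^Log file for run is: //p'
}

# Print the name of each failed service, one per line.
# Prints nothing when all service tests passed.
failed_services() {
    sed -nE \
        -e 's/^FAILURE: All [0-9]+ service tests FAILED: //p' \
        -e 's/^FAILURE: [0-9]+ service tests? FAILED \(([^)]*)\).*/\1/p' |
        tr ',' '\n' |
        sed 's/^ *//'
}
```

For example, if the script output is saved to a file of your choosing, running failed_services with that file on stdin would print hsm for the single-failure example above.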
If one or more service tests fail, the log file for the run should be inspected to determine which test job(s) failed.
The following is an example section of a log file reporting a smoke test failure:
NAME: cray-hms-smd
LAST DEPLOYED: Thu Jun 16 15:46:10 2022
NAMESPACE: services
STATUS: deployed
REVISION: 9
TEST SUITE: cray-hms-smd-test-smoke
Last Started: Fri Jul 1 21:12:58 2022
Last Completed: Fri Jul 1 21:14:25 2022
Phase: Failed
In this case, the HSM smoke test job failed. Find the name of the pod and inspect its logs to determine the cause of the failure.
(ncn-mw#) Find the name of the pod.
kubectl -n services get pods | grep -E "smd|NAME"
Example output:
NAME READY STATUS RESTARTS AGE
cray-hms-smd-test-smoke-2npqz 1/2 NotReady 0 83s
cray-smd-747b59d979-2vvdw 2/2 Running 0 4d6h
cray-smd-747b59d979-c5rhl 2/2 Running 0 4d5h
cray-smd-747b59d979-vcv6c 2/2 Running 0 4d6h
(ncn-mw#) Show its logs.
kubectl -n services logs cray-hms-smd-test-smoke-2npqz smoke
Example output:
Running smoke tests...
...
2022-07-01 21:13:05,853 Testing {"path": "hsm/v2/service/ready", "expected_status_code": 200, "method": "GET", "body": null, "headers": {}, "url": "http://cray-smd/hsm/v2/service/ready"}
2022-07-01 21:13:05,863 Starting new HTTP connection (1): cray-smd:80
2022-07-01 21:13:05,873 FAIL: HTTPConnectionPool(host='cray-smd', port=80): Max retries exceeded with url: /hsm/v2/service/ready (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7faf6fdf6460>: Failed to establish a new connection: [Errno 111] Connection refused'))
...
2022-07-01 21:13:09,282 FAIL: hsm-smoke-tests
2022-07-01 21:13:09,282 failed!
2022-07-01 21:13:09,282 FAIL: hsm-smoke-tests ran with failures
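A run log contains one section like the one shown earlier for each Helm test job, so the failed jobs can be listed by pairing each TEST SUITE line with the Phase line that follows it. The following is a minimal sketch under that assumption; failed_suites is a hypothetical name, not an existing script.

```shell
# Hypothetical helper: read a run log on stdin and print the name of every
# Helm test suite whose Phase is "Failed".
failed_suites() {
    awk '
        # Remember the most recent suite name (third field of the line).
        /^[[:space:]]*TEST SUITE:/ { suite = $3 }
        # Report the remembered suite when its phase is Failed.
        /^[[:space:]]*Phase:/ && $2 == "Failed" { print suite }
    '
}
```

Running failed_suites with the log file for the run on stdin would print cray-hms-smd-test-smoke for the section above. In the examples in this document, the test pod name begins with the suite name, so the printed value can be used as the grep pattern when looking for the pod.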
The following is an example section of a log file reporting a functional test failure:
NAME: cray-hms-smd
LAST DEPLOYED: Thu Jun 16 15:46:10 2022
NAMESPACE: services
STATUS: deployed
REVISION: 9
TEST SUITE: cray-hms-smd-test-functional
Last Started: Fri Jul 1 21:12:58 2022
Last Completed: Fri Jul 1 21:14:25 2022
Phase: Failed
In this case, the HSM functional test job failed. Find the name of the pod and inspect its logs to determine the cause of the failure.
(ncn-mw#) Find the name of the pod.
kubectl -n services get pods | grep -E "smd|NAME"
Example output:
NAME READY STATUS RESTARTS AGE
cray-hms-smd-test-functional-fs8b4 1/2 NotReady 0 61s
cray-hms-smd-test-smoke-2npqz 0/2 Completed 0 103s
cray-smd-747b59d979-2vvdw 2/2 Running 0 4d6h
cray-smd-747b59d979-c5rhl 2/2 Running 0 4d5h
cray-smd-747b59d979-vcv6c 2/2 Running 0 4d6h
(ncn-mw#) Show its logs.
kubectl -n services logs cray-hms-smd-test-functional-fs8b4 functional
Example output:
Running functional tests...
============================= test session starts ==============================
platform linux -- Python 3.10.4, pytest-7.1.2, pluggy-1.0.0 -- /usr/bin/python3
cachedir: .pytest_cache
rootdir: /src/app, configfile: pytest.ini
plugins: tap-3.3, tavern-1.23.1
collecting ... collected 38 items
...
test_components.tavern.yaml::Ensure that we can conduct a query for all Nodes in the Component collection FAILED [ 21%]
...
=================================== FAILURES ===================================
_ /src/app/test_components.tavern.yaml::Ensure that we can conduct a query for all Nodes in the Component collection _
...
Errors:
E tavern.util.exceptions.TestFailError: Test 'Verify the expected response fields for all Nodes' failed:
- Error calling validate function '<function validate_pykwalify at 0x7f34e3ebf6d0>':
Traceback (most recent call last):
File "/usr/lib/python3.10/site-packages/tavern/schemas/files.py", line 106, in verify_generic
verifier.validate()
File "/usr/lib/python3.10/site-packages/pykwalify/core.py", line 194, in validate
raise SchemaError(u"Schema validation failed:\n - {error_msg}.".format(
pykwalify.errors.SchemaError: <SchemaError: error code 2: Schema validation failed:
- Enum 'Alert' does not exist. Path: '/Components/8/Flag' Enum: ['OK'].: Path: '/'>
...
=========================== short test summary info ============================
FAILED test_components.tavern.yaml::Ensure that we can conduct a query for all Nodes in the Component collection
======================== 1 failed, 37 passed in 29.06s =========================
2022-07-01 21:13:09,282 FAIL
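The troubleshooting guidance later in this document is organized by test file name, so it can help to reduce functional test logs to the list of test files that had failures. The following is a minimal sketch that reads the pytest short summary lines (those beginning with FAILED); failed_test_files is a hypothetical name.

```shell
# Hypothetical helper: read functional test pod logs on stdin and print the
# name of each test file with at least one failure, without duplicates.
# Any directory prefix in front of the file name is dropped.
failed_test_files() {
    sed -nE 's#^FAILED ([^:]*/)?([^/:]+)::.*#\2#p' | LC_ALL=C sort -u
}
```

For the log shown above, piping the output of the kubectl logs command into failed_test_files would print test_components.tavern.yaml.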
Tavern is a pytest-based API testing framework. The CT functional tests consist of Tavern tests for HMS services that are written in YAML and are executed via Helm test jobs. This section describes the output format of Tavern and where to look when investigating functional test failures.
First, a summary of the test suites executed and their results is printed:
============================= test session starts ==============================
platform linux -- Python 3.10.4, pytest-7.1.2, pluggy-1.0.0 -- /usr/bin/python3
cachedir: .pytest_cache
rootdir: /src/app, configfile: pytest.ini
plugins: tap-3.3, tavern-1.23.1
collecting ... collected 10 items
test_component_endpoints.tavern.yaml::Query the ComponentEndpoints collection PASSED [ 2%]
test_components.tavern.yaml::Ensure that we can conduct a query for all Nodes in the Component collection PASSED [ 21%]
test_discovery_status.tavern.yaml::Ensure that we can gather the system discovery status information PASSED [ 39%]
test_groups.tavern.yaml::Verify POST, GET, PATCH, and DELETE methods for various /groups APIs PASSED [ 42%]
test_hardware.tavern.yaml::Query the Hardware collection PASSED [ 44%]
test_memberships.tavern.yaml::Ensure that we can gather information from the memberships collection PASSED [ 57%]
test_partitions.tavern.yaml::Verify POST, GET, PATCH, and DELETE methods for various /partitions APIs PASSED [ 60%]
test_redfish_endpoints.tavern.yaml::Ensure that we can gather information from the RedfishEndpoints collection PASSED [ 63%]
test_service_endpoints.tavern.yaml::Query the ServiceEndpoints collection PASSED [ 65%]
test_state_change_notifications.tavern.yaml::Ensure that we can gather information from the state change notifications collection PASSED [100%]
============================= 10 passed in 24.87s ==============================
2022-07-01 21:14:57,296 PASS
When test failures occur, additional output is printed below the summary table that includes the following:
- Source test stage that was executing when the failure occurred. This is a portion of the source code for the failed test case.
- Formatted stage that was executing when the failure occurred. This is a portion of the source code for the failed test case with its variables filled in with the values that were set at the time of the failure. This includes the request header, method, URL, and other data from the failed test case, which is useful for attempting to reproduce the failure manually with curl.
- Errors encountered when processing the API response that caused the failure. This is the first place to look when debugging Tavern test failures.
The following is an example Source test stage:
Source test stage (line 179):
- name: Ensure the boot script service can provide the bootscript for a given node
request:
url: "{base_url}/bss/boot/v1/bootscript?nid={nid}"
method: GET
headers:
Authorization: "Bearer {access_token}"
verify: !bool "{verify}"
response:
status_code: 200
The following is an example Formatted stage:
Formatted stage:
name: Ensure the boot script service can provide the bootscript for a given node
request:
headers:
Authorization: Bearer <REDACTED>
method: GET
url: 'https://api-gw-service-nmn.local/apis/bss/boot/v1/bootscript?nid=None'
verify: !bool 'False'
response:
status_code: 200
The following is an example Errors section:
Errors:
E tavern.util.exceptions.TestFailError: Test 'Ensure the boot script service can provide the bootscript for a given node' failed:
- Status code was 400, expected 200:
{"type": "about:blank", "title": "Bad Request", "detail": "Need a mac=, name=, or nid= parameter", "status": 400}
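The method and URL from a Formatted stage are enough to replay the request by hand. The following is a minimal sketch of such a replay with curl. The function name replay_stage, the TOKEN environment variable, and the DRY_RUN switch are all assumptions of this sketch, not part of the CT tests. The -k option mirrors the verify setting of False in the Formatted stage above.

```shell
# Hypothetical helper: replay the request from a "Formatted stage" with curl.
# Usage: replay_stage METHOD URL
# Assumptions of this sketch: the bearer token is supplied in the TOKEN
# environment variable, and DRY_RUN=1 prints the command instead of running it.
replay_stage() {
    # Rebuild the positional parameters as the full curl command line.
    set -- curl -s -k -X "$1" -H "Authorization: Bearer ${TOKEN:-<token>}" "$2"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        # Printed for inspection only; quoting is not preserved.
        printf '%s\n' "$*"
    else
        "$@"
    fi
}
```

With the example above, the call would be replay_stage GET followed by the quoted URL from the Formatted stage. The nid=None value in that URL shows that the test had no node ID to substitute, which matches the 400 response in the Errors section.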
This section provides guidance for handling specific HMS health check failures that may occur.
run_hms_ct_tests.sh
This script runs the suite of HMS CT tests.
cray-hms-smd-test-functional
This job executes the tests for Hardware State Manager (HSM).
test_components.tavern.yaml and test_hardware.tavern.yaml
These tests require compute nodes to be discovered in HSM.
The following is an example of a failed test execution due to no discovered compute nodes in HSM:
Running functional tests...
============================= test session starts ==============================
platform linux -- Python 3.9.13, pytest-7.1.2, pluggy-1.0.0 -- /usr/bin/python3
cachedir: .pytest_cache
rootdir: /src/app, configfile: pytest.ini
plugins: tavern-1.23.1
collecting ... collected 38 items
...
test_components.tavern.yaml::Ensure that we can conduct a variety of queries on the Components collection FAILED [ 31%]
...
test_hardware.tavern.yaml::Query the Hardware collection for Node information FAILED [ 50%]
...
=================================== FAILURES ===================================
_ /src/app/test_components.tavern.yaml::Ensure that we can conduct a variety of queries on the Components collection _
...
------------------------------ Captured log call -------------------------------
WARNING tavern.util.dict_util:dict_util.py:46 Formatting 'xname' will result in it being coerced to a string (it is a <class 'NoneType'>)
...
_ /src/app/test_hardware.tavern.yaml::Query the Hardware collection for Node information _
...
Errors:
E tavern.util.exceptions.TestFailError: Test 'Retrieve the hardware information for a given node xname from the Hardware collection' failed:
- Status code was 404, expected 200:
{"type": "about:blank", "title": "Not Found", "detail": "no such xname.", "status": 404}
...
------------------------------ Captured log call -------------------------------
WARNING tavern.util.dict_util:dict_util.py:46 Formatting 'node_xname' will result in it being coerced to a string (it is a <class 'NoneType'>)
...
=========================== short test summary info ============================
FAILED test_components.tavern.yaml::Ensure that we can conduct a variety of queries on the Components collection
FAILED test_hardware.tavern.yaml::Query the Hardware collection for Node information
(ncn-mw#) If these failures occur, confirm that there are no discovered compute nodes in HSM.
cray hsm state components list --type Node --role compute --format json
Example output:
{
"Components": []
}
There are several reasons why there may be no discovered compute nodes in HSM.
In the following situations, additional troubleshooting is not warranted and the test failures can be safely ignored:
If none of the above cases are applicable, then the failures warrant additional troubleshooting:
(ncn-mw#) Run the hsm_discovery_status_test.sh script.
/opt/cray/csm/scripts/hms_verification/hsm_discovery_status_test.sh
If the script fails, this indicates a discovery issue and further troubleshooting steps to take are printed.
Otherwise, missing compute nodes in HSM with no discovery failures may indicate a problem with a leaf-bmc switch.
(ncn-mw#) Check to see if the leaf-bmc switch resolves using the nslookup command.
nslookup <leaf-bmc-switch>
Example output:
Server: 10.92.100.225
Address: 10.92.100.225#53
Name: sw-leaf-bmc-001.nmn
Address: 10.252.0.4
(ncn-mw#) Verify connectivity to the leaf-bmc switch.
ssh admin@<leaf-bmc-switch>
Example output:
ssh: connect to host sw-leaf-bmc-001 port 22: Connection timed out
Restoring connectivity, resolving configuration issues, or restarting the relevant ports on the leaf-bmc switch should allow the compute hardware to issue DHCP requests and be discovered successfully.
test_components.tavern.yaml
These tests include checks for healthy node states and flags in HSM.
Hardware problems may cause Warning flags to be set for nodes in HSM.
The following is an example of a failed test execution due to an unexpected flag set for a node in HSM:
Running tavern tests...
============================= test session starts ==============================
platform linux -- Python 3.9.16, pytest-7.1.2, pluggy-1.0.0 -- /usr/bin/python3
cachedir: .pytest_cache
rootdir: /src/app, configfile: pytest.ini
plugins: allure-pytest-2.12.0, tavern-1.23.1
collecting ... collected 37 items
...
test_components.tavern.yaml::Ensure that we can conduct a query for all Nodes in the Component collection FAILED [ 21%]
...
Errors:
E tavern.util.exceptions.TestFailError: Test 'Verify the expected response fields for all Nodes' failed:
- Error calling validate function '<function validate_pykwalify at 0x7f26a6e13820>':
Traceback (most recent call last):
File "/usr/lib/python3.9/site-packages/tavern/schemas/files.py", line 106, in verify_generic
verifier.validate()
File "/usr/lib/python3.9/site-packages/pykwalify/core.py", line 194, in validate
raise SchemaError(u"Schema validation failed:\n - {error_msg}.".format(
pykwalify.errors.SchemaError: <SchemaError: error code 2: Schema validation failed:
- Enum 'Warning' does not exist. Path: '/Components/7/Flag' Enum: ['OK'].: Path: '/'>
...
=========================== short test summary info ============================
FAILED api/1-non-disruptive/test_components.tavern.yaml::Ensure that we can conduct a query for all Nodes in the Component collection
======================== 1 failed, 36 passed in 47.99s =========================
Test failures due to flags other than OK set for nodes in HSM do not prevent CSM installations or upgrades from proceeding. It is safe to postpone the investigation and resolution of these failures until after the CSM installation or upgrade has completed.
These tests also include checks for healthy BMC states in HSM.
The following is an example of a failed test execution due to an unexpected BMC state in HSM:
Running functional tests...
============================= test session starts ==============================
platform linux -- Python 3.9.13, pytest-7.1.2, pluggy-1.0.0 -- /usr/bin/python3
cachedir: .pytest_cache
rootdir: /src/app, configfile: pytest.ini
plugins: tavern-1.23.1
collecting ... collected 38 items
...
test_components.tavern.yaml::Ensure that we can conduct a query for all Node BMCs in the Component collection FAILED [ 26%]
...
Errors:
E tavern.util.exceptions.TestFailError: Test 'Verify the expected response fields for all NodeBMCs' failed:
- Error calling validate function '<function validate_pykwalify at 0x7f22cbaf0700>':
Traceback (most recent call last):
File "/usr/lib/python3.9/site-packages/tavern/schemas/files.py", line 106, in verify_generic
verifier.validate()
File "/usr/lib/python3.9/site-packages/pykwalify/core.py", line 194, in validate
raise SchemaError(u"Schema validation failed:\n - {error_msg}.".format(
pykwalify.errors.SchemaError: <SchemaError: error code 2: Schema validation failed:
- Enum 'Off' does not exist. Path: '/Components/1/State' Enum: ['Ready'].
...
=========================== short test summary info ============================
FAILED test_components.tavern.yaml::Ensure that we can conduct a query for all Node BMCs in the Component collection
=================== 1 failed, 37 passed in 214.09s (0:03:34) ===================
Test failures due to unexpected BMC states in HSM can be safely ignored if there are BMCs in the system that are intentionally powered off, such as during system shutdown and power off testing.
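The schema errors in the two examples above carry three useful facts: the unexpected value, the path of the field within the response, and the values the test allows. The following is a minimal sketch that pulls those out of the traceback; schema_mismatches is a hypothetical name, and the pattern assumes the error wording shown in the examples.

```shell
# Hypothetical helper: read functional test pod logs on stdin and summarize
# each pykwalify enum mismatch as "<path>: got <value>, allowed <values>".
schema_mismatches() {
    sed -nE "s/.*Enum '([^']+)' does not exist\. Path: '([^']+)' Enum: \[([^]]*)\].*/\2: got \1, allowed \3/p"
}
```

For the BMC example above, this would print /Components/1/State: got Off, allowed 'Ready'. The number in the path appears to be the position of the component in the returned Components array, which can help identify the affected component when listing components with the Cray CLI.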
cray-hms-firmware-action-test-functional
This job executes the tests for the Firmware Action Service (FAS).
test_actions.tavern.yaml
These tests require at least one healthy BMC (State=Ready, Flag=OK) in HSM.
The following is an example of a failed test execution due to no healthy BMCs in HSM:
Running functional tests...
============================= test session starts ==============================
platform darwin -- Python 3.9.13, pytest-7.1.2, pluggy-1.0.0
rootdir: /Users/schooler/Git/GitHub/hms-firmware-action/test/ct/api/1-non-disruptive, configfile: pytest.ini
plugins: tavern-1.23.3
collected 6 items
...
test_actions.tavern.yaml::Ensure that the BMC firmware can be updated with a FAS action FAILED [ 16%]
...
Errors:
E tavern.util.exceptions.TestFailError: Test 'Ensure that the BMC firmware can be updated with a FAS action' failed:
- Status code was 400, expected 202:
{"type": "about:blank", "detail": "invalid/duplicate xnames: [None]", "status": 400, "title": "Bad Request"}
...
=========================== short test summary info ============================
FAILED test_actions.tavern.yaml::Ensure that the BMC firmware can be updated with a FAS action
=================== 1 failed, 5 passed in 21.22s ===============================
Test failures due to no healthy BMCs in HSM can be safely ignored if the BMCs in the system are intentionally powered off, such as during system shutdown and power off testing.
hsm_discovery_status_test.sh
This test verifies that the system hardware has been discovered successfully.
The following is an example of a failed test execution:
Running hsm_discovery_status_test...
(22:19:34) Running 'kubectl get secrets admin-client-auth -o jsonpath='{.data.client-secret}''...
(22:19:34) Running 'curl -k -i -s -S -d grant_type=client_credentials -d client_id=admin-client -d client_secret=<REDACTED> https://api-gw-service-nmn.local/keycloak/realms/shasta/protocol/openid-connect/token'...
(22:19:35) Testing 'curl -s -k -H "Authorization: Bearer <REDACTED>" https://api-gw-service-nmn.local/apis/smd/hsm/v2/Inventory/RedfishEndpoints'...
(22:19:35) Processing response with: 'jq '.RedfishEndpoints[] | { ID: .ID, LastDiscoveryStatus: .DiscoveryInfo.LastDiscoveryStatus}' -c | sort -V | jq -c'...
(19:06:02) Verifying endpoint discovery statuses...
{"ID":"x3000c0s1b0","LastDiscoveryStatus":"HTTPsGetFailed"}
{"ID":"x3000c0s9b0","LastDiscoveryStatus":"ChildVerificationFailed"}
{"ID":"x3000c0s19b999","LastDiscoveryStatus":"HTTPsGetFailed"}
{"ID":"x3000c0s27b0","LastDiscoveryStatus":"ChildVerificationFailed"}
FAIL: hsm_discovery_status_test found 4 endpoints that failed discovery, maximum allowable is 1
'/opt/cray/csm/scripts/hms_verification/hsm_discovery_status_test.sh' exited with status code: 1
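When many endpoints fail discovery, grouping them by status makes it easier to work through them one status at a time. The following is a minimal sketch that reads the test output shown above; failed_endpoints_by_status is a hypothetical name, and the pattern assumes the compact one-object-per-line form produced by the test.

```shell
# Hypothetical helper: read hsm_discovery_status_test.sh output on stdin and
# print "<LastDiscoveryStatus> <xname>" for each listed endpoint, sorted so
# that endpoints with the same status are adjacent.
failed_endpoints_by_status() {
    sed -nE 's/^[{]"ID":"([^"]+)","LastDiscoveryStatus":"([^"]+)"[}]$/\2 \1/p' |
        LC_ALL=C sort
}
```

For the example above, the two ChildVerificationFailed endpoints would be listed first, followed by the two HTTPsGetFailed endpoints.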
The expected state of LastDiscoveryStatus is DiscoverOK for all endpoints with the exception of the BMC for ncn-m001, which is not normally connected to the site network and therefore is expected to be HTTPsGetFailed. If the test fails due to two or more endpoints having failed discovery, then perform the following additional steps in order to determine the cause of the failure:
HTTPsGetFailed
(ncn-mw#) Check to see if the failed component name (xname) resolves using the nslookup command.
If not, then the problem may be a DNS issue.
nslookup <xname>
(ncn-mw#) Check to see if the failed component name (xname) responds to the ping command.
If not, then the problem may be a network or hardware issue.
ping -c 1 <xname>
(ncn-mw#) Check to see if the failed component name (xname) responds to a Redfish query.
If not, then the problem may be a credentials issue. Use the password set in the REDS sealed secret.
curl -s -k -u root:<password> https://<xname>/redfish/v1/Managers | jq
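The first two checks above can be chained so that the first failing check names the likely problem area. The following is a minimal sketch of that ordering; triage_xname is a hypothetical name, and the NSLOOKUP and PING override variables exist only in this sketch so that it can be exercised without touching the network. The Redfish query is left as a manual step because it requires the password.

```shell
# Hypothetical helper: run the first two checks for an xname in order and name
# the likely problem area for the first check that fails.
# NSLOOKUP and PING may be overridden (an assumption of this sketch, to allow
# dry runs); they default to the real commands.
triage_xname() {
    if ! ${NSLOOKUP:-nslookup} "$1" >/dev/null 2>&1; then
        echo "$1: does not resolve; possible DNS issue"
        return 1
    fi
    if ! ${PING:-ping} -c 1 "$1" >/dev/null 2>&1; then
        echo "$1: no ping response; possible network or hardware issue"
        return 2
    fi
    echo "$1: resolves and responds to ping; check the Redfish query and credentials next"
}
```

Calling triage_xname with each xname reported as HTTPsGetFailed gives a one-line result per endpoint.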
If discovery failures for Gigabyte CMCs with component names (xnames) of the form xXc0sSb999 occur, then verify that the root service account is configured for the CMC and add it if needed. See Add Root Service Account for Gigabyte Controllers.
If discovery failures for HPE PDUs with component names (xnames) of the form xXmM occur, this may indicate that configuration steps have not yet been executed which are required for the PDUs to be discovered. Refer to HPE PDU Administrative Procedures for additional configuration for this type of PDU. The steps to run will depend on if the PDU has been set up yet, and whether or not an upgrade or fresh install of CSM is being performed.
ChildVerificationFailed
Check the SMD logs to determine the cause of the bad Redfish path encountered during discovery.
(ncn-mw#) Get the SMD pod names.
kubectl -n services get pods -l app.kubernetes.io/name=cray-smd
Example output:
NAME READY STATUS RESTARTS AGE
cray-smd-5b9d574756-9b2lj 2/2 Running 0 24d
cray-smd-5b9d574756-bnztf 2/2 Running 0 24d
cray-smd-5b9d574756-hhc5p 2/2 Running 0 24d
(ncn-mw#) Get the logs from each of the SMD pods.
kubectl -n services logs <cray-smd-pod1> cray-smd > smd_pod1_logs
kubectl -n services logs <cray-smd-pod2> cray-smd > smd_pod2_logs
kubectl -n services logs <cray-smd-pod3> cray-smd > smd_pod3_logs
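Rather than substituting each pod name by hand, the log commands can be generated from the pod listing. The following is a minimal sketch; smd_log_commands is a hypothetical name, and it writes each log to a file named after the pod instead of the smd_pod1_logs style names used above.

```shell
# Hypothetical helper: read "kubectl get pods" table output on stdin and print
# one log-collection command per pod, skipping the header row.
smd_log_commands() {
    awk 'NR > 1 && NF > 0 { printf "kubectl -n services logs %s cray-smd > %s.log\n", $1, $1 }'
}
```

Piping the output of the pod listing command above into smd_log_commands prints the commands for review; they can then be run by piping that output to sh.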
DiscoveryStarted
The endpoint is in the process of being inventoried by Hardware State Manager (HSM). Wait for the current discovery operation to finish which should result in a new LastDiscoveryStatus state being set for the endpoint.
(ncn-mw#) Use the following command to check the current discovery status of the endpoint:
cray hsm inventory redfishEndpoints describe <xname>
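Waiting for discovery to finish can be done with a simple polling loop. The following is a minimal sketch; wait_for_discovery is a hypothetical name, and the retry count and delay defaults are arbitrary choices for illustration. The status command is passed in as a string and must print only the LastDiscoveryStatus value.

```shell
# Hypothetical helper: poll a status command until it reports something other
# than DiscoveryStarted.
# Usage: wait_for_discovery "STATUS_COMMAND" [TRIES] [DELAY_SECONDS]
# Note: the variables tries and status are not local to the function.
wait_for_discovery() {
    tries="${2:-10}"
    while [ "$tries" -gt 0 ]; do
        status="$(sh -c "$1")"
        if [ "$status" != "DiscoveryStarted" ]; then
            echo "$status"
            return 0
        fi
        tries=$((tries - 1))
        # Sleep only if another attempt remains.
        if [ "$tries" -gt 0 ]; then
            sleep "${3:-30}"
        fi
    done
    echo "DiscoveryStarted"
    return 1
}
```

One possible status command, assuming the describe output in JSON form contains the same DiscoveryInfo.LastDiscoveryStatus field as the API response shown earlier, is the describe command above with --format json added, piped to jq -r .DiscoveryInfo.LastDiscoveryStatus.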
The HMS health checks include tests for multiple types of system components, some of which are critical for the installation of the system, while others are not.
The following types of HMS test failures should be considered blocking for system installations:
The following types of HMS test failures should not be considered blocking for system installations:
It is safe to postpone the investigation and resolution of non-blocking failures until after the CSM installation or upgrade has completed.