This document describes how to troubleshoot CSM validation test failures due to no discovered compute nodes in HSM.
(ncn-mw#) Confirm that there are no discovered compute nodes in HSM.
cray hsm state components list --type Node --role compute --format json
Example output:
{
"Components": []
}
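The check above can also be scripted by counting the entries in the returned JSON. The following is a minimal sketch, assuming that jq is installed on the node; count_components is a hypothetical helper name, not part of CSM.

```shell
#!/usr/bin/env bash
# Sketch only: assumes jq is available; count_components is a hypothetical helper.

# Read HSM component JSON on stdin and print the number of entries in the
# top-level "Components" array (prints 0 if the key is missing).
count_components() {
    jq '(.Components // []) | length'
}

# Example usage on a master or worker NCN (not executed here):
#   cray hsm state components list --type Node --role compute --format json | count_components
```

A result of 0 corresponds to the empty Components list shown in the example output.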
There are several reasons why there may be no discovered compute nodes in HSM.
The following situations do not warrant additional troubleshooting, and related test failures can be safely ignored:
If none of the above cases are applicable, then the test failures warrant additional troubleshooting:
(ncn-mw#) Run the hsm_discovery_status_test.sh script.
/opt/cray/csm/scripts/hms_verification/hsm_discovery_status_test.sh
If the script fails, this indicates a discovery issue, and the script prints further troubleshooting steps to take.
Otherwise, missing compute nodes in HSM with no discovery failures may indicate a problem with a leaf-bmc switch.
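The decision described above can be sketched as a small wrapper. This is an illustration only: it assumes that the script signals failure through a non-zero exit status, and run_discovery_check is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch only: assumes the test script exits non-zero on failure.
# run_discovery_check is a hypothetical helper, not part of CSM.

run_discovery_check() {
    # Path to the check; defaults to the script named in this document.
    check="${1:-/opt/cray/csm/scripts/hms_verification/hsm_discovery_status_test.sh}"
    if "$check"; then
        echo "No discovery failures reported; check the leaf-bmc switch next."
    else
        echo "Discovery issue reported; follow the steps printed by the script."
        return 1
    fi
}

# Example usage on a master or worker NCN (not executed here):
#   run_discovery_check
```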
(ncn-mw#) Check to see if the leaf-bmc switch resolves using the nslookup command.
nslookup <leaf-bmc-switch>
Example output:
Server: 10.92.100.225
Address: 10.92.100.225#53
Name: sw-leaf-bmc-001.nmn
Address: 10.252.0.4
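When checking several switches, it can help to pull only the resolved address out of the nslookup output. The following is a rough sketch that assumes output shaped like the example above (a Name line followed by an Address line); resolved_address is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch only: parses nslookup-style output read from stdin.
# resolved_address is a hypothetical helper, not part of CSM.

# Print the first "Address:" value that follows a "Name:" line.
# Prints nothing if the name did not resolve.
resolved_address() {
    awk '/^Name:/ { found = 1; next }
         found && /^Address:/ { print $2; exit }'
}

# Example usage (not executed here):
#   nslookup <leaf-bmc-switch> | resolved_address
```

Empty output from the helper suggests that the switch name did not resolve.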
(ncn-mw#) Verify connectivity to the leaf-bmc switch.
ssh admin@<leaf-bmc-switch>
Example output if the switch is unreachable:
ssh: connect to host sw-leaf-bmc-001 port 22: Connection timed out
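To tell an unreachable switch apart from a login problem without waiting for the default SSH timeout, the TCP port can be probed directly. This is a minimal sketch, assuming bash (for its /dev/tcp redirection) and the coreutils timeout command are available; port_open is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch only: assumes bash and coreutils timeout are installed.
# port_open is a hypothetical helper, not part of CSM.

# Return 0 if a TCP connection to HOST:PORT succeeds within WAIT seconds.
# Usage: port_open HOST [PORT] [WAIT]   (defaults: port 22, 5 seconds)
port_open() {
    probe_host="$1"
    probe_port="${2:-22}"
    probe_wait="${3:-5}"
    # Host and port are passed as positional arguments ($0 and $1) so they
    # are never interpolated into the command string itself.
    timeout "$probe_wait" bash -c 'exec 3<>"/dev/tcp/$0/$1"' \
        "$probe_host" "$probe_port" 2>/dev/null
}

# Example usage (not executed here):
#   port_open <leaf-bmc-switch> 22 5 && echo "SSH port reachable" || echo "SSH port unreachable"
```

A failure here only shows that port 22 could not be reached in time; it does not identify the cause.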
Restoring connectivity, resolving configuration issues, or restarting the relevant ports on the leaf-bmc switch should allow the compute hardware to issue DHCP requests and be discovered successfully.