The first step in debugging compute node boot-related issues is to determine the underlying cause and the stage that the issue was encountered at.
BOS v2 provides rich, per-component records for underlying actions that have been applied as part of BOS session provisioning. Often, it is helpful to observe the set of operations that BOS has enacted on behalf of a session as they apply to a single failing node. The records of operations that have been applied for a node, as well as the intended next steps, can be viewed through the BOS v2 component information for the affected hardware.
(ncn-mw#
) Verify the status of a BOS component.
cray bos v2 components describe x3000c0s1b0n0
enabled = false
error = ""
id = "x3000c0s1b0n0"
session = ""
[snip]
This command coupled with the Linux watch
command are an often used way to get continued updates on the most recent
actions applied to the node.
If a node has been booted with BOS as part of a boot or reboot operation, and the node was powered on, but has not
begun configuring, the node may be stuck in early initialization (Failure to iPXE
chain, network setup issues, failure to
obtain a root filesystem, or other dracut module specific issues). In this case, it is best to connect to the node’s
console logs to obtain specific information about the failed node. To learn more about ConMan, refer to
ConMan. A node’s console data can be accessed through its log file, as described in
Access Compute Node Logs). This information can also be accessed by connecting
to the node’s console with ipmitool
. Refer to online documentation to learn more about using ipmitool
.
If the node has booted into a multi-user target phase, but BOS has not completed booting the node, the node may have
encountered a configuration error. A similar set of records for configuration for a given node can be obtained from
CFS endpoint for the same hardware component. In this case, BOS will indicate the
component status is configuring
, and further querying information from CFS for the same component may be in order.
(ncn-mw#
) Verify the configuration status of a CFS component of the same name.
cray cfs v3 components describe x3000c0s1b0n0 --state-details true --format toml
Example output:
configuration_status = "configured"
desired_config = "management-1.4"
enabled = false
error_count = 0
id = "x3000c0s1b0n0"
logs = "ara.cmn.site/hosts?name=x3000c0s1b0n0"
[[state]]
clone_url = "https://api-gw-service-nmn.local/vcs/cray/csm-config-management.git"
commit = "ae77176a946cc06aabde32e53815dc4dea8039dd"
last_updated = "2023-03-02T13:58:05Z"
playbook = "site.yml"
session_name = "batcher-2df030b8-1bc5-4afb-ac29-df93815473f2"
Here, session_name
corresponds to the CFS session that is acting on the CFS component(x3000c0s1b0n0
), and not the BOS
session.