If a Boot Orchestration Service (BOS) session is created using a session template that indirectly refers to invalid xnames, this can prevent the BOS session-setup operator from moving any sessions out of the pending state.
The primary symptom is that new BOS sessions remain in the pending state and never progress.
The cray-bos-operator-session-setup pod also logs errors repeatedly, once each time it tries to process sessions.
This can only happen when a session template includes node groups in a boot set. Specifically, the problem occurs if the session template specifies a Hardware State Manager (HSM) component group that contains xnames that do not exist as BOS components.
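The condition described above can be sketched as a small check. This is an illustration, not BOS code: the function name `invalid_xnames` and the sample data are hypothetical, and the template layout (a `boot_sets` mapping whose entries may carry a `node_groups` list) is an assumption about the JSON form of a session template.

```python
def invalid_xnames(template, group_members, bos_components):
    """Return {group_label: [xnames with no BOS component]} for one template."""
    problems = {}
    # Assumed layout: "boot_sets" maps boot set names to dicts,
    # each of which may contain a "node_groups" list of HSM group labels.
    for boot_set in template.get("boot_sets", {}).values():
        for label in boot_set.get("node_groups", []):
            missing = sorted(
                xname
                for xname in group_members.get(label, [])
                if xname not in bos_components
            )
            if missing:
                problems[label] = missing
    return problems


# Hypothetical sample data: group "blue" contains one xname unknown to BOS.
template = {
    "boot_sets": {
        "compute": {"node_groups": ["blue"]},
        "uan": {"node_list": ["x3000c0s19b0n0"]},
    }
}
group_members = {"blue": ["x3000c0s1b0n0", "x9999c0s0b0n0"]}
bos_components = {"x3000c0s1b0n0", "x3000c0s19b0n0"}

print(invalid_xnames(template, group_members, bos_components))
# prints {'blue': ['x9999c0s0b0n0']}
```

A template that yields a non-empty result here is the kind that can leave sessions stuck in the pending state.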
The solution is to delete the pending BOS sessions that use these session templates, or to correct the session templates (or the corresponding HSM groups).
Once this has been done for all such sessions, the problem is resolved and BOS sessions proceed as normal.
(ncn-mw#) Follow this procedure to identify and delete these sessions.
List all pending BOS sessions.
cray bos v2 sessions list --status pending --format json
For each listed session, describe its corresponding session template.
cray bos v2 sessiontemplates describe <template_name> --format json
If the session template contains any boot sets with node_groups fields, list the members of the corresponding groups.
cray hsm groups members list <group_label> --format json
For each xname listed, verify that the corresponding BOS component exists.
cray bos v2 components describe <xname>
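The lookup steps above could be chained in a script along the following lines. This is a rough sketch under several assumptions: that each session record has `name` and `template_name` fields, that the HSM members listing returns its xnames under an `ids` key, that `--format json` is accepted by the components describe command as it is by the others, and that a non-zero exit code from that command means the component does not exist. The helper names `cray_json` and `sessions_with_missing_components` are hypothetical.

```python
import json
import subprocess


def cray_json(args, run=subprocess.run):
    """Run a cray CLI command; return parsed JSON output, or None on failure."""
    result = run(["cray", *args, "--format", "json"], capture_output=True, text=True)
    if result.returncode != 0:
        return None
    return json.loads(result.stdout)


def sessions_with_missing_components(run=subprocess.run):
    """Map each pending session name to the xnames that lack a BOS component."""
    report = {}
    sessions = cray_json(["bos", "v2", "sessions", "list", "--status", "pending"], run) or []
    for session in sessions:
        # Assumed fields on each session record: "name" and "template_name".
        template = cray_json(
            ["bos", "v2", "sessiontemplates", "describe", session["template_name"]], run
        ) or {}
        missing = set()
        for boot_set in template.get("boot_sets", {}).values():
            for label in boot_set.get("node_groups", []):
                # Assumed shape of the members listing: {"ids": [...]}.
                members = cray_json(["hsm", "groups", "members", "list", label], run) or {}
                for xname in members.get("ids", []):
                    # A failed describe is treated as "component does not exist".
                    if cray_json(["bos", "v2", "components", "describe", xname], run) is None:
                        missing.add(xname)
        if missing:
            report[session["name"]] = sorted(missing)
    return report
```

Calling `sessions_with_missing_components()` on a node where the `cray` CLI is authenticated would return a dictionary of suspect sessions. The `run` parameter exists only so the logic can be exercised with a stand-in instead of the real CLI.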
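If any xname does not exist as a BOS component, then do one of the following: delete the pending BOS session, correct the session template so that it no longer refers to the affected group, or correct the HSM group so that it contains only xnames that exist as BOS components.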
This problem is fixed in CSM 1.7 by modifying BOS to ignore any invalid xnames. The fix is not backported to earlier CSM versions. Prior to CSM 1.7, the remediation above must be used if the issue is encountered.