BOS Sessions Stuck Pending

Summary

If a Boot Orchestration Service (BOS) session is created from a session template that indirectly refers to invalid xnames, the BOS session-setup operator can be prevented from moving any sessions out of the pending state.

Symptoms

The primary symptom is that new BOS sessions remain in the pending state and never progress. The logs of the cray-bos-operator-session-setup pod show errors that repeat every time the operator tries to process sessions.

Details

This can only happen when a session template includes node groups in a boot set. Specifically, the problem occurs if a boot set's node_groups field names a Hardware State Manager (HSM) group that contains xnames which do not exist as BOS components.

Remediation

The solution is to delete the pending BOS sessions that are using these session templates, or to correct the session templates (or corresponding HSM groups).

Once this has been done for all such sessions, the problem is resolved and BOS sessions proceed as normal.

(ncn-mw#) Follow this procedure to identify and delete these sessions.

  1. List all pending BOS sessions.

    cray bos v2 sessions list --status pending --format json
    
  2. For each listed session, describe its corresponding session template.

    cray bos v2 sessiontemplates describe <template_name> --format json
    
  3. If the session template contains any boot sets with node_groups fields, list the members of the corresponding groups.

    cray hsm groups members list <group_label> --format json
    
  4. For each xname listed, verify that the corresponding BOS component exists.

    cray bos v2 components describe <xname>
    
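  5. If any xname does not exist as a BOS component, then either delete the pending session, or correct the session template (or the corresponding HSM group), as described at the start of this section.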
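
For large groups, checking each xname by hand as in step 4 is tedious. The following is a minimal sketch of that check as a shell loop, not a supported tool. The function name `find_missing_bos_components` is hypothetical, and the sketch assumes that `cray bos v2 components describe` exits with a non-zero status when the component does not exist.

```shell
# Hypothetical helper: print every xname that has no BOS component.
# Assumption: the cray CLI exits non-zero when the component lookup fails.
find_missing_bos_components() {
    for xname in "$@"; do
        # Discard the command output; only the exit status matters here.
        if ! cray bos v2 components describe "$xname" >/dev/null 2>&1; then
            echo "$xname"
        fi
    done
}

# Example call, passing the xnames returned by step 3 as arguments:
#   find_missing_bos_components <xname1> <xname2> <xname3>
```

Any xname that the function prints is a candidate for the cause of the stuck sessions. A failed lookup can also result from an authentication or connectivity problem, so confirm one of the reported xnames manually before deleting sessions or changing groups.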

Fix

This problem is fixed in CSM 1.7, which modifies BOS to ignore any invalid xnames. The fix is not backported to earlier CSM versions. On versions prior to CSM 1.7, use the Remediation procedure above if the issue is encountered.