BOS Operator Pods OOMKilled

Summary

On large-scale systems with thousands of nodes, some Boot Orchestration Service (BOS) operator Kubernetes pods can be OOMKilled while trying to log particularly large API responses when BOS debug logging is enabled.

On large enough systems, it is possible for this to happen even without debug logging enabled.
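
To confirm that a system is hitting this symptom, the pods' last termination reason can be inspected. The following is a minimal sketch, not an official procedure. It assumes that the BOS pods run in the `services` namespace and that the operator pod names begin with `cray-bos-operator`; adjust both if they differ on the system. The function names are hypothetical helpers introduced for illustration.

```shell
# Print the names of BOS operator pods whose last container termination
# reason was OOMKilled. Reads "NAME REASON" lines on standard input.
# Assumption: BOS operator pod names begin with "cray-bos-operator".
filter_oomkilled() {
    awk '$1 ~ /^cray-bos-operator/ && $2 ~ /OOMKilled/ { print $1 }'
}

# List the pods and pass them through the filter above.
# Assumption: the BOS pods run in the "services" namespace.
list_oomkilled_bos_operators() {
    kubectl get pods -n services --no-headers \
        -o custom-columns='NAME:.metadata.name,REASON:.status.containerStatuses[*].lastState.terminated.reason' \
        | filter_oomkilled
}

# Usage (on a node with kubectl access):
#   list_oomkilled_bos_operators
```

Pods that have never been terminated show `<none>` in the reason column and are filtered out.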

Details

The BOS logging level is one of numerous Options that an administrator may customize.

Workaround

Use the following procedure to work around the problem. The commands are run on a master or worker NCN (ncn-mw#).

Check if debug logging is enabled.

cray bos v2 options list --format json
  • If debug logging is enabled, then the easiest workaround is to set the BOS logging level to INFO or to a less verbose level.

    cray bos v2 options update --logging-level INFO
    
  • If debug logging is not enabled, or if the problem persists after disabling it, then the only other option is to increase the memory limits for the pods experiencing this problem. See Increase Pod Resource Limits.
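
The check and the logging-level change above can be combined into one step. The following is a sketch under stated assumptions: it assumes the JSON output of the options list command contains a string field named `logging_level`, and it parses that field with `grep` and `sed` rather than a JSON tool. The function names are hypothetical helpers; only the two `cray bos v2 options` commands come from the procedure above.

```shell
# Extract the logging level from BOS options JSON read on standard input.
# Assumption: the JSON contains a string field named "logging_level".
bos_logging_level() {
    grep -o '"logging_level"[[:space:]]*:[[:space:]]*"[^"]*"' \
        | sed 's/.*"\([^"]*\)"$/\1/'
}

# Set the BOS logging level to INFO only if it is currently DEBUG.
# The body runs in a subshell so that "level" does not leak into the caller.
disable_bos_debug_logging() (
    level=$(cray bos v2 options list --format json \
        | bos_logging_level | tr '[:lower:]' '[:upper:]')
    case "$level" in
        DEBUG)
            cray bos v2 options update --logging-level INFO
            ;;
        *)
            echo "BOS logging level is '${level}'; no change made."
            ;;
    esac
)

# Usage:
#   disable_bos_debug_logging
```

If the reported level is already INFO or less verbose and pods are still being OOMKilled, the remaining option is to increase the pod memory limits, as described above.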

Fix

This problem is fixed in CSM 1.7 by changing how BOS performs its debug logging. The fix is not backported to earlier CSM versions, so on versions prior to CSM 1.7 the workaround above must be used if the issue is encountered.