Troubleshoot CFS Session Failing to Complete

Troubleshoot Configuration Framework Service (CFS) sessions in which Ansible hangs and the session never completes.

Prerequisites

A CFS session or pod is failing to complete, and the Ansible logs are not showing progress or completion.

Check for filesystem issues

Hung sessions are usually the result of filesystem issues, such as problems with DVS, on the nodes that CFS is attempting to configure. With Ansible's default linear strategy, every node must finish a task before any node starts the next one, so an issue on even one targeted node can cause the whole session to hang. The exception is when Ansible is specifically configured to use the free strategy rather than the linear strategy.

  1. (ncn-mw#) Find all the nodes that the Ansible session is targeting.

    In most cases this can be done by looking at the limit parameter for the session.

    • When starting with a pod name, run:

      kubectl -n services describe pod "$POD_NAME" | grep -A 5 ANSIBLE_ARGS
      
    • When starting with a session name, run:

      cray cfs v3 sessions describe "$SESSION_NAME" --format json | jq .ansible.limit
      
  2. Check each node.

    For each node found in the previous step, SSH to the node and run commands that test whether the node responds normally. If it is possible and safe to do so, determine which command Ansible was attempting to run and replicate that command as closely as possible. The behavior at this step may vary depending on the exact problem, but the goal is to find a node where a command hangs.
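The two steps above can be sketched as a pair of shell functions. This is a minimal sketch, not part of CFS: the function names `split_limit` and `probe_nodes`, the 10-second connect timeout, and the probe command are all illustrative assumptions. It also assumes the limit is a plain comma-separated list of node names and that GNU `timeout` is available, which exits with code 124 when the time limit is reached.

```shell
#!/usr/bin/env bash
# Sketch only: function names, the probe command, and the time limit are
# illustrative assumptions, not part of CFS.

# Split an Ansible limit string into one node name per line.
# Assumes a plain comma-separated list of node names; limits that use
# patterns (for example "!" exclusions or group names) need manual review.
split_limit() {
    printf '%s\n' "$1" | tr ',' '\n' |
        sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' -e '/^$/d'
}

# Run one probe command on each node with a hard time limit.
# Usage: probe_nodes SECONDS 'COMMAND' NODE...
probe_nodes() {
    local limit="$1" probe="$2" node rc
    shift 2
    for node in "$@"; do
        # -n keeps ssh from reading stdin; BatchMode avoids password prompts.
        timeout "$limit" ssh -n -o BatchMode=yes -o ConnectTimeout=10 \
            "$node" "$probe" >/dev/null 2>&1
        rc=$?
        case "$rc" in
            0)   echo "$node: ok" ;;
            124) echo "$node: HUNG (no result within ${limit}s)" ;;
            *)   echo "$node: FAILED (exit code $rc)" ;;
        esac
    done
}

# Example (not run here); /path/to/check is a placeholder:
#   limit=$(cray cfs v3 sessions describe "$SESSION_NAME" --format json \
#       | jq -r '.ansible.limit // empty')
#   probe_nodes 30 'ls /path/to/check' $(split_limit "$limit")
```

A node reported as HUNG is the likely candidate for closer inspection. A FAILED result only means that the probe or the SSH connection returned an error, which may or may not be related to the hang.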

Reduce Ansible output

If no filesystem issues are found and CFS is generating a large amount of Ansible output, try reducing the amount of output in one of the following ways. The options are independent of each other, and any one of them can be applied on its own.

  • Reduce the verbosity when running Ansible commands.

    If the session was created with a high value of --ansible-verbosity (three or higher), Ansible can generate enough output to cause the pod to hang. Reducing the verbosity by one or more levels may resolve this issue.

  • Update the Ansible configuration to produce less output.

    See Enable Ansible Profiling for an example of modifying the configuration.

  • Adjust the settings used when running Ansible commands.

    display_ok_hosts and display_skipped_hosts are examples of settings that can be disabled to reduce output. See the Ansible documentation for more information on which settings are available.
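As a rough illustration of the options above, the sketch below prints an `ansible.cfg` fragment that disables the two display settings and shows, as a comment, a session created with lower verbosity. The helper name `print_quiet_defaults` is hypothetical. The `--configuration-name` flag and the `$CONFIG_NAME` variable in the commented command are assumptions that should be checked against the installed `cray` CLI. How the fragment is applied to the Ansible configuration used by CFS is not shown here.

```shell
#!/usr/bin/env bash
# Sketch only: print_quiet_defaults is a hypothetical helper name.

# Print ansible.cfg settings that stop the default callback from printing
# a line for every "ok" host and every "skipped" host.
print_quiet_defaults() {
    cat <<'EOF'
[defaults]
display_ok_hosts = no
display_skipped_hosts = no
EOF
}

# Example (not run here): create a session with a lower verbosity.
# The --configuration-name flag and both variables are assumptions;
# --ansible-verbosity is the option described above.
#   cray cfs v3 sessions create --name "$SESSION_NAME" \
#       --configuration-name "$CONFIG_NAME" --ansible-verbosity 1
```

If the target `ansible.cfg` already has a `[defaults]` section, add the two keys to that section instead of creating a second one.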