Troubleshoot issues where Ansible hangs.
A CFS session or pod is failing to complete, and the Ansible logs are not showing progress or completion.
Hung sessions are usually a result of filesystem issues, such as problems with DVS, on the nodes that CFS is attempting to configure.
An issue on even one of the nodes that a session is attempting to configure can cause the whole session to hang (unless Ansible is specifically configured to use a free rather than linear strategy).
(ncn-mw#) Find all the nodes that the Ansible session is targeting.
In most cases this can be done by looking at the limit parameter for the session.
When starting with a pod name, run:
kubectl -n services describe pod $POD_NAME | grep -A 5 ANSIBLE_ARGS
When starting with a session name, run:
cray cfs v3 sessions describe "$SESSION_NAME" --format json | jq .ansible.limit
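The limit value is returned as a single string. As a minimal sketch, assuming the limit is a plain comma-separated list of node names (not an Ansible pattern with exclusions or group names), the hypothetical helper `split_limit` below turns it into one node per line so that each node can be checked in turn. The node names in the example are made up.

```shell
# Hypothetical helper: turn a CFS limit string into one node per line.
# Assumes a plain comma-separated list; surrounding JSON quotes are removed
# in case the value was printed by jq without the -r option.
split_limit() {
    printf '%s\n' "$1" | tr -d '"' | tr ',' '\n' | sed '/^$/d'
}

# Example with a made-up limit value:
split_limit "x3000c0s19b1n0,x3000c0s19b2n0"

# On a real system the value could be taken from the session, for example:
#   split_limit "$(cray cfs v3 sessions describe "$SESSION_NAME" --format json | jq -r .ansible.limit)"
```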
Check each node.
For each node found in the previous step, SSH to the node and run some commands to test the node. If possible and safe to do so, determine what command Ansible was attempting to run and replicate that command as closely as possible. Depending on the exact problem, the behavior at this step may vary, but the goal is to find a node where a command hangs.
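Because the goal is to find a command that never returns, it can help to wrap each test in a time limit so that a hang is reported instead of blocking the terminal. The following is a sketch only: `probe` is a hypothetical helper, it relies on the GNU `timeout` utility being available, and the time limits and test commands shown are placeholders.

```shell
# Hypothetical helper: run a command with a time limit and report whether it
# finished, failed, or hung. GNU timeout exits with status 124 when the time
# limit is reached.
probe() {
    local limit="$1"
    shift
    timeout "$limit" "$@" >/dev/null 2>&1
    case $? in
        0)   echo "ok" ;;
        124) echo "hung" ;;
        *)   echo "failed" ;;
    esac
}

# Local demonstration: a command that outlasts its limit is reported as hung.
probe 1 sleep 3
probe 5 true

# Against a node, the test command would be run over SSH, for example:
#   probe 30 ssh "$NODE" ls /path/to/check
# where /path/to/check stands in for whatever Ansible was trying to access.
```

A result of `hung` for a node points to that node as the likely cause of the stalled session; `failed` only means the command returned a non-zero status, which includes SSH connection errors.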
If no filesystem issues are found and CFS is generating a large amount of Ansible output, try reducing the amount of output. The steps in this procedure are independent of each other and are used to troubleshoot different underlying problems that present a similar symptom.
Any of the following steps can be taken to help reduce the output generated by Ansible:
Reduce the verbosity when running Ansible commands.
If the session was created with a high --ansible-verbosity value (three or higher), Ansible can generate a lot of output that can cause the pod to hang. Reducing the verbosity by one or more may resolve this issue.
Update the Ansible configuration to produce less output.
See Enable Ansible Profiling for an example of modifying the configuration.
Adjust the flags used when running Ansible commands.
The display_ok_hosts and display_skipped_hosts settings are examples of settings that can be disabled to reduce output.
See the Ansible documentation for more information on what flags can be used.
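As an illustration of the output-related settings mentioned above, the sketch below prints an `ansible.cfg` fragment that disables both settings for the default callback. The helper name `quiet_ansible_cfg` is hypothetical, and how the fragment is merged into the configuration that CFS uses is not shown here; follow the procedure referenced above for that. The commented command for reading a session's verbosity assumes the field sits next to `limit` in the session's `ansible` section.

```shell
# Hypothetical helper: print an ansible.cfg fragment that suppresses the
# per-host "ok" and "skipped" lines in Ansible's default output.
quiet_ansible_cfg() {
    cat <<'EOF'
[defaults]
display_ok_hosts = no
display_skipped_hosts = no
EOF
}

quiet_ansible_cfg

# To check the verbosity a session was created with (field name is an
# assumption; this needs a live system):
#   cray cfs v3 sessions describe "$SESSION_NAME" --format json | jq .ansible.verbosity
```

With these settings, only changed, failed, and unreachable results are printed per host, which is usually enough to follow progress.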