View the Kubernetes logs for a Configuration Framework Service (CFS) pod in an error state to determine whether the error resulted from the CFS infrastructure or from an Ansible play that was run by a specific configuration layer in a CFS session.
Use this procedure to obtain important triage information for Ansible plays being called by CFS.
Find the CFS pod that is in an error state.
List all CFS pods in an error state.
ncn-mw# kubectl get pods -n services | grep -E "^cfs-.*[[:space:]]Error[[:space:]]"
Example output:
cfs-e8e48c2a-448f-4e6b-86fa-dae534b1702e-pnxmn 0/3 Error 0 25h
Set CFS_POD_NAME to the name of the pod to be investigated.
Use the pod name identified in the previous substep.
ncn-mw# CFS_POD_NAME=cfs-e8e48c2a-448f-4e6b-86fa-dae534b1702e-pnxmn
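The two substeps above can also be combined. The following is a minimal sketch, not part of CFS or kubectl: the function name cfs_error_pods is hypothetical, and it assumes the default kubectl table output in which NAME is the first column and STATUS is the third.

```shell
# Hypothetical helper: print the names of CFS pods whose STATUS is Error.
# Reads the output of `kubectl get pods -n services` on standard input.
cfs_error_pods() {
    # Assumes the default table layout: column 1 is NAME, column 3 is STATUS.
    awk '$1 ~ /^cfs-/ && $3 == "Error" { print $1 }'
}

# Possible usage on a live system (takes the first match only):
#   CFS_POD_NAME=$(kubectl get pods -n services | cfs_error_pods | head -n 1)
#   echo "${CFS_POD_NAME}"
```

If more than one pod is listed, choose the pod to investigate by hand rather than relying on the first match.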
Check to see what containers are in the pod. Running kubectl logs without a container name returns an error message that lists them.
ncn-mw# kubectl logs -n services "${CFS_POD_NAME}"
Example output:
Error from server (BadRequest): a container name must be specified for pod cfs-e8e48c2a-448f-4e6b-86fa-dae534b1702e-pnxmn, choose one of: [inventory ansible-0 istio-proxy] or one of the init containers: [git-clone-0 istio-init]
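As an alternative to reading the names out of the error message, the container names can be taken from the pod specification. The following is a minimal sketch that assumes the jsonpath output option of kubectl get; the function name cfs_pod_containers is hypothetical.

```shell
# Hypothetical helper: print the init containers, then the regular
# containers, of a pod in the services namespace, one name per line.
cfs_pod_containers() {
    for field in initContainers containers; do
        # jsonpath prints the names separated by spaces, without a newline.
        kubectl get pod -n services "$1" -o jsonpath="{.spec.${field}[*].name}"
        echo
    done | tr ' ' '\n' | sed '/^$/d'
}

# Possible usage:
#   cfs_pod_containers "${CFS_POD_NAME}"
```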
Issues rarely occur in the istio-init and istio-proxy containers. These containers can be ignored for now.
Check the git-clone-0, inventory, and ansible-0 containers, in that order.
If there are additional Ansible containers in the pod, examine those as well, in ascending order.
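The checking order described above can be expressed as a small filter. This is a sketch only: the function name cfs_triage_order is hypothetical, and it assumes that container names follow the git-clone-N, inventory, and ansible-N pattern seen in the example output. It drops the Istio containers and sorts the numbered containers in ascending numeric order.

```shell
# Hypothetical helper: reorder container names (one per line on stdin)
# into the suggested triage order.
cfs_triage_order() {
    awk '
        /^istio-/         { next }   # ignored for now
        /^git-clone-/     { n = $0; sub(/^git-clone-/, "", n); print 1, n + 0, $0; next }
        $0 == "inventory" { print 2, 0, $0; next }
        /^ansible-/       { n = $0; sub(/^ansible-/, "", n); print 3, n + 0, $0; next }
                          { print 4, 0, $0 }   # unrecognized names go last
    ' | sort -k1,1n -k2,2n | awk '{ print $3 }'
}

# Possible usage: print the logs of each container in triage order.
#   for c in $(printf '%s\n' git-clone-0 inventory ansible-0 | cfs_triage_order); do
#       kubectl logs -n services "${CFS_POD_NAME}" "$c"
#   done
```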
Check the git-clone-0 container.
ncn-mw# kubectl logs -n services "${CFS_POD_NAME}" git-clone-0
Check the inventory container.
ncn-mw# kubectl logs -n services "${CFS_POD_NAME}" inventory
Example output:
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0curl: (7) Failed to connect to localhost port 15000: Connection refused
Waiting for Sidecar
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
HTTP/1.1 200 OK
content-type: text/html; charset=UTF-8
cache-control: no-cache, max-age=0
x-content-type-options: nosniff
date: Thu, 05 Dec 2019 15:00:11 GMT
server: envoy
transfer-encoding: chunked
Sidecar available
2019-12-05 15:00:12,160 - INFO - cray.cfs.inventory - Starting CFS Inventory version=0.4.3, namespace=services
2019-12-05 15:00:12,171 - INFO - cray.cfs.inventory - Inventory target=dynamic for cfsession=boa-2878e4c0-39c2-4df0-989e-053bb1edee0c
2019-12-05 15:00:12,227 - INFO - cray.cfs.inventory.dynamic - Dynamic inventory found a total of 2 groups
2019-12-05 15:00:12,227 - INFO - cray.cfs.inventory - Writing out the inventory to /inventory/hosts
Check the ansible-0 container.
Look towards the end of the Ansible log in the PLAY RECAP section to see if any targets failed.
If a target failed, then look above in the log at the immediately preceding play.
In the example below, the ncmp_hsn_cns role has an issue when being run against the compute nodes.
ncn-mw# kubectl logs -n services "${CFS_POD_NAME}" ansible-0
Example output:
Waiting for Inventory
Waiting for Inventory
Inventory available
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
[...]
TASK [ncmp_hsn_cns : SLES Compute Nodes (HSN): Create/Update ifcfg-hsnx File(s)] ***
fatal: [x3000c0s19b1n0]: FAILED! => {"msg": "'interfaces' is undefined"}
fatal: [x3000c0s19b2n0]: FAILED! => {"msg": "'interfaces' is undefined"}
fatal: [x3000c0s19b3n0]: FAILED! => {"msg": "'interfaces' is undefined"}
fatal: [x3000c0s19b4n0]: FAILED! => {"msg": "'interfaces' is undefined"}
NO MORE HOSTS LEFT *************************************************************
PLAY RECAP *********************************************************************
x3000c0s19b1n0 : ok=28 changed=20 unreachable=0 failed=1 skipped=77 rescued=0 ignored=1
x3000c0s19b2n0 : ok=27 changed=19 unreachable=0 failed=1 skipped=63 rescued=0 ignored=1
x3000c0s19b3n0 : ok=27 changed=19 unreachable=0 failed=1 skipped=63 rescued=0 ignored=1
x3000c0s19b4n0 : ok=27 changed=19 unreachable=0 failed=1 skipped=63 rescued=0 ignored=1
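For long logs, the same check can be scripted. The following is a minimal sketch with hypothetical function names (ansible_failed_hosts and ansible_first_failed_task). It assumes the log format shown in the example output above: a PLAY RECAP header followed by one line of key=value counters per host, and TASK [...] headers preceding any fatal: lines.

```shell
# Hypothetical helper: print each host whose PLAY RECAP line reports
# failed or unreachable greater than zero. Reads the Ansible log on stdin.
ansible_failed_hosts() {
    awk '
        /^PLAY RECAP/ { in_recap = 1; next }
        in_recap && /failed=[0-9]+/ {
            for (i = 2; i <= NF; i++) {
                split($i, kv, "=")
                if ((kv[1] == "failed" || kv[1] == "unreachable") && kv[2] + 0 > 0) {
                    print $1   # the first field is the host name
                    break
                }
            }
        }
    '
}

# Hypothetical helper: print the last TASK header seen before the first
# "fatal:" line, which names the role and task that failed.
ansible_first_failed_task() {
    awk '
        /^TASK \[/ { task = $0 }
        /^fatal: / { print task; exit }
    '
}

# Possible usage:
#   kubectl logs -n services "${CFS_POD_NAME}" ansible-0 > ansible-0.log
#   ansible_failed_hosts < ansible-0.log
#   ansible_first_failed_task < ansible-0.log
```

Applied to the example output above, the first function would list the four compute nodes and the second would print the ncmp_hsn_cns task header.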
Run the Ansible play again once the underlying issue has been resolved.