Troubleshoot Unresponsive kubectl Commands

Use this procedure to check whether any kworkers are in an error state because of a high load. Once the error has been identified, work around the issue by returning the high load to a normal level.

Symptoms

One or more of the following are possible symptoms of this issue.

  • The kubectl command can become unresponsive because of a high load.
  • ps aux cannot return or complete because parts of the /proc file system are locked.

If kubectl is unresponsive on a particular node, then kubectl commands can be run from any other master or worker non-compute node (NCN).

Procedure

In the following procedures, unless otherwise directed, run the commands on the node experiencing the issue. However, if kubectl is unresponsive on that node, run the kubectl commands from any other master or worker NCN.

Identify the kworker issue

  1. (ncn-mw#) Check whether kubectl is unresponsive because of a kworker issue.

    1. List the process identification (PID) numbers of the kworkers in the D state.

      Processes in the D state are blocked on I/O and are not an issue unless they remain blocked indefinitely. Use the following command to list the PIDs currently in the D state.

      ps aux | grep "[k]worker" | grep -e " D" | awk '{ print $2 }'
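
      To look more closely at one of the returned PIDs, its state and kernel wait channel can be displayed directly. This is a quick, less detailed alternative to dumping the full stack in the next step; replace PID with one of the numbers returned above.

      ps -o pid,stat,wchan:32,comm -p PID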
      
    2. Show the stack for all kworkers in the D state.

      Note which kworkers clear and which ones remain stuck in this state over a period of time.

      for i in $(ps aux | grep "[k]worker" | grep -e " D" | awk '{print $2}'); do
          cat "/proc/${i}/stack"; echo
      done
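
      One way to tell which kworkers remain stuck is to repeat the check a few times at a fixed interval and compare the snapshots; kworker PIDs that appear in every snapshot are the ones of interest. A minimal sketch (the 30-second interval and three iterations are arbitrary choices):

      # Print a timestamped snapshot of D-state kworker PIDs and command names three times.
      for n in 1 2 3; do
          date
          ps aux | grep "[k]worker" | grep -e " D" | awk '{print $2, $11}'
          sleep 30
      done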
      
  2. (ncn-mw#) Check the load on the node and gather data for any PIDs consuming a lot of CPU.

    1. Monitor the processes and system resource usage.

      top
      

      Example output (some trailing lines omitted):

      top - 10:12:03 up 34 days, 17:31, 10 users,  load average: 7.39, 9.16, 10.99
      Tasks: 2155 total,   4 running, 2141 sleeping,   1 stopped,   9 zombie
      %Cpu(s):  4.3 us,  2.5 sy,  0.0 ni, 93.0 id,  0.0 wa,  0.0 hi,  0.3 si,  0.0 st
      MiB Mem : 257510.5+total, 69119.86+free, 89578.68+used, 98812.04+buff/cache
      MiB Swap:    0.000 total,    0.000 free,    0.000 used. 173468.1+avail Mem
      
          PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
         6105 root      20   0  193436 182772   2300 S 60.00 0.069  13485:54 lldpad
        49574 root      20   0 14.299g 495212  60896 S 47.54 0.188  31582:58 kubelet
            1 root      20   0  231236  19436   6572 S 38.69 0.007  16904:47 systemd
        43098 root      20   0 16.148g 652640  78748 S 38.69 0.248  18721:18 containerd
        20229 root      20   0   78980  14648   6448 S 35.08 0.006  15421:51 systemd
      1515295 1001      20   0 16.079g 5.439g  96312 S 11.48 2.163  12480:39 java
         4706 message+  20   0   41060   5620   3724 S 8.852 0.002   3352:38 dbus-daemon
      1282935 101       20   0  685476  38556  13748 S 6.557 0.015 262:09.88 patroni
        81539 root      20   0  300276 161372  26036 S 5.902 0.061   4145:40 mixs
        89619 root      20   0 4731796 498600  24144 S 5.902 0.189   2898:54 envoy
        85600 root      20   0 2292564 123596  23248 S 4.590 0.047   2211:58 envoy
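
      In the example above, the load averages over the last 1, 5, and 15 minutes are 7.39, 9.16, and 10.99. Whether that represents a high load depends on how many CPUs the node has; as a rough rule of thumb, a sustained load average well above the CPU count indicates that the node is overloaded. The CPU count and current load can be compared with:

      nproc
      cat /proc/loadavg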
      
    2. Generate a performance counter profile for the PIDs consuming a lot of CPU.

      In the following command, replace the PID value with the actual PID number.

      perf top -g -p PID
      

      Example output (some trailing lines omitted):

      Samples: 18  of event 'cycles', Event count (approx.): 4065227
        Children      Self  Shared Object     Symbol
      +   29.31%     9.77%  [kernel]          [k] load_balance
      +   19.54%    19.54%  [kernel]          [k] find_busiest_group
      +   11.17%    11.17%  kubelet           [.] 0x0000000000038d3c
      +    9.77%     9.77%  [kernel]          [k] select_task_rq_fair
      +    9.77%     9.77%  [kernel]          [k] cpuacct_charge
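
      If an interactive perf session is inconvenient, a similar profile can be captured to a file and reviewed afterward. The following sketch records for 30 seconds (an arbitrary choice) and then prints the report; replace PID as above.

      perf record -g -p PID -- sleep 30
      perf report --stdio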
      
    3. Verify that ps -ef completes.

      ps -ef
      
  3. (ncn-mw#) Check the /var/log/messages file on the node to see if there are any errors.

    grep -i error /var/log/messages
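
    Because /var/log/messages can be very large, it may help to narrow the search to recent error entries from the container runtime and kubelet; one possible filter:

    grep -iE "error|fail" /var/log/messages | grep -E "containerd|kubelet" | tail -n 100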
    

    Example output (some leading and trailing lines omitted):

    <nil>"
    2020-07-19T07:19:34.485659+00:00 ncn-w001 containerd[43098]: time="2020-07-19T07:19:34.485540765Z" level=info msg="Exec process \"9946991ef8108d21c163a04c9085fd15a60e3991b8e9d7b2250a071df9b6cbb8\" exits with exit code 0 and error
    <nil>"
    2020-07-19T07:19:38.468970+00:00 ncn-w001 containerd[43098]: time="2020-07-19T07:19:38.468818388Z" level=info msg="Exec process \"e6fe9ccbb1127a77f8c9db84b339dafe068f9e08579962f790ebf882ee35e071\" exits with exit code 0 and error
    <nil>"
    2020-07-19T07:19:44.440413+00:00 ncn-w001 containerd[43098]: time="2020-07-19T07:19:44.440243465Z" level=info msg="Exec process \"7a3cf826f008c37bd0fe89382561af42afe37ac4d52f37ce9312cc950248f4da\" exits with exit code 0 and error
    <nil>"
    2020-07-19T07:20:02.442421+00:00 ncn-w001 containerd[43098]: time="2020-07-19T07:20:02.442266943Z" level=error msg="StopPodSandbox for \"d449618d075b918fd6397572c79bd758087b31788dd8bf40f4dc10bb1a013a68\" failed" 
        error="failed to destroy network for sandbox \"d449618d075b918fd6397572c79bd758087b31788dd8bf40f4dc10bb1a013a68\": Multus: Err in getting k8s network from pod: getPodNetworkAnnotation: failed to query the pod sma-monasca-agent-xkxnj in out of cluster comm: pods \"sma-monasca-agent-xkxnj\" not found"
    2020-07-19T07:20:04.440834+00:00 ncn-w001 containerd[43098]: time="2020-07-19T07:20:04.440742542Z" level=info msg="Exec process \"2a751ca1453d7888be88ab4010becbb0e75b7419d82e45ca63e55e4155110208\" exits with exit code 0 and error
    <nil>"
    2020-07-19T07:20:06.587325+00:00 ncn-w001 containerd[43098]: time="2020-07-19T07:20:06.587133372Z" level=error msg="collecting metrics for bf1d562e060ba56254f5f5ea4634ef4ae189abb462c875e322c3973b83c4c85d" error="ttrpc: closed: unknown"
    2020-07-19T07:20:14.450624+00:00 ncn-w001 containerd[43098]: time="2020-07-19T07:20:14.450547541Z" level=info msg="Exec process \"ceb384f1897d742134e7d2c9da5a62650ed1274f0ee4c5a17fa9cac1a24b6dc4\" exits with exit code 0 and error
    

Recovery steps

  1. (ncn-mw#) Restart the kubelet on the node with the issue.

    systemctl restart kubelet
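
    To check whether the restart helped, confirm that kubelet is active again and retry a simple kubectl query, for example:

    systemctl is-active kubelet
    kubectl get nodes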
    

    If restarting the kubelet did not resolve the issue, then proceed to the next step.

  2. (ncn-mw#) Restart the container runtime environment on the node with the issue.

    This command is likely to hang or fail to complete, and it does not time out on its own. If that is the case, then cancel the command with Control-C and proceed to the next step.

    systemctl restart containerd
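
    If the restart does complete, confirm that containerd is active again before retrying kubectl:

    systemctl is-active containerd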
    
  3. (ncn#) Reboot the node with the issue.

    The node must be rebooted if restarting kubelet and containerd did not resolve the kworker and high load average issue.

    IMPORTANT: Do NOT run the commands in this step on the node experiencing the problem. Access to that node will be cut off when it is powered off.

    Replace the NCN_NAME value in the commands below with the name of the node experiencing the issue. In this example, it is ncn-w999.

    read -s is used to prevent the password from being written to the screen or the shell history.

    NCN_NAME=ncn-w999
    USERNAME=root
    read -r -s -p "${NCN_NAME} BMC ${USERNAME} password: " IPMI_PASSWORD
    export IPMI_PASSWORD
    ipmitool -U "${USERNAME}" -E -I lanplus -H "${NCN_NAME}-mgmt" power off; sleep 5
    ipmitool -U "${USERNAME}" -E -I lanplus -H "${NCN_NAME}-mgmt" power status; echo
    ipmitool -U "${USERNAME}" -E -I lanplus -H "${NCN_NAME}-mgmt" power on; sleep 5
    ipmitool -U "${USERNAME}" -E -I lanplus -H "${NCN_NAME}-mgmt" power status; echo
    
  4. Watch the console of the node being rebooted.

    This can be done using the Cray console service or with ipmitool.

    • The recommended method is to use the Cray console service. See Log in to a Node Using ConMan.

    • (ncn#) Alternatively, the console can be accessed by using ipmitool.

      ipmitool -U "${USERNAME}" -E -I lanplus -H "${NCN_NAME}-mgmt" sol activate
      

      This command does not return to the shell prompt; it displays the ttyS0 console of the node. Use ~. to disconnect. NOTE: The same ~. keystroke can also terminate an SSH session, in which case the connection to the SSH session may need to be reestablished.

  5. (ncn-mw#) Try running a kubectl command on the node where it was previously unresponsive.
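
    For example, a lightweight query such as listing the nodes confirms that the Kubernetes API is reachable from this node again:

    kubectl get nodes -o wide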