Console Services Troubleshooting Guide

There are many things that can prevent the ConMan service from successfully connecting to a single node console, or that can cause problems with the entire deployment. This guide describes how to examine each part of the service in order to determine where the current problem lies.

Prerequisites

(ncn-mw#) The user performing these procedures needs access permission to the cray-console-operator and cray-console-node pods. Because there will be frequent interaction with the cray-console-operator pod, set up an environment variable to refer to that pod:

OP_POD=$(kubectl get pods -n services -o wide | grep cray-console-operator | awk '{print $1}')
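
To confirm that the variable is set, echo it. The value should be a single pod name; the hash suffix will differ on every system:

echo $OP_POD

Example output:

cray-console-operator-6d4d5b84d9-rwsdd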

Check the states of the console pods

There are a number of pods that work together to provide the console services. If any of the pods are not working correctly, it will impact the ability to connect to specific consoles and monitor the console logs.

  1. (ncn-mw#) Query Kubernetes to inspect the console pods:

    kubectl -n services get pods | grep console
    

    Example output:

    cray-console-data-5448f99fb-mbrf4             2/2     Running        0     18d
    cray-console-data-postgres-0                  3/3     Running        0     18d
    cray-console-data-postgres-1                  3/3     Running        0     18d
    cray-console-data-postgres-2                  3/3     Running        0     18d
    cray-console-data-wait-for-postgres-1-rnrt6   0/2     Completed      0     18d
    cray-console-node-0                           3/3     Running        0     18d
    cray-console-node-1                           3/3     Running        0     18d
    cray-console-operator-6d4d5b84d9-rwsdd        2/2     Running        3     25d
    

    There should be one cray-console-operator pod.

    There should be multiple cray-console-node pods. A standard deployment will start with two pods and scale up from there depending on the size of the system and the configuration.

    There should be one cray-console-data pod and three cray-console-data-postgres pods.

    All pods should be in the Completed or Running state. If any pod is in a different state, use standard Kubernetes techniques, such as those shown below, to find out what is wrong with it.
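
    For example, a minimal first look at a pod in a bad state (the pod name below is a placeholder; substitute the name of the failing pod):

    kubectl -n services describe pod <pod-name>
    kubectl -n services logs <pod-name> --all-containers --tail=50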

Find the cray-console-node pod for a specific node

The first thing to check is whether the specific node is assigned to a cray-console-node pod; the assigned pod should be monitoring the node for log traffic and providing a means to interact with the console.

  1. (ncn-mw#) Set the component name (xname) of the node whose console is being checked.

    XNAME="xname of the node, e.g. x3000c0s19b2n0"
    
  2. (ncn-mw#) Find which cray-console-node pod the node is assigned to.

    NODEPOD=$(kubectl -n services exec $OP_POD -c cray-console-operator -- \
        sh -c "/app/get-node $XNAME" | jq -r .podname)
    echo $NODEPOD
    

    The returned pod name should be one of the cray-console-node pods seen in the listing of pods above. For example:

    cray-console-node-0
    

    If a valid pod name is returned, proceed to Troubleshoot ConMan Failing to Connect to a Console.
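
    Optionally, confirm that the pod is actually monitoring that console before moving on. The conman client inside the cray-console-node container can list the consoles it manages (a sketch, assuming the standard container layout); the target xname should appear in the output:

    kubectl -n services exec -it $NODEPOD -c cray-console-node -- conman -q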

    If the node is not assigned to a cray-console-node pod, then the result will have an invalid pod name. For example:

    cray-console-node-
    

    In this case, proceed to Investigate service problem to find out why the node is not assigned to a pod.

Investigate service problem

When the entire service is having problems, the next step is to determine which component is causing the issue. The cray-console-operator, cray-console-node, and cray-console-data services must all work together to provide console connections.

  1. Check the underlying database.

    Sometimes the cray-console-data pods report as healthy while the underlying Postgres instance is unhealthy. See Investigate Postgres deployment for information on how to investigate further.
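
    A quick preliminary check (a sketch, assuming the database is managed by the Zalando Postgres operator, which exposes a postgresql custom resource) is to view the cluster status that the operator reports; a healthy cluster shows a STATUS of Running:

    kubectl -n services get postgresql cray-console-data-postgres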

  2. (ncn-mw#) Restart the cray-console-operator pod.

    There are rare cases where the cray-console-operator pod reports as Running to Kubernetes but is actually unhealthy. In this case, restarting the pod will resolve the issue and re-establish communication between the services.

    1. Restart the pod.

      kubectl -n services delete pod $OP_POD
      
    2. Wait for the new cray-console-operator pod to reach a Running state.

      kubectl -n services get pods | grep cray-console-operator
      

      Example output when ready to proceed:

      cray-console-operator-6d4d5b84d9-66svs       2/2     Running     0    60s
      
    3. The restarted pod has a different name, so the OP_POD variable needs to be set again.

      OP_POD=$(kubectl get pods -n services -o wide | grep cray-console-operator | awk '{print $1}')
      
    4. Wait several minutes, then check whether the issue is resolved, as shown below.
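
      For example, re-running the node assignment query from the earlier section (with XNAME still set to the node being checked) should now return a valid cray-console-node pod name:

      NODEPOD=$(kubectl -n services exec $OP_POD -c cray-console-operator -- \
          sh -c "/app/get-node $XNAME" | jq -r .podname)
      echo $NODEPOD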

  3. Restart the entire set of services.

    To restart everything from scratch, follow the directions in Complete Reset of the Console Services.

Investigate Postgres deployment

Sometimes the database that holds the current status information for the console services has problems that prevent it from saving and reporting data. Depending on when this happens, the other services may be in different states of managing node consoles. The cray-console-node pods will continue to monitor the nodes already assigned to them, but if a pod restarts or new nodes are added to the system, no new node assignments can be made. This may leave some cray-console-node pods still monitoring nodes while other pods have no nodes assigned to them.

NOTE There is no persistent data in the cray-console-data Postgres database. It only contains current state information and will rebuild itself automatically once it is functional again. There is no need to save or restore data from this database.

Check on the current running state of the cray-console-data-postgres database.

  1. (ncn-mw#) Find the cray-console-data-postgres pods and note one that is in the Running state.

    kubectl -n services get pods | grep cray-console-data-postgres
    

    Example output:

    cray-console-data-postgres-0    3/3     Running  0  26d
    cray-console-data-postgres-1    3/3     Running  0  26d
    cray-console-data-postgres-2    3/3     Running  0  26d
    
  2. (ncn-mw#) Log in to one of the healthy pods.

    DATA_PG_POD=cray-console-data-postgres-1
    kubectl -n services exec -it $DATA_PG_POD -c postgres -- sh
    
  3. (pod#) Check the status of the database.

    patronictl list
    

    Expected result for a healthy database:

    + Cluster: cray-console-data-postgres (7244964360609890381) ---+----+-----------+
    |            Member            |    Host    |  Role  |  State  | TL | Lag in MB |
    +------------------------------+------------+--------+---------+----+-----------+
    | cray-console-data-postgres-0 | 10.43.0.8  | Leader | running |  1 |           |
    | cray-console-data-postgres-1 | 10.37.0.45 |        | running |  1 |         0 |
    | cray-console-data-postgres-2 | 10.32.0.52 |        | running |  1 |         0 |
    +------------------------------+------------+--------+---------+----+-----------+
    

    Example output if replication is broken:

    + Cluster: cray-console-data-postgres (7244964360609890381) ----+----+-----------+
    |            Member            |    Host    |  Role  |  State   | TL | Lag in MB |
    +------------------------------+------------+--------+----------+----+-----------+
    | cray-console-data-postgres-0 | 10.43.0.8  |        | starting |    |   unknown |
    | cray-console-data-postgres-1 | 10.37.0.45 | Leader | running  | 47 |         0 |
    | cray-console-data-postgres-2 | 10.32.0.52 |        | running  | 14 |       608 |
    +------------------------------+------------+--------+----------+----+-----------+
    

    Example output if the leader is missing:

    + Cluster: cray-console-data-postgres (7244964360609890381) --------+----+-----------+
    |            Member            |    Host    |  Role  |  State       | TL | Lag in MB |
    +------------------------------+------------+--------+--------------+----+-----------+
    | cray-console-data-postgres-0 | 10.43.0.8  |        | running      |    |   unknown |
    | cray-console-data-postgres-1 | 10.37.0.45 |        | start failed |    |   unknown |
    | cray-console-data-postgres-2 | 10.32.0.52 |        | start failed |    |   unknown |
    +------------------------------+------------+--------+--------------+----+-----------+
    

If any of the replicas are showing a problem, consult the Postgres troubleshooting procedures to attempt to fix the Postgres instance.
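
A common first step for a single broken replica (a sketch, not a substitute for the full troubleshooting procedures) is to reinitialize that member with patronictl from inside a healthy postgres container, as in the previous step. Reinitializing wipes and re-syncs the member's local copy of the data, which is acceptable here because this database holds no persistent data. Substitute the name of the member that is in a bad state:

patronictl reinit cray-console-data-postgres cray-console-data-postgres-0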

If the database cannot be made healthy through these procedures, the easiest way to resolve the problem is to perform a complete reset of the console services, including reinstalling the cray-console-data service. See Complete Reset of the Console Services.