There are many things that can prevent the ConMan service from successfully connecting to a single node console or that can cause problems with the entire deployment. This page describes how to examine each part of the service to determine where the current problem lies.
The user performing these procedures needs to have access permission to the cray-console-operator and cray-console-node pods.
(ncn-mw#) There will be a lot of interaction with the cray-console-operator pod, so set up an environment variable to refer to that pod:
OP_POD=$(kubectl get pods -n services -o wide|grep cray-console-operator|awk '{print $1}')
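(ncn-mw#) Optionally, confirm that the variable is set; it should contain a single pod name similar to the cray-console-operator pod shown in the listing below:
echo $OP_POD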
There are a number of pods that work together to provide the console services. If any of the pods are not working correctly, it will impact the ability to connect to specific consoles and monitor the console logs.
(ncn-mw#) Query Kubernetes to inspect the console pods:
kubectl -n services get pods | grep console
Example output:
cray-console-data-5448f99fb-mbrf4 2/2 Running 0 18d
cray-console-data-postgres-0 3/3 Running 0 18d
cray-console-data-postgres-1 3/3 Running 0 18d
cray-console-data-postgres-2 3/3 Running 0 18d
cray-console-data-wait-for-postgres-1-rnrt6 0/2 Completed 0 18d
cray-console-node-0 3/3 Running 0 18d
cray-console-node-1 3/3 Running 0 18d
cray-console-operator-6d4d5b84d9-rwsdd 2/2 Running 3 25d
There should be one cray-console-operator pod.
There should be multiple cray-console-node pods. A standard deployment starts with two pods and scales up from there depending on the size of the system and the configuration.
There should be one cray-console-data pod and three cray-console-data-postgres pods.
All pods should be in the Completed or Running state. If any pods are in another state, use the usual Kubernetes techniques to find out what is wrong with those pods.
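For example, kubectl describe and the pod logs are the usual starting points. The pod name here is only a placeholder taken from the example listing above:
kubectl -n services describe pod cray-console-node-0
kubectl -n services logs cray-console-node-0 --all-containers --tail=50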
cray-console-node pod for a specific node
The first thing to check is whether the specific node is assigned to a cray-console-node pod, which should be monitoring the node for log traffic and providing a means to interact with the console.
(ncn-mw#) Set the component name (xname) of the node whose console is being checked.
XNAME="xName of the node - e.g. x3000c0s19b2n0"
(ncn-mw#) Find which cray-console-node pod the node is assigned to.
NODEPOD=$(kubectl -n services exec $OP_POD -c cray-console-operator -- \
sh -c "/app/get-node $XNAME" | jq .podname | sed 's/"//g')
echo $NODEPOD
The returned pod name should be one of the cray-console-node pods seen in the listing of pods above. An expected output from the above command would be:
cray-console-node-0
If this is the case, proceed to Troubleshoot ConMan Failing to Connect to a Console.
If the node is not assigned to a cray-console-node pod, then the result will contain an invalid pod name. For example:
cray-console-node-
In this case, proceed to Investigate service problem to find out why the node is not assigned to a pod.
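(ncn-mw#) When scripting this check, one option is to test whether the returned name ends in a pod number. This is a minimal sketch that reuses the XNAME and NODEPOD variables set above:
if [[ "$NODEPOD" =~ ^cray-console-node-[0-9]+$ ]]; then
    echo "$XNAME is assigned to pod $NODEPOD"
else
    echo "$XNAME is not assigned to a cray-console-node pod"
fi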
Investigate service problem
When the entire service is having problems, the next step is to determine which component is causing the issue. All three services need to work together to provide console connections.
Check the underlying database.
Sometimes the cray-console-data pods can report healthy, but the actual Postgres instance can be unhealthy. See Investigate Postgres deployment for information on how to investigate further.
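(ncn-mw#) As an optional first pass before digging into Postgres itself, the log of the cray-console-data pod may show whether the service is failing to reach its database. This is a minimal sketch; the DATA_POD variable is introduced here only for illustration, and the grep patterns simply select the non-Postgres data pod from the listing above:
DATA_POD=$(kubectl -n services get pods | grep cray-console-data | grep -v postgres | awk '{print $1}')
kubectl -n services logs "$DATA_POD" --all-containers --tail=50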
(ncn-mw#) Restart the cray-console-operator pod.
There are rare cases where the cray-console-operator pod reports as Running to Kubernetes but is actually unhealthy. In this case, restarting the pod will resolve the issue and re-establish communication between the services.
Restart the pod.
kubectl -n services delete pod $OP_POD
Wait for the new cray-console-operator pod to reach a Running state.
kubectl -n services get pods | grep cray-console-operator
Example output when ready to proceed:
cray-console-operator-6d4d5b84d9-66svs 2/2 Running 0 60s
Now there is a different pod name, so the OP_POD variable needs to be set again.
OP_POD=$(kubectl get pods -n services -o wide|grep cray-console-operator|awk '{print $1}')
Wait several minutes, then see if the issue is resolved.
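(ncn-mw#) One way to check is to repeat the node assignment query from earlier; once the operator has recovered, it should return a valid cray-console-node pod name:
kubectl -n services exec $OP_POD -c cray-console-operator -- \
    sh -c "/app/get-node $XNAME" | jq .podname | sed 's/"//g'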
Restart the entire set of services.
To restart everything from scratch, follow the directions in Complete Reset of the Console Services.
Investigate Postgres deployment
Sometimes the database that holds the current status information for the console services has problems that keep it from saving and reporting data. Depending on when this happens, the other services may be in different states of managing node consoles. The cray-console-node pods will continue to monitor the nodes already assigned to them, but if a pod restarts or new nodes are added to the system, no new node assignments can be made to the currently running pods. This may leave some cray-console-node pods continuing to monitor nodes while other pods have no nodes assigned to them.
NOTE
There is no persistent data in the cray-console-data Postgres database. It only contains current state information and will rebuild itself automatically once it is functional again. There is no need to save or restore data from this database.
Check on the current running state of the cray-console-data-postgres database.
(ncn-mw#) Find the cray-console-data-postgres pods and note one that is in the Running state.
kubectl -n services get pods | grep cray-console-data-postgres
Example output:
cray-console-data-postgres-0 3/3 Running 0 26d
cray-console-data-postgres-1 3/3 Running 0 26d
cray-console-data-postgres-2 3/3 Running 0 26d
(ncn-mw#) Log into one of the healthy pods.
DATA_PG_POD=cray-console-data-postgres-1
kubectl -n services exec -it $DATA_PG_POD -c postgres -- sh
(pod#) Check the status of the database.
patronictl list
Expected result for a healthy database:
+ Cluster: cray-console-data-postgres (7244964360609890381) ---+----+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+------------------------------+------------+--------+---------+----+-----------+
| cray-console-data-postgres-0 | 10.43.0.8 | Leader | running | 1 | |
| cray-console-data-postgres-1 | 10.37.0.45 | | running | 1 | 0 |
| cray-console-data-postgres-2 | 10.32.0.52 | | running | 1 | 0 |
+------------------------------+------------+--------+---------+----+-----------+
Example output if replication is broken:
+ Cluster: cray-console-data-postgres (7244964360609890381) ----+----+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+------------------------------+------------+--------+----------+----+-----------+
| cray-console-data-postgres-0 | 10.43.0.8 | | starting | | unknown |
| cray-console-data-postgres-1 | 10.37.0.45 | Leader | running | 47 | 0 |
| cray-console-data-postgres-2 | 10.32.0.52 | | running | 14 | 608 |
+------------------------------+------------+--------+---------+----+-----------+
Example output if the leader is missing:
+ Cluster: cray-console-data-postgres (7244964360609890381) --------+----+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+------------------------------+------------+--------+--------------+----+-----------+
| cray-console-data-postgres-0 | 10.43.0.8 | | running | | unknown |
| cray-console-data-postgres-1 | 10.37.0.45 | | start failed | | unknown |
| cray-console-data-postgres-2 | 10.32.0.52 | | start failed | | unknown |
+------------------------------+------------+--------+--------------+----+-----------+
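(ncn-mw#) If the output resembles one of the broken examples above, it can help to capture recent logs from each Postgres replica before attempting repairs. This is a minimal sketch, run from a management node rather than inside the pod shell; it assumes the three replica pods shown in the listing above and uses the postgres container name from the exec command above:
for i in 0 1 2; do
    echo "=== cray-console-data-postgres-$i ==="
    kubectl -n services logs cray-console-data-postgres-$i -c postgres --tail=50
done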
If any of the replicas are showing a problem, consult the Postgres troubleshooting procedures to attempt to fix the Postgres instance.
If the database cannot be made healthy through those procedures, the easiest way to resolve this is to perform a complete reset of the console services, including reinstalling the cray-console-data service. See Complete Reset of the Console Services.