The ConMan pod can be used to establish a serial console connection with each non-compute node (NCN) in the system.
In the scenario of a power down or reboot of an NCN worker, one must first determine if any cray-console pods
are running on that NCN. It is important to move cray-console pods to other worker nodes before rebooting or
powering off a worker node to minimize disruption in console logging. If a brief interruption in console logging
and interactive access is acceptable while the NCN worker is being drained, then the evacuation may be skipped.
If a cray-console-node pod is running on a worker node when it is powered off or rebooted, then access to its
associated consoles will be unavailable until one of the following things happens:
- The cray-console-node pod begins running on it again.
- The cray-console-node pod is terminated and comes up on another worker node.
- The cray-console-operator pod assigns the associated consoles to a different cray-console-node pod.

The user performing these procedures needs to have access permission to the cray-console-operator and cray-console-node pods.
(ncn-mw#) Find the cray-console-operator pod.
OP_POD=$(kubectl get pods -n services \
    -o wide | grep cray-console-operator | awk '{print $1}')
echo $OP_POD
Example output:
cray-console-operator-6cf89ff566-kfnjr
(ncn-mw#) Find the cray-console-node pod that is connecting with the console.
Set XNAME to the component name (xname) of the NCN of interest before running this command.
NODE_POD=$(kubectl -n services exec $OP_POD -c cray-console-operator -- sh -c \
    "/app/get-node $XNAME" | jq .podname | sed 's/"//g')
echo $NODE_POD
Example output:
cray-console-node-1
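The pipeline above strips jq's JSON quoting with sed; jq's `-r` flag emits the raw string directly and makes the sed step unnecessary. A minimal local sketch, using a sample payload whose shape is an assumption inferred from the documented pipeline (not captured from a real /app/get-node call):

```shell
# Sample JSON shaped like the assumed /app/get-node response.
SAMPLE='{"podname":"cray-console-node-1"}'

# jq -r prints the raw string, so no quote-stripping is needed.
NODE_POD_RAW=$(echo "$SAMPLE" | jq -r .podname)

# Equivalent result via the pipeline used in this procedure.
NODE_POD_SED=$(echo "$SAMPLE" | jq .podname | sed 's/"//g')

echo "$NODE_POD_RAW"   # cray-console-node-1
```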
(ncn-mw#) Check which NCN worker node the cray-console-node pod is running on.
kubectl -n services get pods -o wide | grep $NODE_POD
Example output:
cray-console-node-1 3/3 Running 0 3h55m 10.42.0.12 ncn-w010 <none> <none>
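In `kubectl get pods -o wide` output, the NODE value is the seventh whitespace-separated column, so the worker name can also be captured directly in a variable. A sketch using the sample line above (no cluster required):

```shell
# Simulated `kubectl get pods -o wide | grep $NODE_POD` line from the
# example output in this procedure.
LINE='cray-console-node-1 3/3 Running 0 3h55m 10.42.0.12 ncn-w010 <none> <none>'

# Column 7 of `-o wide` output is the node the pod is running on.
WORKER=$(echo "$LINE" | awk '{print $7}')
echo "$WORKER"   # ncn-w010
```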
If the pod is running on the node that is going to be rebooted, then the interactive session
and logging will be interrupted while the NCN worker is drained and the pods are all
migrated to different NCN workers. To maintain an interactive console session, the
cray-console-node pod must be moved:
Cordon the NCN worker node to suspend scheduling, then delete the pod.
WNODE=ncn-wxxx
kubectl cordon $WNODE
kubectl -n services delete pod $NODE_POD
Wait for the pod to restart on another NCN worker.
Repeat the previous steps to verify that the console is now being handled by a cray-console-node pod on a different worker node.
NOTE: To minimize the disruption to console logging and interaction, follow the Evacuation procedure below to remove all console logging services from this node prior to draining it.
(ncn-mw#) Establish a serial console session with the desired NCN.
kubectl -n services exec -it $NODE_POD -c cray-console-node -- conman -j $XNAME
The console session log files for each NCN are located in a shared volume in the cray-console-node pods.
In those pods, the log files are in the /var/log/conman/ directory and are named console.<xname>.
Exit the connection to the console by entering &. (an ampersand followed by a period).
To avoid losing console data while monitoring a reboot or power down of a worker node, first follow this procedure to evacuate the target worker node of its cray-console pods.
(ncn-mw#) Set the WNODE variable to the name of the worker node being evacuated.
Modify the following example to reflect the actual worker node number.
WNODE=ncn-wxxx
(ncn-mw#) Cordon the node so that rescheduled pods do not end up back on the same node.
kubectl cordon $WNODE
(ncn-mw#) Find all cray-console pods that need to be migrated.
This includes cray-console-node, cray-console-data (but not its Postgres pods), and cray-console-operator.
kubectl get pods -n services -l 'app.kubernetes.io/name in (cray-console-node, cray-console-data, cray-console-operator)' \
    --field-selector spec.nodeName=$WNODE --no-headers | awk '{print $1}'
Example output:
cray-console-operator-6cf89ff566-kfnjr
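Without `--no-headers`, `kubectl get` prints a NAME header row, and a bare `awk '{print $1}'` would pass the literal string NAME into any later delete loop. A minimal local sketch of the filtering, using simulated kubectl output (the pod name is the sample from this page):

```shell
# Simulated default `kubectl get pods` output, including the header row.
OUTPUT='NAME                                     READY   STATUS    RESTARTS   AGE
cray-console-operator-6cf89ff566-kfnjr   2/2     Running   0          3h'

# Skipping the first record (NR>1) drops the header; on a real system,
# `kubectl ... --no-headers` achieves the same thing at the source.
PODS=$(echo "$OUTPUT" | awk 'NR>1 {print $1}')
echo "$PODS"   # cray-console-operator-6cf89ff566-kfnjr
```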
(ncn-mw#) Delete the cray-console-operator and cray-console-data pods listed in the previous step.
If none were listed, then skip this step.
Delete the pods.
for POD in $(kubectl get pods -n services -l 'app.kubernetes.io/name in (cray-console-data, cray-console-operator)' \
    --field-selector spec.nodeName=$WNODE --no-headers | awk '{print $1}'); do
kubectl -n services delete pod $POD
done
(ncn-mw#) Wait for the console-operator and console-data pods to be re-scheduled on other nodes.
Run the following command until both deployments show 2/2 pods are ready.
kubectl -n services get deployment | grep cray-console
Example output:
NAME                    READY   UP-TO-DATE   AVAILABLE   AGE
cray-console-data       2/2     2            2           1m
cray-console-operator   2/2     2            2           1m
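"Ready" here means the READY column reads N/N, i.e. the count before the slash matches the count after it for every cray-console deployment. A minimal local sketch of that check, using simulated output (the replica counts are samples; real systems may differ):

```shell
# Simulated `kubectl -n services get deployment | grep cray-console` output.
STATUS='cray-console-data 2/2 2 2 1m
cray-console-operator 2/2 2 2 1m'

# Print the name of any deployment whose READY column is not N/N.
NOT_READY=$(echo "$STATUS" | awk '{split($2, r, "/"); if (r[1] != r[2]) print $1}')

if [ -z "$NOT_READY" ]; then
  READY_STATE=ready
else
  READY_STATE=waiting
fi
echo "$READY_STATE"   # ready
```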
(ncn-mw#) Delete any cray-console-node pods listed in the earlier step.
If none were listed, then skip this step.
Delete the pods.
for POD in $(kubectl get pods -n services -l 'app.kubernetes.io/name=cray-console-node' --field-selector spec.nodeName=$WNODE --no-headers | awk '{print $1}'); do
kubectl -n services delete pod $POD
done
(ncn-mw#) Wait for the console-node pods to be re-scheduled on other nodes.
Run the following command until all pods show ready.
kubectl -n services get statefulset cray-console-node
Example output:
NAME READY AGE
cray-console-node 3/3 1m
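This wait can be automated with a small polling loop. The sketch below substitutes a stub function for the kubectl query so it runs anywhere; on a real system, replace the stub body with `kubectl -n services get statefulset cray-console-node --no-headers` (and consider adding a timeout):

```shell
# Stub standing in for the kubectl statefulset query; here it reports
# ready immediately so the example terminates.
get_statefulset() {
  echo 'cray-console-node 3/3 1m'
}

# Poll until the READY column reads N/N (exit status 0 when ready).
until get_statefulset | awk '{split($2, r, "/"); exit (r[1] != r[2])}'; do
  sleep 5
done
STS_READY=yes
echo "$STS_READY"   # yes
```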
(ncn-mw#) After the node has been rebooted and can accept cray-console pods again, remove the node cordon.
kubectl uncordon $WNODE