Establish a Serial Connection to NCNs

The ConMan pod can be used to establish a serial console connection with each non-compute node (NCN) in the system.

Before powering down or rebooting an NCN worker, first determine whether any cray-console pods are running on it. Moving cray-console pods to other worker nodes before the reboot or power off minimizes disruption to console logging. If a brief interruption in console logging and interactive access is acceptable while the NCN worker is drained, then the evacuation may be skipped.
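
For example, a quick way to check for cray-console pods on a specific worker (ncn-w001 here is a hypothetical node name; substitute the actual worker):

    kubectl get pods -n services -o wide | grep cray-console | grep ncn-w001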

If a cray-console-node pod is running on a worker node when it is powered off or rebooted, then access to its associated consoles will be unavailable until one of the following occurs:

  • The worker node comes back up and the cray-console-node pod resumes running on it.
  • The cray-console-node pod is terminated and comes up on another worker node.
  • The cray-console-operator pod assigns the associated consoles to a different cray-console-node pod.

Prerequisites

The user performing these procedures must have permission to access the cray-console-operator and cray-console-node pods.

Connection procedure

  1. (ncn-mw#) Find the cray-console-operator pod.

    OP_POD=$(kubectl get pods -n services \
            -o wide | grep cray-console-operator | awk '{print $1}')
    echo $OP_POD
    

    Example output:

    cray-console-operator-6cf89ff566-kfnjr
    
  2. (ncn-mw#) Find the cray-console-node pod that is connected to the console of the target NCN. The XNAME variable must be set to the component name (xname) of that NCN before running this command.
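
    For example (hypothetical xname):

    XNAME=x3000c0s7b0n0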

    NODE_POD=$(kubectl -n services exec $OP_POD -c cray-console-operator -- sh -c \
        "/app/get-node $XNAME" | jq -r .podname)
    echo $NODE_POD
    

    Example output:

    cray-console-node-1
    
  3. (ncn-mw#) Check which NCN worker node the cray-console-node pod is running on.

    kubectl -n services get pods -o wide | grep $NODE_POD
    

    Example output:

    cray-console-node-1   3/3  Running  0  3h55m   10.42.0.12  ncn-w010   <none>   <none>
    

    If the pod is running on the node that is going to be rebooted, then the interactive session and logging will be interrupted while the NCN worker is drained and the pods are all migrated to different NCN workers. To maintain an interactive console session, the cray-console-node pod must be moved:

    1. Cordon the NCN worker node to suspend scheduling on it, then delete the pod. Modify the following example to reflect the actual worker node name.

      WNODE=ncn-wxxx
      kubectl cordon $WNODE
      kubectl -n services delete pod $NODE_POD
      
    2. Wait for the pod to restart on another NCN worker. One way to watch for this is sketched below.

    3. Repeat step 2 of this procedure to verify that the node's console is now being monitored by a different cray-console-node pod.

    NOTE: To minimize the disruption to console logging and interaction, follow the Evacuation procedure below to remove all console logging services from the node prior to draining it.
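
    A sketch of one way to watch for the replacement pod, using the same label selector as the Evacuation procedure below; press Ctrl-C to stop watching once the pod shows Running on a different worker:

    kubectl -n services get pods -l app.kubernetes.io/name=cray-console-node -o wide -w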

  4. (ncn-mw#) Establish a serial console session with the desired NCN.

    kubectl -n services exec -it $NODE_POD -c cray-console-node -- conman -j $XNAME
    

    The console session log files for each NCN are located in a shared volume in the cray-console-node pods. In those pods, the log files are in the /var/log/conman/ directory and are named console.<xname>.

  5. Exit the connection to the console by typing &. (an ampersand followed by a period).
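
    The log files described above can also be followed without opening an interactive session. A minimal sketch, assuming NODE_POD and XNAME are still set from the earlier steps and that tail is available in the container:

    kubectl -n services exec $NODE_POD -c cray-console-node -- tail -f /var/log/conman/console.$XNAME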

Evacuation procedure

To avoid losing console data while monitoring a reboot or power down of a worker node, first follow this procedure to evacuate the cray-console pods from the target worker node.

  1. (ncn-mw#) Set the WNODE variable to the name of the worker node being evacuated.

    Modify the following example to reflect the actual worker node number.

    WNODE=ncn-wxxx
    
  2. (ncn-mw#) Cordon the node so that rescheduled pods do not end up back on the same node.

    kubectl cordon $WNODE
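
    To confirm that scheduling is suspended on the node, check that its STATUS column includes SchedulingDisabled:

    kubectl get node $WNODE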
    
  3. (ncn-mw#) Find all cray-console pods that need to be migrated.

    This includes cray-console-node, cray-console-data (but not its Postgres pods), and cray-console-operator.

    kubectl get pods -n services -l 'app.kubernetes.io/name in (cray-console-node, cray-console-data, cray-console-operator)' \
      --field-selector spec.nodeName=$WNODE --no-headers | awk '{print $1}'
    

    Example output:

    cray-console-operator-6cf89ff566-kfnjr
    
  4. (ncn-mw#) Delete the cray-console-operator and cray-console-data pods listed in the previous step.

    If none were listed, then skip this step.

    1. Delete the pods.

      for POD in $(kubectl get pods -n services -l 'app.kubernetes.io/name in (cray-console-data, cray-console-operator)' \
          --field-selector spec.nodeName=$WNODE --no-headers | awk '{print $1}'); do
          kubectl -n services delete pod $POD
      done
      
    2. Wait for the console-operator and console-data pods to be re-scheduled on other nodes.

      Run the following command until both deployments show 2/2 pods are ready.

      kubectl -n services get deployment | grep cray-console
      

      Example output:

      cray-console-data           2/2     2          2     1m
      cray-console-operator       2/2     2          2     1m
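
      Alternatively, a rollout status check blocks until each deployment is ready; a brief sketch:

      kubectl -n services rollout status deployment cray-console-data
      kubectl -n services rollout status deployment cray-console-operator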
      
  5. (ncn-mw#) Delete any cray-console-node pods listed in the earlier step.

    If none were listed, then skip this step.

    1. Delete the pods.

      for POD in $(kubectl get pods -n services -l 'app.kubernetes.io/name=cray-console-node' \
          --field-selector spec.nodeName=$WNODE --no-headers | awk '{print $1}'); do
          kubectl -n services delete pod $POD
      done
      
    2. Wait for the console-node pods to be re-scheduled on other nodes.

      Run the following command until all pods show ready.

      kubectl -n services get statefulset cray-console-node
      

      Example output:

      NAME                READY   AGE
      cray-console-node   3/3     1m
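
      Before proceeding, it may be worth confirming that no cray-console pods remain on the worker; this check should produce no output:

      kubectl get pods -n services -o wide --field-selector spec.nodeName=$WNODE | grep cray-console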
      
  6. (ncn-mw#) After the node has been rebooted and can accept cray-console pods again, remove the node cordon.

    kubectl uncordon $WNODE
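
    To verify that the cordon was removed, check that the node's STATUS shows Ready without SchedulingDisabled:

    kubectl get node $WNODE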