Troubleshoot Pods Failing to Restart on Other Worker Nodes

Troubleshoot an issue where pods cannot restart on another worker node because of the “Volume is already exclusively attached to one node and can’t be attached to another” error. Kubernetes does not currently support the ReadWriteMany access mode for Rados Block Device (RBD) volumes, which can cause devices to fail to unmap correctly.

The issue occurs when the mounts tied to the RBD device are not unmounted cleanly, which causes the rbd task (watcher) to not stop for the RBD device.
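
For example, the access mode on the affected claim can be confirmed directly; the following is a minimal check, assuming the vault/vault-raft-cray-vault-0 claim used in the examples later in this procedure:

    kubectl get pvc -n vault vault-raft-cray-vault-0 -o jsonpath='{.spec.accessModes}'

An RBD-backed claim provisioned by this driver is expected to report ["ReadWriteOnce"].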

WARNING: If this process is followed and there are mount points that cannot be unmounted without using the force option, then a process may still be writing to them. If mount points are forcefully unmounted, there is a high probability of data loss or corruption.

Prerequisites

This procedure requires administrative privileges.

Procedure

  1. Force delete the pod.

    This may not be successful, but it is important to try before proceeding.

    NAMESPACE=vault
    POD_NAME=cray-vault-0
    kubectl delete pod -n $NAMESPACE $POD_NAME --force --grace-period=0
    
  2. If the previous step did not resolve the issue, log in to a manager node and proceed with the remaining steps.

  3. Describe the pod experiencing issues.

    The returned Persistent Volume Claim (PVC) information will be needed in future steps.

    kubectl -n $NAMESPACE describe pod $POD_NAME
    

    Example output:

    [...]
    
    Events:
      Type     Reason              Age   From                     Message
      ----     ------              ----  ----                     -------
      Normal   Scheduled           23s   default-scheduler        Successfully assigned vault/cray-vault-0 to ncn-w003
      Warning  FailedAttachVolume  23s   attachdetach-controller  Multi-Attach error for volume "pvc-186dc7a5-9c9a-450b-b856-4308c331b373" Volume is already exclusively attached to one node and can't be attached to another
    

    In this example, pvc-186dc7a5-9c9a-450b-b856-4308c331b373 is the PVC information required for the next step.
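
    If the event message is hard to copy from, the same value can also be retrieved from the pod and its claim; the following is a sketch that assumes the pod mounts a single Persistent Volume Claim:

    CLAIM_NAME=$(kubectl get pod -n $NAMESPACE $POD_NAME \
      -o jsonpath='{.spec.volumes[*].persistentVolumeClaim.claimName}')
    kubectl get pvc -n $NAMESPACE $CLAIM_NAME -o jsonpath='{.spec.volumeName}'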

  4. Retrieve the Ceph volume.

    PVC_NAME=pvc-186dc7a5-9c9a-450b-b856-4308c331b373
    kubectl describe -n $NAMESPACE pv $PVC_NAME
    

    Example output:

    Name:            pvc-186dc7a5-9c9a-450b-b856-4308c331b373
    Labels:          <none>
    Annotations:     pv.kubernetes.io/provisioned-by: rbd.csi.ceph.com
    Finalizers:      [kubernetes.io/pv-protection external-provisioner.volume.kubernetes.io/finalizer]
    StorageClass:    k8s-block-replicated
    Status:          Bound
    Claim:           vault/vault-raft-cray-vault-0
    Reclaim Policy:  Delete
    Access Modes:    RWO
    VolumeMode:      Filesystem
    Capacity:        2Gi
    Node Affinity:   <none>
    Message:
    Source:
        Type:              CSI (a Container Storage Interface (CSI) volume source)
        Driver:            rbd.csi.ceph.com
        FSType:            ext4
        VolumeHandle:      0001-0024-f17091d2-31a2-11f0-b30a-42010afc0104-000000000000000b-10f31479-31a7-11f0-9092-ea5ff752cdc5
        ReadOnly:          false
        VolumeAttributes:      clusterID=f17091d2-31a2-11f0-b30a-42010afc0104
                               imageFeatures=layering
                               imageName=csi-vol-10f31479-31a7-11f0-9092-ea5ff752cdc5
                               journalPool=kube
                               pool=kube
                               storage.kubernetes.io/csiProvisionerIdentity=1747325223051-8081-rbd.csi.ceph.com
    Events:                <none>
    
  5. Save the RBD name.

    CEPH_IMAGE_NAME=$(kubectl get pv -n $NAMESPACE $PVC_NAME -o json | \
    jq -r '.spec.csi.volumeAttributes.imageName')
    RBD_NAME=$(kubectl get pv -n $NAMESPACE $PVC_NAME -o json | \
    jq -r '"\(.spec.csi.volumeAttributes.pool)/\(.spec.csi.volumeAttributes.imageName)"')
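
    Optionally, verify the saved values before continuing; with the example PV above, the output is expected to match the pool and imageName fields from its VolumeAttributes:

    echo $CEPH_IMAGE_NAME    # for example, csi-vol-10f31479-31a7-11f0-9092-ea5ff752cdc5
    echo $RBD_NAME           # for example, kube/csi-vol-10f31479-31a7-11f0-9092-ea5ff752cdc5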
    
  6. Find the worker node that has the RBD locked.

    1. Find the RBD status.

      Take a note of the returned IP address.

      rbd status $RBD_NAME
      

      Example output:

      Watchers:
              watcher=10.252.1.11:0/3628969487 client.74826 cookie=18446462598732840963
      
    2. Use the returned IP address to get the host name attached to it.

      Take note of the returned host name.

      IP_ADDRESS=10.252.1.11
      grep $IP_ADDRESS /etc/hosts
      

      Example output:

      10.252.1.11     ncn-w005.nmn ncn-w005
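
      If the address does not appear in /etc/hosts, the worker node can usually be identified from the node list instead; for example:

      kubectl get nodes -o wide | grep $IP_ADDRESS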
      
  7. Save the host name returned in the previous step.

    HOST_NAME=ncn-w005
    
  8. Unmap the device.

    1. Find the RBD number.

      Use the CEPH_IMAGE_NAME value that was saved earlier.

      ssh $HOST_NAME rbd showmapped | grep $CEPH_IMAGE_NAME
      

      Example output:

      2   kube             csi-vol-10f31479-31a7-11f0-9092-ea5ff752cdc5  -     /dev/rbd2
      

      Take note of the returned RBD number, which will be used in the next step.
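
      Alternatively, the device path can be printed in a single command from the manager node; this is a sketch that assumes the device is the last column of the rbd showmapped output, which can vary between Ceph releases:

      ssh $HOST_NAME "rbd showmapped | grep $CEPH_IMAGE_NAME" | awk '{print $NF}'

      With the example above, this prints /dev/rbd2.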

    2. SSH to the host and set the needed variables.

      ssh $HOST_NAME
      RBD_NUMBER=rbd2
      
    3. Verify that the device is not in use by a container that has not stopped.

      mount|grep $RBD_NUMBER
      

      If no mount points are returned, proceed to the next step. If mount points are returned, run the following command for each mount point (a loop covering all returned mount points is sketched after the warning below):

      umount MOUNT_POINT
      

      Troubleshooting: If that still does not succeed, use the umount -f option.

      WARNING: If mount points are forcefully unmounted, there is a chance for data loss or corruption.
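
      If several mount points are returned, they can be unmounted in one pass; the following loop is a sketch that assumes the mount point is the third field of the mount output and that none of the mounts require the force option:

      for MOUNT_POINT in $(mount | grep $RBD_NUMBER | awk '{print $3}'); do
          umount "$MOUNT_POINT"
      done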

    4. Unmap the device.

      rbd unmap -o force /dev/$RBD_NUMBER
      
    5. Disconnect from the host.

      exit
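
      Back on the manager node, it is worth confirming that the watcher is gone before checking the pod; the rbd status command from earlier can be reused:

      rbd status $RBD_NAME

      The output is expected to report that there are no longer any watchers (Watchers: none).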
      
  9. Check the status of the pod.

    kubectl get pod -n $NAMESPACE $POD_NAME
    

    Troubleshooting: If the pod status has not changed, try deleting the pod to restart it.

    kubectl delete pod -n $NAMESPACE $POD_NAME
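
    The pod can also be watched until it returns to the Running state; for example:

    kubectl get pod -n $NAMESPACE $POD_NAME -w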