Migrating Kubernetes CNI from Weave to Cilium

This page describes how to migrate the Kubernetes CNI (Container Network Interface) from Weave to Cilium during a CSM upgrade.

Steps

  1. (ncn-m#) Run the migration script:

    /usr/share/doc/csm/scripts/cilium_migration.sh
    

    This script will:

    • Create and execute the migration workflow in the argo namespace.
    • Migrate the CNI from Weave to Cilium.
    • Continuously monitor the workflow status using kubectl.
  2. (ncn-mw#) Monitor the migration workflow:

    The workflow status can also be tracked using the Argo CLI.

    Use the argo watch command to view the overall progress of the workflow:

    argo watch <workflow-name> -n argo
    

    Use the argo logs command to follow the workflow output in more detail:

    argo logs <workflow-name> -n argo -f
    

    In both commands, replace <workflow-name> with the name of the workflow created by the cilium_migration.sh script.
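
    If the workflow name is not known, it can be found by listing the workflows in the argo namespace. For example:

    # List Argo workflows; the migration workflow name will vary per run
    argo list -n argo

    # Equivalent check using kubectl against the Workflow custom resource
    kubectl get workflows -n argo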

Known issues

Node drain blocked by Kafka

While the workflow is restarting pods on the NCN worker nodes, it may get stuck trying to evict cray-shared-kafka-kafka or SMA cluster-kafka pods.

Example output:

evicting pod services/cray-shared-kafka-kafka-1
error when evicting pods/"cray-shared-kafka-kafka-1" -n "services" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget. 

The issue is that one of the restarted Kafka pods cannot communicate with Zookeeper. This is the problem described in cfs-api pods in CLBO state during CSM install, and it has the same workaround.
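
(ncn-mw#) To confirm this before applying the workaround below, check the stuck pod's logs for Zookeeper connection errors and inspect the disruption budget blocking the eviction. This is a minimal sketch; the pod name is taken from the example output above, and the grep patterns are assumptions about the log and resource names.

# Check the stuck Kafka pod's logs for Zookeeper connection errors
# (pod name from the example output above; adjust to match the stuck pod)
kubectl logs -n services cray-shared-kafka-kafka-1 --tail=100 | grep -i zookeeper

# Show the PodDisruptionBudget that is blocking the eviction
kubectl get pdb -n services | grep -i kafka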

  • (ncn-mw#) If the stuck pod is part of cray-shared-kafka, then restart the cray-shared-kafka Zookeeper pods:

    kubectl delete pods -n services -l strimzi.io/controller-name=cray-shared-kafka-zookeeper
    
  • (ncn-mw#) If the stuck pod is a member of SMA cluster-kafka, then restart the SMA Zookeeper pods:

    kubectl delete pod -n sma -l strimzi.io/controller-name=cluster-zookeeper
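
Deleting the Zookeeper pods causes Kubernetes to recreate them, and the blocked eviction retries automatically every 5 seconds (as shown in the example output above), so the drain should resume on its own. To watch the restarted Zookeeper pods return to a Running state (using the label from the first workaround above):

# Watch the cray-shared-kafka Zookeeper pods until they are Running and Ready
kubectl get pods -n services -l strimzi.io/controller-name=cray-shared-kafka-zookeeper --watch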
    

Node drain blocked by an etcd cluster

While the workflow is restarting pods on the NCN worker nodes, it may get stuck trying to evict pods from an etcd cluster.

Example output:

evicting pod services/cray-hbtd-bitnami-etcd-2
error when evicting pods/"cray-hbtd-bitnami-etcd-2" -n "services" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget. 

See etcd Pods in CLBO State for more information and a workaround.
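
(ncn-mw#) Before applying that workaround, it can help to confirm which etcd member pods are unhealthy and which disruption budget is blocking the eviction. A minimal check, using the cluster name from the example output above:

# List the pods of the affected etcd cluster (name from the example output above)
kubectl get pods -n services | grep cray-hbtd-bitnami-etcd

# Show the disruption budget that is blocking the eviction
kubectl get pdb -n services | grep etcd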