During upgrades from CSM 1.6.x to 1.7.0, the IUF management-nodes-rollout stage that targets worker nodes can hang on the final worker node because of DVS-related errors.
The rollout may appear stalled even though most worker nodes have successfully received the new image. IUF logs may repeatedly show DVS module errors such as the following:
INFO Running before each hook: cos-prechecks-for-worker-reboots
dvs module is not loaded on x3000c0s8b0n0
dvs module is not loaded on x3000c0s9b0n0
dvs module is not loaded on x3000c0s10b0n0
ERROR: HA requires at least 1 running DVS server, but there are none
Pods not running.
Pods related to DVS may be in an Error or NotReady state, and Argo Workflows such as ncn-lifecycle-rebuild may be in a loop or failing.
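As a rough aid for spotting the affected pods, the sketch below filters a pod listing down to DVS entries that are not fully ready. The function name unhealthy_dvs_pods is hypothetical, and the sketch assumes the default column layout of kubectl get pods -A --no-headers (NAMESPACE NAME READY STATUS ...).

```shell
# Hypothetical helper: read `kubectl get pods -A --no-headers` output on stdin
# and print DVS-related pods that are not Running or not fully ready.
# Assumed columns: NAMESPACE NAME READY STATUS RESTARTS AGE
unhealthy_dvs_pods() {
    awk '/dvs/ {
        # READY looks like "1/2": ready containers / total containers
        split($3, ready, "/")
        if ($4 != "Running" || ready[1] != ready[2]) {
            print $1, $2, $3, $4
        }
    }'
}
```

It can be fed from the cluster with kubectl get pods -A --no-headers | unhealthy_dvs_pods. Pods in Completed status are also printed, so the output still needs a quick manual review.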
cos-prechecks-for-worker-reboots is mentioned in the logs.
DVS pods are in NotReady or Error status (check with kubectl get pods -A | grep dvs).
Argo workflows (for example, ncn-lifecycle-rebuild) can become stuck and require manual intervention.
An IUF hook (cos-prechecks-for-worker-reboots) was removed from the docs-csm CSM 1.7.0 RPM. However, the corresponding Kubernetes IUF hook object created by the 1.6.x RPM can remain in the cluster.
The upgrade workflow template before-each-hooks lists and executes hook objects in the argo namespace:
kubectl get hooks -n argo -l before-each=true
When the obsolete IUF hook object is executed, the DVS NCN health check can fail and block the worker rebuild, causing IUF to hang on the last worker node.
If you encounter the issue during the upgrade, perform the following steps to recover and proceed:
Find and delete the cos-prechecks-for-worker-reboots IUF hook:
Refer to the note under Management Rollout for NCN worker nodes for details on how to manually check for and remove the cos-prechecks-for-worker-reboots IUF hook from the cluster.
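The check-and-remove step might look roughly like the sketch below. It is not the procedure from the referenced note: the function name delete_obsolete_hook is hypothetical, and the sketch assumes that the name of the leftover hook object contains the hook name shown in the IUF logs. Confirm the actual object name against the note before deleting anything.

```shell
# Hypothetical helper: delete any hook object in the argo namespace whose
# name contains the obsolete hook name.
# Assumption: the object name matches the hook name seen in the IUF logs.
delete_obsolete_hook() {
    kubectl get hooks -n argo -o name \
        | grep -F "cos-prechecks-for-worker-reboots" \
        | while IFS= read -r hook; do
              echo "Deleting ${hook}"
              # -o name yields TYPE/NAME, which kubectl delete accepts as-is
              kubectl delete -n argo "${hook}"
          done
}
```

Calling delete_obsolete_hook prints nothing when no matching object exists. Afterwards, kubectl get hooks -n argo -l before-each=true should no longer list the hook.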
If Argo shows a stuck workflow (e.g., upgrade-recipe-25-9-0-management-nodes-rollout), remove it:
kubectl -n argo get wf
kubectl -n argo delete wf upgrade-recipe-25-9-0-management-nodes-rollout
Alternatively, delete the workflow through the Argo UI.
Note: After the hook is deleted, the workflow may still attempt to run it and report NotFound. If the IUF process becomes unresponsive, interrupt it with Ctrl-C and force it to exit.
Label worker nodes that have already received the image so subsequent IUF runs skip them:
kubectl label node <node-name> iuf-prevent-rollout=true --overwrite
Example:
kubectl label nodes ncn-w002 ncn-w003 --overwrite iuf-prevent-rollout=true
kubectl get nodes --show-labels | grep iuf-prevent-rollout
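When several workers have already been upgraded, the labeling and verification can be wrapped together, as in the sketch below. The function name label_upgraded_workers is hypothetical. The verification uses a label selector instead of grep so that only nodes actually carrying iuf-prevent-rollout=true are listed.

```shell
# Hypothetical helper: label each named worker so subsequent IUF runs skip it,
# then list every node that currently carries the label.
label_upgraded_workers() {
    if [ "$#" -eq 0 ]; then
        echo "usage: label_upgraded_workers <node-name>..." >&2
        return 1
    fi
    kubectl label nodes "$@" iuf-prevent-rollout=true --overwrite || return 1
    # Selector-based check; prints entries of the form node/<node-name>
    kubectl get nodes -l iuf-prevent-rollout=true -o name
}
```

For example, label_upgraded_workers ncn-w002 ncn-w003 labels both nodes and then prints the labeled nodes for confirmation.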
Restart the IUF operation for the remaining worker nodes:
iuf -a "${ACTIVITY_NAME}" -m "${MEDIA_DIR}" run -r management-nodes-rollout --limit-management-rollout <worker>