During the CSM upgrade, IUF warns that multiple sessions are in progress for an activity.
The next stage for the activity does not run because of this warning.
This issue is seen after the pre-install-check stage or the management-nodes-rollout stage of the IUF run.
As a result, the session associated with the activity remains in the "in_progress" state even after the workflow associated with the stage has completed successfully.
When the issue occurs, iuf-cli emits the following messages:
INFO All logs will be stored in /etc/cray/upgrade/csm/iuf/update-csm-1.6.0/log/20241021025621
INFO [ACTIVITY: update-csm-1.6.0 ] BEG Install started at 2024-10-21 02:56:21.778284
INFO Neither --recipe-vars nor --bootprep-config-dir were specified, so
product_vars.yaml will be pulled from the branch
cray/hpc-csm-software-recipe/25.1.0-alpha-20241019174014-8f492eb of the
hpc-csm-software-recipe git repo.
INFO [IUF SESSION: ] BEG Started at 2024-10-21 02:56:30.375445
WARN multiple sessions found. Taking the first one...
INFO [IUF SESSION: ] END Completed at 2024-10-21 02:56:30.568155
INFO [ACTIVITY: update-csm-1.6.0 ] END Completed in 0:00:08
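To confirm that a given run hit this issue, the IUF log directory can be searched for the warning text. The following is a minimal sketch; the function name `find_multi_session_warnings` is hypothetical, and the log directory is taken from the "All logs will be stored in" line of the output above.

```shell
#!/bin/sh
# Hypothetical helper: list every file under a log directory that
# contains the "multiple sessions found" warning emitted by iuf-cli.
find_multi_session_warnings() {
    log_dir="$1"
    # -r searches recursively, -l prints only the names of matching files.
    grep -rl "multiple sessions found" "$log_dir"
}

# Example usage (path copied from the sample output above):
#   find_multi_session_warnings /etc/cray/upgrade/csm/iuf/update-csm-1.6.0/log
```

If the function prints no file names, the warning was not logged under that directory.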
There is a race condition in cray-nls that is hit when multiple cray-nls pods start at the same time.
This can happen during a cray-nls chart upgrade and, occasionally, when a node hosting multiple cray-nls pods is drained, which causes those pods to start simultaneously on another node.
Identify the session of the previous stage that ran successfully for the current activity. The session name appears in the log output of that stage, for example:
2024-10-21T01:40:09.731277Z INFO [IUF SESSION: update-csm-1-6-0-h0y63 ] BEG Started at 2024-10-21 01:40:09.731167
2024-10-21T01:40:13.585718Z DBG Next workflow update-csm-1-6-0-h0y63-management-nodes-rollout-wnbpb
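If the log of the earlier stage is long, the session names can be pulled out of it rather than read by eye. The following is a sketch only; the function name `extract_iuf_sessions` is hypothetical, and it assumes the log lines follow the `[IUF SESSION: <name> ]` layout shown above.

```shell
#!/bin/sh
# Hypothetical helper: read IUF log lines on stdin and print each
# distinct, non-empty session name found in "[IUF SESSION: <name> ]".
extract_iuf_sessions() {
    # Capture the run of characters after "IUF SESSION: " up to the next
    # space or "]". Lines with an empty session name do not match.
    sed -n 's/.*\[IUF SESSION: \([^] ][^] ]*\) *\].*/\1/p' | sort -u
}

# Example usage (the log file path is a placeholder):
#   extract_iuf_sessions < <path_to_iuf_log_file>
```

Lines such as `[IUF SESSION: ] BEG`, where no session name is present, are skipped.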
(ncn-mw#) Find the ConfigMap associated with the session from the previous step in the argo namespace.
kubectl get cm -n argo --selector type=iuf_session | grep <session_name>
(ncn-mw#) Make a backup of the ConfigMap because it will be edited in the next step.
kubectl get cm -n argo <session_name> -o yaml > <session_name>_cm_backup.yaml
(ncn-mw#) Edit the ConfigMap and change "current_state" to "completed" if its value is "in_progress".
kubectl edit configmap -n argo <session_name> -o json
Re-run the workflow using the same IUF command.
With the previous session set to "completed", the multiple sessions warning should not be seen and the workflow should run as expected.
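Before re-running the IUF command, the edit can be double-checked by dumping the ConfigMap to a file again and looking at the state value. This is a sketch under assumptions: the function name `show_current_state` is hypothetical, and it assumes the "current_state" field appears as text inside the dumped ConfigMap, which should be verified against the backup file taken earlier.

```shell
#!/bin/sh
# Hypothetical helper: print every "current_state" entry found in a
# saved ConfigMap dump, up to the next comma or closing brace.
show_current_state() {
    grep -o 'current_state[^,}]*' "$1"
}

# Example usage, reusing the kubectl command form from the backup step
# (the output file name is a placeholder):
#   kubectl get cm -n argo <session_name> -o yaml > <session_name>_cm_after.yaml
#   show_current_state <session_name>_cm_after.yaml
```

Running the same function against the backup file shows the value before the edit, which makes it easy to compare the two.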