During a CSM upgrade, IUF may report that multiple sessions are in progress for an activity. Because of this error, the next stage for the activity does not run. The issue has been seen after the pre-install-check stage or the management-nodes-rollout stage of an iuf run.
The issue leaves the session associated with the activity in the “in progress” state even after the workflow associated with the stage has completed successfully.
When the issue occurs, the iuf CLI emits output similar to the following:
iuf -a "${ACTIVITY_NAME}" run -r management-nodes-rollout --limit-management-rollout ${WORKER_CANARY}
INFO All logs will be stored in /etc/cray/upgrade/csm/iuf/update-csm-1.6.0/log/20241021025621
INFO [ACTIVITY: update-csm-1.6.0 ] BEG Install started at 2024-10-21 02:56:21.778284
INFO Neither --recipe-vars nor --bootprep-config-dir were specified, so
product_vars.yaml will be pulled from the branch
cray/hpc-csm-software-recipe/25.1.0-alpha-20241019174014-8f492eb of the
hpc-csm-software-recipe git repo.
INFO [IUF SESSION: ] BEG Started at 2024-10-21 02:56:30.375445
WARN multiple sessions found. Taking the first one...
INFO [IUF SESSION: ] END Completed at 2024-10-21 02:56:30.568155
INFO [ACTIVITY: update-csm-1.6.0 ] END Completed in 0:00:08
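To confirm that a run hit this issue, the captured output can be searched for the warning shown above. The following is a minimal sketch: the helper name has_multiple_sessions_warning is hypothetical, and the name of the log file inside the log directory printed by iuf is not given here, so the path in the usage comment is a placeholder.

```shell
# Hypothetical helper: succeeds if the given files (or stdin, when no file is
# given) contain the "multiple sessions found" warning.
has_multiple_sessions_warning() {
    grep -q 'multiple sessions found' "$@"
}

# Usage (placeholder path; use a file from the log directory printed by iuf):
#   has_multiple_sessions_warning <log_file> && echo "multiple sessions issue detected"
```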
There is a race condition in cray-nls that is hit when multiple cray-nls pods are starting at the same time. This happens during a cray-nls chart upgrade and sometimes when a node with multiple cray-nls pods is drained, which causes these pods to start simultaneously on another node.
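Step 1: Identify the session of the previous stage that ran successfully for the activity. The session name appears after “IUF SESSION:” in the log, for example update-csm-1-6-0-h0y63 in the lines below.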
2024-10-21T01:40:09.731277Z INFO [IUF SESSION: update-csm-1-6-0-h0y63 ] BEG Started at 2024-10-21 01:40:09.731167
2024-10-21T01:40:13.585718Z DBG Next workflow update-csm-1-6-0-h0y63-management-nodes-rollout-wnbpb
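The session name can also be pulled out of saved log text rather than read by eye. This is a rough sketch that assumes the log lines have the “[IUF SESSION: <name> ]” shape shown above; the helper name list_iuf_sessions is hypothetical. Lines with an empty session name, as in the failing run, are skipped.

```shell
# Hypothetical helper: print the unique, non-empty session names found in
# IUF log text read from the given files (or stdin when no file is given).
list_iuf_sessions() {
    # Match "IUF SESSION:" followed by a name made of characters other than
    # a space or "]", then strip everything up to and including the label.
    grep -o 'IUF SESSION: *[^] ]\{1,\}' "$@" \
        | sed 's/.*IUF SESSION: *//' \
        | sort -u
}

# Usage (placeholder path):
#   list_iuf_sessions <log_file>
```

If more than one name is printed, the session for the previous successful stage still has to be chosen by comparing timestamps in the log.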
Step 2: Find the configmap associated with the session from the previous step in the argo namespace.
kubectl get cm -n argo --selector type=iuf_session | grep <session_name>
Step 3: Make a backup of the configmap since it will be edited in the next step.
kubectl get cm -n argo <session_name> -o yaml > <session_name>_cm_backup.yaml
Step 4: If “current_state” is “in_progress”, edit the configmap and change “current_state” to “completed”.
kubectl edit configmap -n argo <session_name> -o json
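Before and after the edit, the state can be checked without opening an editor. The sketch below only searches the configmap text for “current_state”; it does not assume a particular key layout, but it does assume the value is a lowercase word such as in_progress or completed, and it tolerates escaped quotes in case the session data is stored as an embedded JSON string. The helper name show_current_state is hypothetical.

```shell
# Hypothetical helper: print every current_state value found in the text
# read from stdin. Handles plain JSON, YAML, and JSON embedded as an
# escaped string (backslash-escaped quotes).
show_current_state() {
    grep -o 'current_state[\\" :]*[a-z_]*' | sed 's/.*[\\" :]//'
}

# Usage:
#   kubectl get cm -n argo <session_name> -o json | show_current_state
```

After the edit in Step 4 is saved, this should print completed for the session.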