management-nodes-rollout failed
When upgrading master nodes with IUF during the CSM 1.6.0 upgrade, the IUF CLI may report that the master node upgrade failed.
The specific error is: The management-nodes-rollout stage failed, but argo must run to the completion of the stage.
This is a bug in the IUF CLI. It occurs because the IUF CLI is momentarily unable to connect to the cray-nls service while master nodes are upgrading.
The IUF CLI erroneously reports that the management-nodes-rollout stage failed even though management-nodes-rollout is likely continuing successfully in the background.
This problem has been resolved in CSM 1.6.1.
The following error can be seen when upgrading master nodes with IUF.
INFO [backup-m ] BEG backup-m001
INFO [backup-m ] BEG backup-m001(0)
INFO [backup-m ] END backup-m001 [Succeeded]
INFO [backup-m ] END backup-m001(0) [Succeeded]
INFO [upgrade-m ] BEG upgrade-m001
INFO [upgrade-m ] BEG upgrade-m001(0)
ERR Workflow activity1-wjtdt-management-nodes-rollout-bf9nx not found.
INFO [STAGE: management-nodes-rollout ] END Unknown in 0:03:18
WARN The management-nodes-rollout stage failed, but argo must run to the completion of the stage.
WARN Still waiting for workflow startup after 15 seconds.
WARN Still waiting for workflow startup after 30 seconds.
[...]
ERR Giving up after 10 minutes. Check to ensure the ARGO backend is functional then try again.
Error Summary:
Workflow activity1-wjtdt-management-nodes-rollout-bf9nx not found.
Giving up after 10 minutes. Check to ensure the ARGO backend is functional then try again.
Error Summary:
Giving up after 10 minutes. Check to ensure the ARGO backend is functional then try again.
The observed error is a false negative: the stage is reported as failed although it is likely still running. No action is needed other than manually monitoring the management-nodes-rollout Argo workflow to confirm that it continues successfully in the background.
The IUF workflow can be monitored in either of the following two ways.
Use the Argo UI to monitor the workflow. See Using the Argo UI for details on accessing the UI.
(ncn-m001#) Monitor the logs of the pod running the management-nodes-rollout Argo workflow.
Get the pods in the argo namespace and grep for 'management'.
kubectl get pods -n argo | grep management
Follow the logs for the pod running the management-nodes-rollout stage.
kubectl logs -n argo <pod_name> -f
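The two commands above can be combined so that the pod name does not need to be copied by hand. The following is a minimal sketch, not part of the documented procedure. The function names find_rollout_pods and follow_rollout_logs are hypothetical, and the sketch assumes that the first pod whose listing matches 'management' is the one of interest.

```shell
# Sketch only: function names are hypothetical, not part of CSM or IUF.

# Print the names of pods in the argo namespace whose listing line
# matches "management" (the same match as the grep shown above).
find_rollout_pods() {
    kubectl get pods -n argo --no-headers | awk '/management/ {print $1}'
}

# Follow the logs of the first matching pod, or report that none was found.
# Assumption: the first match is the pod of interest; if several pods
# match, pick the correct one manually instead.
follow_rollout_logs() {
    rollout_pod="$(find_rollout_pods | head -n 1)"
    if [ -z "$rollout_pod" ]; then
        echo "No pod matching 'management' found in the argo namespace" >&2
        return 1
    fi
    kubectl logs -n argo "$rollout_pod" -f
}
```

After defining both functions in a shell on ncn-m001, running follow_rollout_logs streams the logs until interrupted. If more than one pod matches, run find_rollout_pods first and pass the chosen name to kubectl logs directly.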
Continue monitoring the management-nodes-rollout workflow until it completes successfully.
If it fails, investigate the logs from the sources above to find the error.
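As an alternative to watching pod logs, the workflow's overall phase can be polled. The sketch below is an illustration under stated assumptions rather than a documented step: it assumes the Argo Workflow resource lives in the argo namespace and that its name is the one printed by the IUF CLI (activity1-wjtdt-management-nodes-rollout-bf9nx in the example output). The function names and the 30 second default interval are hypothetical choices.

```shell
# Sketch only: function names and the default interval are hypothetical.

# Print the phase of an Argo workflow (for example Running, Succeeded, Failed).
# Assumption: the workflow resource is in the argo namespace.
workflow_phase() {
    kubectl get workflow "$1" -n argo -o jsonpath='{.status.phase}'
}

# Poll until the workflow reaches a final phase.
# $1: workflow name, $2: seconds between polls (default 30).
# A failed kubectl call is treated as "not finished yet" and retried,
# because the API can be briefly unreachable while a master node upgrades.
# A wrong workflow name therefore loops forever; interrupt with Ctrl-C.
wait_for_workflow() {
    wf_name="$1"
    wf_interval="${2:-30}"
    while :; do
        wf_phase="$(workflow_phase "$wf_name" 2>/dev/null)" || wf_phase=""
        case "$wf_phase" in
            Succeeded) echo "$wf_name: Succeeded"; return 0 ;;
            Failed|Error) echo "$wf_name: $wf_phase" >&2; return 1 ;;
            *) sleep "$wf_interval" ;;
        esac
    done
}
```

For example, wait_for_workflow activity1-wjtdt-management-nodes-rollout-bf9nx returns 0 once the workflow reports Succeeded and 1 if it reports Failed or Error, at which point the logs described above are the place to look.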
Additionally, a master node upgrade prints its upgrade output to /root/output.log.
This file is on ncn-m001 when ncn-m002 or ncn-m003 is being upgraded, and on ncn-m002 when ncn-m001 is being upgraded.
This file can be used to debug a failed master node upgrade.
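The location rule above can be written as a small lookup. This is an illustrative sketch; the function name output_log_host is hypothetical and is not provided by CSM.

```shell
# Sketch only: output_log_host is a hypothetical helper name.

# Print the master node that holds /root/output.log while the node
# given as $1 is being upgraded.
output_log_host() {
    case "$1" in
        ncn-m001) echo "ncn-m002" ;;
        ncn-m002|ncn-m003) echo "ncn-m001" ;;
        *) echo "not a known master node: $1" >&2; return 1 ;;
    esac
}
```

For example, output_log_host ncn-m003 prints ncn-m001, which is the node to log in to in order to read /root/output.log while ncn-m003 is upgrading.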