
Stage 2 - Ceph image upgrade

Reminder: If any problems are encountered and the procedure or command output does not provide relevant guidance, see Relevant troubleshooting links for upgrade-related issues.

Start typescript
Argo workflows
Storage node image upgrade and Ceph upgrade
Stop typescript
Stage completed

Start typescript

(ncn-m001#) If a typescript session is already running in the shell, then first stop it with the exit command.

(ncn-m001#) Start a typescript.

script -af /root/csm_upgrade.$(date +%Y%m%d_%H%M%S).stage_2.txt
export PS1='\u@\H \D{%Y-%m-%d} \t \w # '

If additional shells are opened during this procedure, then record those with typescripts as well. When resuming a procedure after a break, always be sure that a typescript is running before proceeding.
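One way to confirm that a typescript is running before resuming is to check whether the current shell descends from a `script` process. The following is only a sketch: the function name `in_typescript` is hypothetical, and it assumes a `ps` that accepts the `-o` and `-p` options (as on a typical Linux NCN).

```shell
# Hypothetical helper: return 0 if this shell has an ancestor process named
# "script", which is the case when it was started by the script command.
in_typescript() {
    _pid=$$
    while [ -n "$_pid" ] && [ "$_pid" -gt 1 ] 2>/dev/null; do
        _name=$(ps -o comm= -p "$_pid" 2>/dev/null)
        [ "$_name" = "script" ] && return 0
        # Move up to the parent process; stop if ps cannot report one.
        _pid=$(ps -o ppid= -p "$_pid" 2>/dev/null | tr -d ' ')
    done
    return 1
}

if in_typescript; then
    echo "A typescript appears to be running in this shell."
else
    echo "No typescript detected; start one before proceeding."
fi
```

The check only covers the shell it is run in, so it would need to be repeated in every additional shell.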

Argo workflows

Before starting the Storage node image upgrade and Ceph upgrade, access the Argo UI to view the progress of this stage. Note that the progress for the current stage will not show up in Argo before the storage node image upgrade script has been started.

For more information, see Using the Argo UI and Using Argo Workflows.

Storage node image upgrade and Ceph upgrade

(ncn-m001#) Run ncn-upgrade-worker-storage-nodes.sh with the --upgrade flag for all storage nodes to be upgraded. Provide the storage nodes in a comma-separated list, such as ncn-s001,ncn-s002,ncn-s003. This upgrades the storage nodes sequentially. Once all storage nodes have been upgraded, this workflow will upgrade Ceph to v17.2.6.

/usr/share/doc/csm/upgrade/scripts/upgrade/ncn-upgrade-worker-storage-nodes.sh ncn-s001,ncn-s002,ncn-s003 --upgrade

NOTE: It is possible to upgrade a single storage node at a time using the following command.

/usr/share/doc/csm/upgrade/scripts/upgrade/ncn-upgrade-worker-storage-nodes.sh ncn-s001 --upgrade
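Because the script takes the storage nodes as a single comma-separated argument, a stray space or trailing comma is an easy mistake to make. The sketch below checks the list format before the script is run. The function name `valid_storage_list` is hypothetical, and the pattern assumes that storage node names always follow the `ncn-s` plus three digits form used in the examples above.

```shell
# Hypothetical helper: succeed only if $1 is a comma-separated list of names
# shaped like ncn-s001 (assumed pattern: "ncn-s" followed by three digits).
valid_storage_list() {
    printf '%s\n' "$1" | grep -Eq '^ncn-s[0-9]{3}(,ncn-s[0-9]{3})*$'
}

NODES="ncn-s001,ncn-s002,ncn-s003"
if valid_storage_list "$NODES"; then
    echo "Node list looks well formed: $NODES"
else
    echo "Unexpected node list format: $NODES" >&2
fi
```

This only validates the shape of the argument; it does not confirm that the named nodes exist on the system.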

Storage node image upgrade troubleshooting

If the storage node upgrade is looping on the wait-for-ncn-s00X-health stage and Ceph is in a HEALTH_WARN state, this is likely not a problem, because Ceph needs time to recover after a node upgrade. Run ceph -s and check that the percentage reported next to Degraded data redundancy is decreasing. If the percentage is not decreasing, then continue with the troubleshooting steps below.
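To make the trend easier to see, the degraded percentage can be pulled out of the ceph -s output and printed at intervals. This is a minimal sketch: the function name `degraded_pct` is hypothetical, and the parsing assumes the health line has the form shown in the sample below (`... objects degraded (N%) ...`), which may differ between Ceph releases.

```shell
# Hypothetical helper: read `ceph -s` output on stdin and print the
# degraded-object percentage, or nothing if no such line is present.
degraded_pct() {
    sed -n 's/.*objects degraded (\([0-9.]*\)%).*/\1/p' | head -n 1
}

# Demonstration on a sample line (assumed format, not captured from a real system):
SAMPLE='Degraded data redundancy: 38022/253191 objects degraded (15.017%), 57 pgs degraded'
printf '%s\n' "$SAMPLE" | degraded_pct    # prints 15.017

# On a node with the ceph CLI, poll every 30 seconds until interrupted:
# while true; do ceph -s | degraded_pct; sleep 30; done
```

A steadily shrinking number suggests that recovery is progressing and the workflow only needs more time.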

The best troubleshooting tool for this stage is the Argo UI. See the Argo workflows section above for information about accessing the UI and using Argo Workflows.

If the upgrade is ‘waiting for Ceph HEALTH_OK’, the output from the commands ceph -s and ceph health detail should indicate why Ceph is not yet healthy.

If a crash has occurred, dumping the Ceph crash data will return Ceph to a healthy state and allow the upgrade to continue. The crash should still be evaluated to determine whether there is an underlying issue that needs to be addressed.
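As an illustration, the sketch below saves each unarchived crash report to a file for later evaluation and then archives the crashes so that they no longer affect the health status. It assumes that "dumping" here corresponds to the standard `ceph crash ls-new`, `ceph crash info`, and `ceph crash archive-all` subcommands. The function name, the output directory, and the assumption that the crash ID is the first column of the `ls-new` listing are not taken from this procedure; the storage troubleshooting documentation remains the authoritative source.

```shell
# Hypothetical helper: save details of every new (unarchived) Ceph crash to
# a directory, then archive all crashes.
dump_ceph_crashes() {
    _outdir="${1:-/root/ceph_crash_dumps}"   # assumed location; pick any directory
    mkdir -p "$_outdir" || return 1
    # Skip the header row and blank lines; take the first column as the crash ID.
    ceph crash ls-new | awk 'NF && $1 != "ID" {print $1}' | while read -r _id; do
        ceph crash info "$_id" > "$_outdir/$_id.json"
    done
    ceph crash archive-all
}

# Usage on a node with the ceph CLI:
# dump_ceph_crashes /root/ceph_crash_dumps
# ceph health detail
```

Saving the reports before archiving keeps the information needed to evaluate the crash after the upgrade continues.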

Refer to storage troubleshooting documentation for Ceph-related issues.

Refer to Troubleshoot Ceph image with tag:'<none>' if running podman images on a storage node shows an image with the tag <none>.

Refer to Storage node cloud-init fails with ‘Timed out waiting for device’ error if cloud-init is failing on the storage node.

Update ceph node-exporter config for SNMP counters

OPTIONAL: This step is not required.

This uses the netstat collector from node-exporter to enable monitoring of all the SNMP counters in /proc/net/snmp on NCNs.

See Update ceph node-exporter configuration to update the ceph node-exporter configuration to monitor SNMP counters.

Stop typescript

For any typescripts that were started during this stage, stop them with the exit command.

Stage completed

All the Ceph nodes have been rebooted into the new image.

This stage is completed. Continue to Stage 3.