Stage 4 - Ceph Upgrade

Important: Ceph does not need to be upgraded if this upgrade is staying within a CSM release, for example CSM-1.3.0-rc1 to CSM-1.3.0-rc2. In that case, Ceph is already running v16.2.9. See the instructions in Stage completed for next steps.

Reminder: If any problems are encountered and the procedure or command output does not provide relevant guidance, then see Relevant troubleshooting links for upgrade-related issues.

Ceph upgrade contents

The upgrade includes all fixes from v15.2.15 through v16.2.9. See the Ceph version index for details.

Start typescript

  1. (ncn-m002#) If a typescript session is already running in the shell, then first stop it with the exit command.

  2. (ncn-m002#) Start a typescript.

    script -af /root/csm_upgrade.$(date +%Y%m%d_%H%M%S).stage_4.txt
    export PS1='\u@\H \D{%Y-%m-%d} \t \w # '
    

If additional shells are opened during this procedure, then record those with typescripts as well. When resuming a procedure after a break, always be sure that a typescript is running before proceeding.

Procedure

  • This upgrade is performed using the cubs_tool.
  • The cubs_tool.py script can be found on ncn-s001, ncn-s002, and ncn-s003 in /srv/cray/scripts/common/.
  • Unless otherwise noted, all Ceph commands used in this stage may be run on any master node or any of the first three storage nodes (ncn-s001, ncn-s002, or ncn-s003). A quick preliminary check is shown after this list.
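
(ncn-s#) For example, a quick way to confirm that Ceph commands work from the chosen node, and that the cluster is healthy before starting, is to check the cluster status. Any health warnings should be understood before upgrading.

    ceph -s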

Initiate upgrade

  1. (ncn-s001#) Check to ensure that the upgrade is possible.

    /srv/cray/scripts/common/cubs_tool.py --version v16.2.9 --registry localhost
    

    Example output:

    Upgrade Available!!  The specified version v16.2.9 has been found in the registry
    

    Note: If the output does not match the expected output above, then a previous step may have failed. Review the output from Stage 1 for errors, or contact support.
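
    Optionally, confirm the version the cluster is currently running before upgrading. This is a supplementary check; the reported version depends on the CSM release being upgraded from.

    ceph version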

  2. (ncn-s001#) Start the upgrade.

    /srv/cray/scripts/common/cubs_tool.py --version v16.2.9 --registry localhost --upgrade
    

    Example output:

    Upgrade Available!!  The specified version v16.2.9 has been found in the registry
    Initiating Ceph upgrade from v16.2.7 to v16.2.9
    

    The source version in the output may vary, but the target version should match what is shown above.

    If this is an in-family upgrade and the Ceph upgrade was already completed during Stage 1, then the upgrade will not run again. In that case, the expected output is shown below.

    Your current version is the same as the proposed version 16.2.9
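
    The upgrade itself is carried out by the Ceph orchestrator, so its state can also be inspected directly at any point with the same command used in the troubleshooting section below:

    ceph orch upgrade status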
    
  3. Monitor the upgrade.

    The cubs_tool will automatically watch the upgrade. As services are upgraded, they will move from the Total Current column to the Total Upgraded column.

    +---------+---------------+----------------+
    | Service | Total Current | Total Upgraded |
    +---------+---------------+----------------+
    |   MGR   |       0       |       2        |
    |   MON   |       3       |       0        |
    |  Crash  |       3       |       0        |
    |   OSD   |       9       |       0        |
    |   MDS   |       3       |       0        |
    |   RGW   |       3       |       0        |
    +---------+---------------+----------------+
    

    The final result should have 0 for every service in the Total Current column.

    +---------+---------------+----------------+
    | Service | Total Current | Total Upgraded |
    +---------+---------------+----------------+
    |   MGR   |       0       |       3        |
    |   MON   |       0       |       3        |
    |  Crash  |       0       |       3        |
    |   OSD   |       0       |       9        |
    |   MDS   |       0       |       3        |
    |   RGW   |       0       |       3        |
    +---------+---------------+----------------+
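
    As an independent cross-check while the upgrade runs, the orchestrator can list every daemon together with the version it is currently running; any daemon still reporting the old version has not yet been restarted:

    ceph orch ps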
    
  4. (ncn-s001#) Verify that the upgrade completed successfully.

    /srv/cray/scripts/common/cubs_tool.py --report
    

    Expected output:

    +----------+-------------+-----------------+---------+---------+
    |   Host   | Daemon Type |        ID       | Version |  Status |
    +----------+-------------+-----------------+---------+---------+
    | ncn-s001 |     mgr     | ncn-s001.antgnu |  16.2.9 | running |
    | ncn-s002 |     mgr     | ncn-s002.jhwgup |  16.2.9 | running |
    | ncn-s003 |     mgr     | ncn-s003.wzoivk |  16.2.9 | running |
    +----------+-------------+-----------------+---------+---------+
    +----------+-------------+----------+---------+---------+
    |   Host   | Daemon Type |    ID    | Version |  Status |
    +----------+-------------+----------+---------+---------+
    | ncn-s001 |     mon     | ncn-s001 |  16.2.9 | running |
    | ncn-s002 |     mon     | ncn-s002 |  16.2.9 | running |
    | ncn-s003 |     mon     | ncn-s003 |  16.2.9 | running |
    +----------+-------------+----------+---------+---------+
    +----------+-------------+----------+---------+---------+
    |   Host   | Daemon Type |    ID    | Version |  Status |
    +----------+-------------+----------+---------+---------+
    | ncn-s001 |    crash    | ncn-s001 |  16.2.9 | running |
    | ncn-s002 |    crash    | ncn-s002 |  16.2.9 | running |
    | ncn-s003 |    crash    | ncn-s003 |  16.2.9 | running |
    +----------+-------------+----------+---------+---------+
    +----------+-------------+----+---------+---------+
    |   Host   | Daemon Type | ID | Version |  Status |
    +----------+-------------+----+---------+---------+
    | ncn-s001 |     osd     | 0  |  16.2.9 | running |
    | ncn-s001 |     osd     | 3  |  16.2.9 | running |
    | ncn-s001 |     osd     | 7  |  16.2.9 | running |
    | ncn-s002 |     osd     | 1  |  16.2.9 | running |
    | ncn-s002 |     osd     | 5  |  16.2.9 | running |
    | ncn-s002 |     osd     | 8  |  16.2.9 | running |
    | ncn-s003 |     osd     | 2  |  16.2.9 | running |
    | ncn-s003 |     osd     | 4  |  16.2.9 | running |
    | ncn-s003 |     osd     | 6  |  16.2.9 | running |
    +----------+-------------+----+---------+---------+
    +----------+-------------+------------------------+---------+---------+
    |   Host   | Daemon Type |           ID           | Version |  Status |
    +----------+-------------+------------------------+---------+---------+
    | ncn-s001 |     mds     | cephfs.ncn-s001.sbtjip |  16.2.9 | running |
    | ncn-s002 |     mds     | cephfs.ncn-s002.gywfal |  16.2.9 | running |
    | ncn-s003 |     mds     | cephfs.ncn-s003.emebxe |  16.2.9 | running |
    +----------+-------------+------------------------+---------+---------+
    +----------+-------------+-----------------------+---------+---------+
    |   Host   | Daemon Type |           ID          | Version |  Status |
    +----------+-------------+-----------------------+---------+---------+
    | ncn-s001 |     rgw     | site1.ncn-s001.rrfbvo |  16.2.9 | running |
    | ncn-s002 |     rgw     | site1.ncn-s002.axqnca |  16.2.9 | running |
    | ncn-s003 |     rgw     | site1.ncn-s003.pxhahp |  16.2.9 | running |
    +----------+-------------+-----------------------+---------+---------+
    

    NOTE: This output is an example only and shows only the core Ceph components.
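
    As an additional verification, the ceph versions command summarizes the version reported by every running daemon; after a successful upgrade, only v16.2.9 entries should remain:

    ceph versions

    Abbreviated example output (the build hash is elided here, and daemon counts vary by system):

    {
        "mon": {
            "ceph version 16.2.9 (...) pacific (stable)": 3
        },
        ...
        "overall": {
            "ceph version 16.2.9 (...) pacific (stable)": 24
        }
    }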

Diagnose a stalled upgrade

Each process running the Ceph container image goes through the upgrade: the old process is stopped, then restarted with the new v16.2.9 container image.

IMPORTANT: Only processes running the v15.2.15 image will be upgraded. This includes only the crash, mds, mgr, mon, osd, and rgw processes.
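
To see which image the Ceph processes on an individual node are actually running, the containers can be listed directly with podman on that node; the image column shows the tag each container was started from:

    podman ps | grep ceph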

UPGRADE_FAILED_PULL: Upgrade: failed to pull target image

If ceph -s shows a warning with UPGRADE_FAILED_PULL: Upgrade: failed to pull target image as the description, then perform the following procedure on any of the first three storage nodes (ncn-s001, ncn-s002, or ncn-s003).

  1. (ncn-s#) Check the upgrade status.

    ceph orch upgrade status
    

    Example output:

    {
        "target_image": "registry.local/artifactory.algol60.net/csm-docker/stable/quay.io/ceph/ceph:v16.2.9",
        "in_progress": true,
        "services_complete": [],
        "message": "Error: UPGRADE_FAILED_PULL: Upgrade: failed to pull target image"
    }
    
  2. (ncn-s#) Pause and resume the upgrade.

    ceph orch upgrade pause
    ceph orch upgrade resume
    
  3. (ncn-s#) Watch cephadm.

    This command watches the cephadm log stream. If the issue occurs again, then the log output will give more detail about which node is having the problem.

    ceph -W cephadm
    
  4. (ncn-s#) If the issue occurs again, then log into each of the storage nodes and perform a podman pull of the image.

    podman pull localhost/quay.io/ceph/ceph:v16.2.9
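
    After the pull succeeds, confirm that the image is present on the node:

    podman images | grep ceph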
    

If these steps do not resolve the issue, then contact support for further assistance.

Troubleshoot a failed upgrade

See Ceph Orchestrator Usage for additional usage and troubleshooting.
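
In addition, the following orchestrator commands are often useful when triaging a failed upgrade (a sketch; exact output varies by system):

    ceph health detail         # expanded descriptions of any active health warnings
    ceph orch upgrade status   # current upgrade state, including any error message
    ceph orch upgrade stop     # stop the orchestrator upgrade entirely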

Stop typescript

For any typescripts that were started during this stage, stop them with the exit command.

Stage completed

DO NOT proceed past this point if the upgrade has not completed and been verified. Contact support for in-depth troubleshooting.

This stage is completed. Proceed to Validate CSM health on the main upgrade page.