Stage 0 - Prerequisites and Preflight Checks

Reminders:

Stage 0 consists of several critical procedures that prepare the environment and verify that it is ready for the upgrade.

Start typescript

  1. (ncn-m001#) If a typescript session is already running in the shell, then first stop it with the exit command.
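
    For example:

    exit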

  2. (ncn-m001#) Start a typescript.

    script -af /root/csm_upgrade.$(date +%Y%m%d_%H%M%S).stage_0.txt
    export PS1='\u@\H \D{%Y-%m-%d} \t \w # '
    

If additional shells are opened during this procedure, then record those with typescripts as well. When resuming a procedure after a break, always be sure that a typescript is running before proceeding.

Stage 0.1 - Prepare assets

  1. (ncn-m001#) Set the CSM_RELEASE variable to the target CSM version of this upgrade.

    If upgrading to a patch version of CSM, be sure to specify the correct patch version number when setting this variable.

    export CSM_RELEASE=1.4.0
    
  2. (ncn-m001#) Install the latest docs-csm and libcsm RPMs. These should be for the target CSM version of the upgrade, not the currently installed CSM version. See the short procedure in Check for latest documentation.
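
    For example, a minimal sketch assuming the RPM files for the target release have already been downloaded to the current directory (the file names below are illustrative; follow Check for latest documentation for the exact download and install steps):

    rpm -Uvh --force docs-csm-latest.noarch.rpm libcsm-latest.noarch.rpm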

  3. (ncn-m001#) Run the script to create a cephfs file share at /etc/cray/upgrade/csm.

    • This script creates a new cephfs file share and unmounts the rbd device that may have been used in a previous version of CSM (if detected). Running this script is a one-time step needed only on the master node from which the upgrade is initiated (ncn-m001). If a previous rbd mount is detected at /etc/cray/upgrade/csm, that content is remounted and made available at /mnt/csm-1.3-rbd.

      /usr/share/doc/csm/scripts/mount-cephfs-share.sh
      

      Expected output looks similar to the following:

      Found previous CSM release rbd mount, moving to /mnt/csm-1.3-rbd...
      Unmounting /etc/cray/upgrade/csm...
      Replacing /etc/cray/upgrade/csm with /mnt/csm-1.3-rbd in /etc/fstab...
      Mounting /mnt/csm-1.3-rbd to preserve previous upgrade content...
      Found s3fs mount at /var/lib/admin-tools, removing...
      Unmounting /var/lib/admin-tools...
      Removing /var/lib/admin-tools from /etc/fstab...
      Creating admin-tools ceph fs share...
      Sleeping for five seconds waiting for 3 running mds.admin-tools daemons...
      Sleeping for five seconds waiting for 3 running mds.admin-tools daemons...
      Sleeping for five seconds waiting for 3 running mds.admin-tools daemons...
      Found 3 running mds.admin-tools daemons -- continuing...
      Creating admin-tools keyring...
      [client.admin-tools]
         key = <REDACTED>
      export auth(key=<REDACTED>
      Adding fstab entry for cephfs share...
      Done! /etc/cray/upgrade/csm is mounted as a cephfs share!
      

    NOTE: The following steps are not part of the upgrade procedure; they describe how to access data from previous upgrades that is stored on an rbd device:

    • After completing the CSM upgrade, all master nodes will automatically mount the new cephfs file share at /etc/cray/upgrade/csm. The content from a previous rbd device is still available, and can be accessed by executing the following steps:

      mkdir -pv /mnt/csm-1.3-rbd
      rbd map csm_admin_pool/csm_scratch_img
      mount /dev/rbd0 /mnt/csm-1.3-rbd
      
    • If the previous upgrade's artifacts stored in the rbd mount are no longer needed, the following steps can be used to remove the underlying rbd pool:

      ceph config set mon mon_allow_pool_delete true
      ceph osd pool rm csm_admin_pool csm_admin_pool --yes-i-really-really-mean-it
      ceph config set mon mon_allow_pool_delete false
      
  4. Follow either the Direct download or Manual copy procedure.

    • If there is a URL for the CSM tar file that is accessible from ncn-m001, then the Direct download procedure may be used.
    • Alternatively, the Manual copy procedure may be used, which includes manually copying the CSM tar file to ncn-m001.

Direct download

  1. (ncn-m001#) Set the ENDPOINT variable to the URL of the directory containing the CSM release tar file.

    In other words, the full URL to the CSM release tar file must be ${ENDPOINT}/csm-${CSM_RELEASE}.tar.gz

    NOTE: This step is optional for Cray/HPE internal installs if ncn-m001 can reach the internet.

    ENDPOINT=https://put.the/url/here/
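
    Optionally, verify that the tar file URL is reachable before downloading. This is a hedged check using an HTTP HEAD request; some servers may not support it:

    curl -sSfI "${ENDPOINT}/csm-${CSM_RELEASE}.tar.gz"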
    
  2. (ncn-m001#) This step should ONLY be performed if an http proxy is required to access a public endpoint on the internet for the purpose of downloading artifacts. CSM does NOT support the use of proxy servers for anything other than downloading artifacts from external endpoints. The http proxy variables must be unset after the desired artifacts are downloaded; failing to unset them will cause many failures in subsequent steps.

    • Secured:

      export https_proxy=https://example.proxy.net:443
      
    • Unsecured:

      export http_proxy=http://example.proxy.net:80
      
  3. (ncn-m001#) Run the script. NOTE For Cray/HPE internal installs, if ncn-m001 can reach the internet, then the --endpoint argument may be omitted.

    The prepare-assets.sh script will delete the CSM tarball (after expanding it) in order to free up space. This behavior can be overridden by appending the --no-delete-tarball-file argument to the prepare-assets.sh command below.

    /usr/share/doc/csm/upgrade/scripts/upgrade/prepare-assets.sh --csm-version ${CSM_RELEASE} --endpoint "${ENDPOINT}"
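
    For example, to keep the tarball after it has been expanded:

    /usr/share/doc/csm/upgrade/scripts/upgrade/prepare-assets.sh --csm-version ${CSM_RELEASE} --endpoint "${ENDPOINT}" --no-delete-tarball-file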
    
  4. (ncn-m001#) This step must be performed if an http proxy was set previously.

    unset https_proxy
    
    unset http_proxy
    
  5. Skip the Manual copy subsection and proceed to Stage 0.2 - Prerequisites.

Manual copy

  1. Copy the CSM release tar file to ncn-m001.

    See Update Product Stream.
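
    As a hypothetical example, if the tar file is staged on an external host reachable from ncn-m001 over SSH (the host name and source path below are placeholders):

    scp user@external-host:/path/to/csm-${CSM_RELEASE}.tar.gz /root/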

  2. (ncn-m001#) Set the CSM_TAR_PATH variable to the full path to the CSM tar file on ncn-m001.

    CSM_TAR_PATH=/path/to/csm-${CSM_RELEASE}.tar.gz
    
  3. (ncn-m001#) Run the script.

    The prepare-assets.sh script will delete the CSM tarball (after expanding it) in order to free up space. This behavior can be overridden by appending the --no-delete-tarball-file argument to the prepare-assets.sh command below.

    /usr/share/doc/csm/upgrade/scripts/upgrade/prepare-assets.sh --csm-version ${CSM_RELEASE} --tarball-file "${CSM_TAR_PATH}"
    

Stage 0.2 - Prerequisites

  1. (ncn-m001#) Set the SW_ADMIN_PASSWORD environment variable.

    Set it to the password for the admin user on the switches. This is needed for preflight tests within the check script.

    NOTE: read -s is used to prevent the password from being written to the screen or the shell history.

    read -s SW_ADMIN_PASSWORD
    
    export SW_ADMIN_PASSWORD
    
  2. (ncn-m001#) Set the NEXUS_PASSWORD variable only if needed.

    IMPORTANT: If the password for the local Nexus admin account has been changed from the password set in the nexus-admin-credential secret (not typical), then set the NEXUS_PASSWORD environment variable to the correct admin password and export it, before running prerequisites.sh.

    For example:

    read -s is used to prevent the password from being written to the screen or the shell history.

    read -s NEXUS_PASSWORD
    
    export NEXUS_PASSWORD
    

    Otherwise, the upgrade will try to use the password in the nexus-admin-credential secret and fail to upgrade Nexus.
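
    To check the password currently stored in the nexus-admin-credential secret, a hedged example follows; it assumes the secret is in the nexus namespace and stores the value under the password key:

    kubectl -n nexus get secret nexus-admin-credential -o jsonpath='{.data.password}' | base64 -d; echo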

  3. (ncn-m001#) Run the script.

    /usr/share/doc/csm/upgrade/scripts/upgrade/prerequisites.sh --csm-version ${CSM_RELEASE}
    

    If the script ran correctly, it should end with the following output:

    [OK] - Successfully completed
    

    If the script does not end with this output, then try rerunning it. If it still fails, see Upgrade Troubleshooting. If the failure persists, then open a support ticket for guidance before proceeding.

  4. (ncn-m001#) Unset the NEXUS_PASSWORD variable, if it was set in the earlier step.

    unset NEXUS_PASSWORD
    
  5. (Optional) (ncn-m001#) Commit changes to customizations.yaml.

    customizations.yaml has been updated in this procedure. If using an external Git repository for managing customizations as recommended, then clone a local working tree and commit appropriate changes to customizations.yaml.

    For example:

    git clone <URL> site-init
    cd site-init
    kubectl -n loftsman get secret site-init -o jsonpath='{.data.customizations\.yaml}' | base64 -d - > customizations.yaml
    git add customizations.yaml
    git commit -m 'CSM 1.3 upgrade - customizations.yaml'
    git push
    
  6. (ncn-m001#) Run the Ceph latency repair script.

    Ceph can begin to exhibit latency over time after the cluster has been upgraded from previous versions. It is recommended to run the /usr/share/doc/csm/scripts/repair-ceph-latency.sh script, as described in Known Issue: Ceph OSD latency.
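
    For example, a minimal invocation (see the Known Issue page for any prerequisites or arguments):

    /usr/share/doc/csm/scripts/repair-ceph-latency.sh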

  7. If performing an upgrade of CSM and additional HPE Cray EX software products using the IUF, return to the Upgrade CSM and additional products with IUF procedure. Otherwise, if performing an upgrade of only CSM, proceed to Stage 0.3.

Stage 0.3 - Update management node CFS configuration and customize worker node image

This stage updates a CFS configuration used to perform node personalization and image customization of management nodes. It also applies that CFS configuration to the management nodes and customizes the worker node image, if necessary.

Image customization is the process of using Ansible stored in VCS in conjunction with the CFS and IMS microservices to customize an image before it is booted. Node personalization is the process of using Ansible stored in VCS in conjunction with the CFS and IMS microservices to personalize a node after it has booted.

There are several options for this stage. Use the option which applies to the current upgrade scenario.

Option 1: Upgrade of CSM and additional products

If performing an upgrade of CSM and additional HPE Cray EX software products, this stage should not be performed. Instead, follow the Upgrade CSM and additional products with IUF procedure, as described in the first option of the Upgrade CSM procedure, Option 1: Upgrade CSM with additional HPE Cray EX software products.

That procedure will perform the appropriate steps to create a CFS configuration for management nodes and perform management node image customization during the Image Preparation step.

Option 2: Upgrade of CSM on system with additional products

Use this alternative if performing an upgrade of only CSM on a system which has additional HPE Cray EX software products installed. This upgrade scenario is uncommon in production environments. Generally, if performing an upgrade of CSM, you will also be performing an upgrade of additional HPE Cray EX software products as part of an HPC CSM software recipe upgrade. In that case, follow the scenario described above for Upgrade of CSM and additional products.

The following subsection shows how to use IUF input files to perform sat bootprep operations, in this case to assign images and configurations to management nodes.

Using sat bootprep with IUF generated input files

In order to follow this procedure, you will need to know the name of the IUF activity used to perform the initial installation of the HPE Cray EX software products. See the Activities section of the IUF documentation for more information on IUF activities. See list-activities for information about listing the IUF activities on the system. The first step provides an example showing how to find the IUF activity.

  1. (ncn-m001#) Find the IUF activity used for the most recent install of the system.

    iuf list-activities
    

    This will output a list of IUF activity names. For example, if only a single install of the 24.01 recipe has been performed on this system, the output may show a single line like this:

    24.01-recipe-install
    
  2. (ncn-m001#) Record the most recent IUF activity name and directory in environment variables.

    export ACTIVITY_NAME=
    
    export ACTIVITY_DIR="/etc/cray/upgrade/csm/iuf/${ACTIVITY_NAME}"
    
  3. (ncn-m001#) Record the media directory used for this activity in an environment variable.

    export MEDIA_DIR="$(yq r "${ACTIVITY_DIR}/state/stage_hist.yaml" 'summary.media_dir')"
    echo "${MEDIA_DIR}"
    

    This should display a path to a media directory. For example:

    /etc/cray/upgrade/csm/media/24.01-recipe-install
    
  4. (ncn-m001#) Create a directory for the sat bootprep input files and the session_vars.yaml file.

    This example uses a directory under the RBD mount used by the IUF:

    export BOOTPREP_DIR="/etc/cray/upgrade/csm/admin/bootprep-csm-${CSM_RELEASE}"
    mkdir -pv "${BOOTPREP_DIR}"
    
  5. (ncn-m001#) Copy the sat bootprep input file for management nodes into the directory.

    It is possible that the file name will differ from management-bootprep.yaml if a different file was used during the IUF activity.
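
    If unsure which input files are present, list the hidden bootprep directory for the activity (the path matches the one used in the copy command below):

    ls "${MEDIA_DIR}/.bootprep-${ACTIVITY_NAME}/"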

    cp -pv "${MEDIA_DIR}/.bootprep-${ACTIVITY_NAME}/management-bootprep.yaml" "${BOOTPREP_DIR}"
    
  6. (ncn-m001#) Copy the session_vars.yaml file into the directory.

    cp -pv "${ACTIVITY_DIR}/state/session_vars.yaml" "${BOOTPREP_DIR}"
    
  7. (ncn-m001#) Modify the CSM version in the copied session_vars.yaml:

    yq w -i "${BOOTPREP_DIR}/session_vars.yaml" 'csm.version' "${CSM_RELEASE}"
    
  8. (ncn-m001#) Update the working_branch if one is used for the CSM product.

    By default, a working_branch is not used for the CSM product. Check if there is a working_branch specified for CSM:

    yq r "${BOOTPREP_DIR}/session_vars.yaml" 'csm.working_branch'
    

    If this produces no output, a working_branch is not in use for the CSM product, and this step can be skipped. Otherwise, it shows the name of the working branch. For example:

    integration-1.4.0
    

    In this case, be sure to manually update the version string in the working branch name to match the new CSM release, and then check it again. For example:

    yq w -i "${BOOTPREP_DIR}/session_vars.yaml" 'csm.working_branch' "integration-${CSM_RELEASE}"
    yq r "${BOOTPREP_DIR}/session_vars.yaml" 'csm.working_branch'
    

    This should output the name of the new CSM working branch.

  9. (ncn-m001#) Modify the default.suffix value in the copied session_vars.yaml:

    As long as the sat bootprep input file uses {{default.suffix}} in the names of the CFS configurations and IMS images, this will ensure new CFS configurations and IMS images are created with different names from the ones created in the IUF activity.

    yq w -i -- "${BOOTPREP_DIR}/session_vars.yaml" 'default.suffix' "-csm-${CSM_RELEASE}"
    
  10. (ncn-m001#) Change directory to the BOOTPREP_DIR and run sat bootprep.

    This will create a CFS configuration for management nodes, and it will use that CFS configuration to customize the images for the master, worker, and storage management nodes.

    cd "${BOOTPREP_DIR}"
    sat bootprep run --vars-file session_vars.yaml management-bootprep.yaml
    
  11. (ncn-m001#) Gather the CFS configuration name and the IMS image names from the output of sat bootprep.

    sat bootprep will print a report summarizing the CFS configuration and IMS images it created. For example:

    ################################################################################
    CFS configurations
    ################################################################################
    +-----------------------------+
    | name                        |
    +-----------------------------+
    | management-22.4.0-csm-x.y.z |
    +-----------------------------+
    ################################################################################
    IMS images
    ################################################################################
    +-----------------------------+--------------------------------------+--------------------------------------+-----------------------------+----------------------------+
    | name                        | preconfigured_image_id               | final_image_id                       | configuration               | configuration_group_names  |
    +-----------------------------+--------------------------------------+--------------------------------------+-----------------------------+----------------------------+
    | master-secure-kubernetes    | c1bcaf00-109d-470f-b665-e7b37dedb62f | a22fb912-22be-449b-a51b-081af2d7aff6 | management-22.4.0-csm-x.y.z | Management_Master          |
    | worker-secure-kubernetes    | 8b1343c4-1c39-4389-96cb-ccb2b7fb4305 | 241822c3-c7dd-44f8-98ca-0e7c7c6426d5 | management-22.4.0-csm-x.y.z | Management_Worker          |
    | storage-secure-storage-ceph | f3dd7492-c4e5-4bb2-9f6f-8cfc9f60526c | 79ab3d85-274d-4d01-9e2b-7c25f7e108ca | storage-22.4.0-csm-x.y.z    | Management_Storage         |
    +-----------------------------+--------------------------------------+--------------------------------------+-----------------------------+----------------------------+
    
    1. Save the names of the CFS configurations from the configuration column:

      Note that the storage node configuration might be titled minimal-management- or storage- depending on the value set in the sat bootprep file.

      The following uses the values from the example output above. Be sure to modify them to match the actual values.

      export KUBERNETES_CFS_CONFIG_NAME="management-22.4.0-csm-x.y.z"
      export STORAGE_CFS_CONFIG_NAME="storage-22.4.0-csm-x.y.z"
      
    2. Save the names of the IMS images from the final_image_id column:

      The following uses the values from the example output above. Be sure to modify them to match the actual values.

      export MASTER_IMAGE_ID="a22fb912-22be-449b-a51b-081af2d7aff6"
      export WORKER_IMAGE_ID="241822c3-c7dd-44f8-98ca-0e7c7c6426d5"
      export STORAGE_IMAGE_ID="79ab3d85-274d-4d01-9e2b-7c25f7e108ca"
      
  12. (ncn-m001#) Assign the images to the management nodes in BSS.

    • Master management nodes:

      /usr/share/doc/csm/scripts/operations/node_management/assign-ncn-images.sh -m -p "$MASTER_IMAGE_ID"
      
    • Storage management nodes:

      /usr/share/doc/csm/scripts/operations/node_management/assign-ncn-images.sh -s -p "$STORAGE_IMAGE_ID"
      
    • Worker management nodes:

      /usr/share/doc/csm/scripts/operations/node_management/assign-ncn-images.sh -w -p "$WORKER_IMAGE_ID"
      
  13. (ncn-m001#) Assign the CFS configuration to the management nodes.

    This deliberately only sets the desired configuration of the components in CFS. It disables the components and does not clear their configuration states or error counts. When the nodes are rebooted to their new images later in the CSM upgrade, they will automatically be enabled in CFS, and node personalization will occur.
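    An optional verification example follows these sub-steps.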

    1. Get the xnames of the master and worker management nodes.

      WORKER_XNAMES=$(cray hsm state components list --role Management --subrole Worker --type Node --format json |
          jq -r '.Components | map(.ID) | join(",")')
      MASTER_XNAMES=$(cray hsm state components list --role Management --subrole Master --type Node --format json |
          jq -r '.Components | map(.ID) | join(",")')
      echo "${MASTER_XNAMES},${WORKER_XNAMES}"
      
    2. Apply the CFS configuration to master nodes and worker nodes using the xnames and CFS configuration name found in the previous steps.

      /usr/share/doc/csm/scripts/operations/configuration/apply_csm_configuration.sh \
          --no-config-change --config-name "${KUBERNETES_CFS_CONFIG_NAME}" --no-enable --no-clear-err \
          --xnames ${MASTER_XNAMES},${WORKER_XNAMES}
      

      Successful output will end with the following:

      All components updated successfully.
      
    3. Get the xnames of the storage management nodes.

      STORAGE_XNAMES=$(cray hsm state components list --role Management --subrole Storage --type Node --format json |
          jq -r '.Components | map(.ID) | join(",")')
      echo $STORAGE_XNAMES
      
    4. Apply the CFS configuration to storage nodes using the xnames and CFS configuration name found in the previous steps.

      /usr/share/doc/csm/scripts/operations/configuration/apply_csm_configuration.sh \
          --no-config-change --config-name "${STORAGE_CFS_CONFIG_NAME}" --no-enable --no-clear-err \
          --xnames ${STORAGE_XNAMES}
      

      Successful output will end with the following:

      All components updated successfully.
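
    Optionally, spot-check a single component to confirm the assignment. This is a hedged sketch; the xname below is a placeholder, and the exact field names in the CFS component record (such as desiredConfig and enabled) may vary by CFS version:

    cray cfs components describe x3000c0s1b0n0 --format json | jq '{desiredConfig, enabled, errorCount}'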
      

Continue on to Stage 0.4.

Option 3: Upgrade of CSM on CSM-only system

Use this alternative if performing an upgrade of CSM on a CSM-only system with no other HPE Cray EX software products installed. This upgrade scenario is extremely uncommon in production environments.

  1. (ncn-m001#) Generate a new CFS configuration for the management nodes.

    This script creates a new CFS configuration that includes the CSM version in its name and applies it to the management nodes. This leaves the management node components in CFS disabled. They will be automatically enabled when they are rebooted at a later stage in the upgrade.

    /usr/share/doc/csm/scripts/operations/configuration/apply_csm_configuration.sh \
        --no-enable --config-name management-${CSM_RELEASE}
    

    Successful output should end with the following line:

    All components updated successfully.
    

Continue on to Stage 0.4.

Stage 0.4 - Backup workload manager data

To prevent any possibility of losing workload manager configuration data or files, a backup is required. Execute all backup procedures (for the workload manager in use) located in the Troubleshooting and Administrative Tasks sub-section of the Install a Workload Manager section of the HPE Cray Programming Environment Installation Guide: CSM on HPE Cray EX. The resulting backup data should be stored in a safe location off the system.

If performing an upgrade of CSM and additional HPE Cray EX software products using the IUF, return to the Upgrade CSM and additional products with IUF procedure. Otherwise, if performing an upgrade of only CSM, proceed to the next step.

CSM v1.4.x -> CSM v1.4.4 patch: If you arrived here by following the CSM v1.4.x -> CSM v1.4.4 patch directions, then move on to Storage nodes in-place update. Users that arrived here while upgrading from CSM 1.3.x or earlier should continue to Stage 0.5.

Stage 0.5 - Upgrade Ceph and stop local Docker registries

IMPORTANT If performing an upgrade to CSM 1.4.0 or 1.4.1, then skip this step. This step should only be done during an upgrade to CSM 1.4 patch version 1.4.2 or later.

Note: This step may not be necessary if it was already completed by the CSM v1.3.5 patch. If it was already run, the following steps can be re-executed to verify that Ceph daemons are using images in Nexus and the local Docker registries have been stopped.

These steps will upgrade Ceph to v16.2.13. Then the Ceph monitoring daemons’ images will be pushed to Nexus and the monitoring daemons will be redeployed so that they use these images in Nexus. Once this is complete, all Ceph daemons should be using images in Nexus and not images hosted in the local Docker registry on storage nodes. The third step stops the local Docker registry on all storage nodes.
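
To verify which container images the Ceph daemons are using, a hedged check follows; it assumes the ceph orch ps JSON output includes a container_image_name field and that images served from Nexus reference the Nexus registry DNS name rather than a storage node's local Docker registry:

  ssh ncn-s001 "ceph orch ps --format json" | jq -r '.[].container_image_name' | sort -u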

  1. (ncn-m001#) Run Ceph upgrade to v16.2.13.

    /usr/share/doc/csm/upgrade/scripts/ceph/ceph-upgrade-tool.py --version "v16.2.13"
    
  2. (ncn-m001#) Redeploy Ceph monitoring daemons so they are using images in Nexus.

    scp /usr/share/doc/csm/scripts/operations/ceph/redeploy_monitoring_stack_to_nexus.sh ncn-s001:/srv/cray/scripts/common/redeploy_monitoring_stack_to_nexus.sh
    ssh ncn-s001 "/srv/cray/scripts/common/redeploy_monitoring_stack_to_nexus.sh"
    
  3. (ncn-m001#) Stop the local Docker registries on all storage nodes.

    scp /usr/share/doc/csm/scripts/operations/ceph/disable_local_registry.sh ncn-s001:/srv/cray/scripts/common/disable_local_registry.sh
    ssh ncn-s001 "/srv/cray/scripts/common/disable_local_registry.sh"
    

Stage 0.6 - Enable Smartmon Metrics on Storage NCNs

IMPORTANT If performing an upgrade to CSM 1.4.0 or 1.4.1, then skip this step. This step should only be done during an upgrade to CSM 1.4 patch version 1.4.2 or later.

This step installs the smart-mon RPM on storage nodes and reconfigures node-exporter to provide smartmon metrics.

  1. (ncn-m001#) Execute the following script.

    /usr/share/doc/csm/scripts/operations/ceph/enable-smart-mon-storage-nodes.sh
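
    To confirm that smartmon metrics are being exposed afterward, a hedged check (this assumes node-exporter listens on its default port 9100 on the storage nodes and that the metric names contain "smartmon"):

    curl -s http://ncn-s001:9100/metrics | grep -i smartmon | head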
    

Overwrite default boot timeout

If you need to adjust the default boot timeout (10 minutes), add REBOOT_TIMEOUT_IN_SECONDS to /etc/cray/upgrade/csm/myenv.

For example:

export CSM_ARTI_DIR=/etc/cray/upgrade/csm/csm-1.4.1/tarball/csm-1.4.1
export CSM_RELEASE=1.4.1
export CSM_REL_NAME=csm-1.4.1

...

REBOOT_TIMEOUT_IN_SECONDS=999

Stop typescript

For any typescripts that were started during this stage, stop them with the exit command.

Stage completed

This stage is completed. Continue to Stage 1 - Kubernetes Upgrade.