Stage 0 - Prerequisites and Preflight Checks

Reminders:

Stage 0 has several critical procedures which prepare the environment and verify that it is ready for the upgrade.

Start typescript

  1. (ncn-m001#) If a typescript session is already running in the shell, then first stop it with the exit command.

  2. (ncn-m001#) Start a typescript.

    script -af /root/csm_upgrade.$(date +%Y%m%d_%H%M%S).stage_0.txt
    export PS1='\u@\H \D{%Y-%m-%d} \t \w # '
    

If additional shells are opened during this procedure, then record those with typescripts as well. When resuming a procedure after a break, always be sure that a typescript is running before proceeding.
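
When several shells are in use, a small helper can keep the typescript file names consistent. The following is a minimal sketch: the function name typescript_path and its stage-label argument are illustrative only and are not part of CSM. It reproduces the file naming pattern of the script command shown above.

```shell
# Hypothetical helper (not part of CSM): print a typescript file name that
# follows the naming pattern used in this procedure.
# Usage: typescript_path <stage_label>
typescript_path() {
    printf '/root/csm_upgrade.%s.%s.txt\n' "$(date +%Y%m%d_%H%M%S)" "$1"
}

# Show the file name that would be used for a Stage 0 typescript.
typescript_path stage_0
```

In an additional shell, recording could then be started with:

    script -af "$(typescript_path stage_0)"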

Stage 0.1 - Prepare assets

  1. (ncn-m001#) Set the CSM_RELEASE variable to the target CSM version of this upgrade, and derive the CSM_REL_NAME variable from it. Both variables are used in later steps.

    CSM_RELEASE=1.3.0
    CSM_REL_NAME=csm-${CSM_RELEASE}
    
  2. Acquire the latest documentation and library RPMs for the target version of the CSM upgrade.

    These may include updates, corrections, and enhancements that were not available until after the software release.

    NOTE: CSM does NOT support the use of proxy servers for anything other than downloading artifacts from external endpoints. Using http proxies in any way other than the following examples will cause many failures in subsequent steps.

    1. Check the version of the currently installed CSM documentation and CSM library.

      rpm -q docs-csm
      
    2. Download and upgrade the latest documentation RPM and CSM library.

      • Without proxy:

        wget "https://release.algol60.net/$(awk -F. '{print "csm-"$1"."$2}' <<< ${CSM_RELEASE})/docs-csm/docs-csm-latest.noarch.rpm" -O /root/docs-csm-latest.noarch.rpm
        
      • With https proxy:

        https_proxy=https://example.proxy.net:443 wget "https://release.algol60.net/$(awk -F. '{print "csm-"$1"."$2}' <<< ${CSM_RELEASE})/docs-csm/docs-csm-latest.noarch.rpm" \
            -O /root/docs-csm-latest.noarch.rpm
        
      • If this machine does not have direct internet access, then download the RPM on a machine that does have internet access and copy it to the system.

        curl -O "https://release.algol60.net/$(awk -F. '{print "csm-"$1"."$2}' <<< ${CSM_RELEASE})/docs-csm/docs-csm-latest.noarch.rpm"
        scp docs-csm-latest.noarch.rpm ncn-m001:/root
        ssh ncn-m001
        
    3. Install the documentation RPM.

      rpm -Uvh --force /root/docs-csm-latest.noarch.rpm
      
  3. (ncn-m001#) Create and mount an rbd device where the CSM release tarball can be stored.

    This mounts the rbd device at /etc/cray/upgrade/csm on ncn-m001. This mount is available to stage content for the install/upgrade process.

    For more information about the tool used in this procedure, including troubleshooting information, see CSM RBD Tool Usage.

    1. Initialize the Python virtual environment.

      tar xvf /usr/share/doc/csm/scripts/csm_rbd_tool.tar.gz -C /opt/cray/csm/scripts/
      
    2. Check if the rbd device already exists.

      source /opt/cray/csm/scripts/csm_rbd_tool/bin/activate
      /usr/share/doc/csm/scripts/csm_rbd_tool.py --status
      
      • Expected output if rbd device does not exist:

        Pool csm_admin_pool does not exist
        Pool csm_admin_pool exists: False
        RBD device exists None
        
      • Example output if rbd device already exists and is mounted on ncn-m002:

        [{"id":"0","pool":"csm_admin_pool","namespace":"","name":"csm_scratch_img","snap":"-","device":"/dev/rbd0"}]
        Pool csm_admin_pool exists: True
        RBD device exists True
        RBD device mounted at - ncn-m002.nmn:/etc/cray/upgrade/csm
        
    3. Perform one of the following options based on the output of the status check.

      • The rbd device does not exist.

        1. Create and map the rbd device.

          /usr/share/doc/csm/scripts/csm_rbd_tool.py --pool_action create --rbd_action create --target_host ncn-m001
          deactivate
          
      • The rbd device exists.

        1. Move the device to ncn-m001, if necessary.

          This step is not necessary if the status output indicated that the device is already mounted on ncn-m001.

          /usr/share/doc/csm/scripts/csm_rbd_tool.py --rbd_action move --target_host ncn-m001
          deactivate
          
        2. Remove leftover state file from a previous CSM upgrade, if necessary.

          IMPORTANT: If upgrading from a CSM version that had previously mounted this rbd device, then the /etc/cray/upgrade/csm/myenv file must be removed before proceeding with this upgrade, because it contains information from the previous upgrade.

          [[ -f /etc/cray/upgrade/csm/myenv ]] && rm -f /etc/cray/upgrade/csm/myenv
          
  4. Follow either the Direct download or Manual copy procedure.

    • If there is a URL for the CSM tar file that is accessible from ncn-m001, then the Direct download procedure may be used.
    • Alternatively, the Manual copy procedure may be used, which includes manually copying the CSM tar file to ncn-m001.
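
The choice made in the rbd device step above (create the device, move it, or leave it in place) depends only on the status output. As a rough illustration, that decision can be sketched as a filter over the status text. The function name rbd_next_action is hypothetical, and the parsing assumes the status lines look exactly like the example output shown in that step; the real tool output may differ.

```shell
# Hypothetical helper (not part of CSM). Reads "csm_rbd_tool.py --status"
# output on stdin and prints which option applies for the given target host:
#   create - the rbd device does not exist
#   move   - the device exists but is not reported as mounted on the target
#   none   - the device is already mounted on the target host
# ASSUMPTION: line formats are taken from the example output in this document.
rbd_next_action() {
    rbd_status="$(cat)"
    # Extract the short host name from "RBD device mounted at - <host>.nmn:<path>".
    rbd_host="$(printf '%s\n' "$rbd_status" |
        sed -n 's/.*RBD device mounted at - \([^.:]*\).*/\1/p')"
    if ! printf '%s\n' "$rbd_status" | grep -q 'RBD device exists True'; then
        echo create
    elif [ "$rbd_host" = "$1" ]; then
        echo none
    else
        echo move
    fi
}

# Example using the sample output for a device mounted on ncn-m002.
printf '%s\n' \
    'Pool csm_admin_pool exists: True' \
    'RBD device exists True' \
    'RBD device mounted at - ncn-m002.nmn:/etc/cray/upgrade/csm' |
    rbd_next_action ncn-m001
```

With the sample input, the sketch prints move. Even when the result is none, the leftover myenv state file from a previous upgrade must still be removed as described above.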

Direct download

  1. (ncn-m001#) Set the ENDPOINT variable to the URL of the directory containing the CSM release tar file.

    In other words, the full URL to the CSM release tar file must be ${ENDPOINT}${CSM_REL_NAME}.tar.gz

    NOTE: This step is optional for Cray/HPE internal installs, if ncn-m001 can reach the internet.

    ENDPOINT=https://put.the/url/here/
    
  2. (ncn-m001#) Set the HTTP proxy variables, only if needed.

    This step should ONLY be performed if an HTTP proxy is required to access a public endpoint on the internet for the purpose of downloading artifacts. CSM does NOT support the use of proxy servers for anything other than downloading artifacts from external endpoints. The proxy variables must be unset after the desired artifacts are downloaded; failure to unset them will cause many failures in subsequent steps.

    export https_proxy=https://example.proxy.net:443
    export http_proxy=http://example.proxy.net:80
    
  3. (ncn-m001#) Run the script.

    NOTE: For Cray/HPE internal installs, if ncn-m001 can reach the internet, then the --endpoint argument may be omitted.

    The prepare-assets.sh script will delete the CSM tarball (after expanding it) in order to free up space. This behavior can be overridden by appending the --no-delete-tarball-file argument to the prepare-assets.sh command below.

    /usr/share/doc/csm/upgrade/scripts/upgrade/prepare-assets.sh --csm-version ${CSM_RELEASE} --endpoint "${ENDPOINT}"
    
  4. (ncn-m001#) Unset the HTTP proxy variables.

    This step must be performed if an HTTP proxy was set previously.

    unset https_proxy
    unset http_proxy
    
  5. Skip the Manual copy subsection and proceed to Stage 0.2 - Prerequisites.
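
Two details of the Direct download procedure are easy to get wrong: the literal concatenation of ENDPOINT with the tar file name, and forgetting to unset the proxy variables. The sketch below illustrates both. The function names csm_tarball_url and with_proxy are hypothetical and not part of CSM; with_proxy is an alternative to the export and unset steps above in which the variables exist only inside a subshell.

```shell
# Hypothetical helper: print the tar file URL implied by an endpoint and a
# CSM version, using the literal concatenation ${ENDPOINT}${CSM_REL_NAME}.tar.gz.
# Warns on stderr if the endpoint lacks the trailing slash that this requires.
csm_tarball_url() {
    case "$1" in
        */) ;;
        *) echo "warning: ENDPOINT does not end with '/'" >&2 ;;
    esac
    printf '%scsm-%s.tar.gz\n' "$1" "$2"
}

# Hypothetical helper: run one command with the proxy variables set.
# The function body is a subshell, so the variables never reach the caller.
# Usage: with_proxy <https_proxy_url> <http_proxy_url> <command> [args...]
with_proxy() (
    https_proxy="$1" http_proxy="$2"
    export https_proxy http_proxy
    shift 2
    "$@"
)

# Example: show the URL that would be used for the placeholder endpoint.
csm_tarball_url https://put.the/url/here/ 1.3.0
```

For example, under these assumptions the script could be run behind a proxy as follows, with no separate unset step:

    with_proxy https://example.proxy.net:443 http://example.proxy.net:80 \
        /usr/share/doc/csm/upgrade/scripts/upgrade/prepare-assets.sh --csm-version ${CSM_RELEASE} --endpoint "${ENDPOINT}"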

Manual copy

  1. Copy the CSM release tar file to ncn-m001.

    See Update Product Stream.

  2. (ncn-m001#) Set the CSM_TAR_PATH variable to the full path to the CSM tar file on ncn-m001.

    CSM_TAR_PATH=/path/to/${CSM_REL_NAME}.tar.gz
    
  3. (ncn-m001#) Run the script.

    The prepare-assets.sh script will delete the CSM tarball (after expanding it) in order to free up space. This behavior can be overridden by appending the --no-delete-tarball-file argument to the prepare-assets.sh command below.

    /usr/share/doc/csm/upgrade/scripts/upgrade/prepare-assets.sh --csm-version ${CSM_RELEASE} --tarball-file "${CSM_TAR_PATH}"
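
Before running the script in the Manual copy procedure, it can be worth confirming that CSM_TAR_PATH points at a real, non-empty file. The following is a minimal sketch; the function name check_csm_tarball is hypothetical, and the file name comparison reflects only the example path above (csm-<version>.tar.gz), so a mismatch is reported as a warning rather than an error.

```shell
# Hypothetical pre-check (not part of CSM).
# Usage: check_csm_tarball <path_to_tar_file> <csm_version>
# Fails if the file is missing or empty. Warns if the file name differs from
# the csm-<version>.tar.gz pattern used in the example path.
check_csm_tarball() {
    if [ ! -s "$1" ]; then
        echo "error: missing or empty file: $1" >&2
        return 1
    fi
    if [ "$(basename "$1")" != "csm-$2.tar.gz" ]; then
        echo "warning: file name is not csm-$2.tar.gz" >&2
    fi
    echo "ok: $1"
}
```

For example:

    check_csm_tarball "${CSM_TAR_PATH}" "${CSM_RELEASE}"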
    

Stage 0.2 - Prerequisites

  1. (ncn-m001#) Set the SW_ADMIN_PASSWORD environment variable.

    Set it to the password for the admin user on the switches. This is needed for preflight tests within the check script.

    NOTE: read -s is used to prevent the password from being written to the screen or the shell history.

    read -s SW_ADMIN_PASSWORD
    export SW_ADMIN_PASSWORD
    
  2. (ncn-m001#) Set the NEXUS_PASSWORD variable only if needed.

    IMPORTANT: If the password for the local Nexus admin account has been changed from the password set in the nexus-admin-credential secret (not typical), then set the NEXUS_PASSWORD environment variable to the correct admin password and export it, before running prerequisites.sh.

    For example:

    read -s is used to prevent the password from being written to the screen or the shell history.

    read -s NEXUS_PASSWORD
    export NEXUS_PASSWORD
    

    If the password was changed and NEXUS_PASSWORD is not set, then the upgrade will try to use the password in the nexus-admin-credential secret and will fail to upgrade Nexus.

  3. (ncn-m001#) Run the script.

    /usr/share/doc/csm/upgrade/scripts/upgrade/prerequisites.sh --csm-version ${CSM_RELEASE}
    

    If the script ran correctly, it should end with the following output:

    [OK] - Successfully completed
    

    If the script does not end with this output, then try rerunning it. If it still fails, see Upgrade Troubleshooting. If the failure persists, then open a support ticket for guidance before proceeding.

  4. (ncn-m001#) Unset the NEXUS_PASSWORD variable, if it was set in the earlier step.

    unset NEXUS_PASSWORD
    
  5. (Optional) (ncn-m001#) Commit changes to customizations.yaml.

    customizations.yaml has been updated in this procedure. If using an external Git repository for managing customizations as recommended, then clone a local working tree and commit appropriate changes to customizations.yaml.

    For example:

    git clone <URL> site-init
    cd site-init
    kubectl -n loftsman get secret site-init -o jsonpath='{.data.customizations\.yaml}' | base64 -d - > customizations.yaml
    git add customizations.yaml
    git commit -m 'CSM 1.3 upgrade - customizations.yaml'
    git push
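
Several scripts in this stage report success on their final line of output, for example [OK] - Successfully completed for prerequisites.sh. If the output is also saved to a file, that check can be automated. The following is a minimal sketch; the function name last_line_matches and the log file name used below are illustrative and not part of CSM.

```shell
# Hypothetical helper (not part of CSM): succeed only if the last non-empty
# line read from stdin contains the expected text.
# Usage: last_line_matches <expected_text> < logfile
last_line_matches() {
    llm_last="$(grep -v '^[[:space:]]*$' | tail -n 1)"
    case "$llm_last" in
        *"$1"*) return 0 ;;   # quoted, so the text is matched literally
        *) return 1 ;;
    esac
}

# Example with sample output: prints "success" because the last line matches.
if printf 'step 1\n[OK] - Successfully completed\n' |
    last_line_matches '[OK] - Successfully completed'; then
    echo success
fi
```

For example, assuming the script output was saved to /root/prerequisites.log with tee:

    last_line_matches '[OK] - Successfully completed' < /root/prerequisites.log && echo success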
    

Stage 0.3 - Customize the new NCN image and update NCN personalization configurations

There are two possible scenarios. Follow the procedure for the scenario that is applicable to the upgrade being performed.

While the names are similar, image customization is different from node personalization. Image customization is the process of using Ansible stored in VCS in conjunction with the CFS and IMS microservices to customize an image before it is booted. Node personalization is the process of using Ansible stored in VCS in conjunction with the CFS and IMS microservices to personalize a node after it has booted.

  • Standard upgrade - Upgrading CSM on a system that has products installed other than CSM.
  • CSM-only system upgrade - Upgrading only CSM on a CSM-only system, with no other products installed or being upgraded.

Standard upgrade

In most cases, administrators will be performing a standard upgrade and not a CSM-only system upgrade. In the standard upgrade, the new worker NCN images must be customized, and all NCNs must have their personalization configurations updated in CFS.

NOTE: For the standard upgrade, it will not be possible to rebuild NCNs on the current, pre-upgraded CSM version after performing these steps. From this point on, rebuilding an NCN is equivalent to upgrading it.

  1. Prepare the pre-boot worker NCN image customizations.

    This will ensure that the CFS configuration layers are applied to perform image customization for the worker NCNs. See Worker Image Customization.

  2. Prepare the post-boot NCN personalizations.

    This will ensure that the appropriate CFS configuration layers are applied when performing post-boot node personalization of the master, storage, and worker NCNs. See NCN Node Personalization.

Continue on to Stage 0.4, skipping the CSM-only system upgrade subsection below.

CSM-only system upgrade

This upgrade scenario is extremely uncommon in production environments.

  1. (ncn-m001#) Generate a new CFS configuration for the NCNs.

    This script will also leave CFS disabled for the NCNs. CFS will automatically be re-enabled on them as they are rebooted during the upgrade.

    /usr/share/doc/csm/scripts/operations/configuration/apply_csm_configuration.sh --no-enable
    

    Successful output should end with the following line:

    All components updated successfully.
    

Stage 0.4 - Backup workload manager data

To prevent any possibility of losing workload manager configuration data or files, a backup is required. Execute all backup procedures (for the workload manager in use) located in the Troubleshooting and Administrative Tasks sub-section of the Install a Workload Manager section of the HPE Cray Programming Environment Installation Guide: CSM on HPE Cray EX. Store the resulting backup data in a safe location off the system.

Stage 0.5 - Regenerate Postgres backups

The current Postgres opt-in backups need to be re-generated to fix a known issue.

  1. (ncn-m001#) Load the updated cray-postgres-db-backup image into the local Nexus registry.

    NOTE: This step is only necessary if upgrading to CSM 1.3.0. If upgrading to CSM 1.3.1 (estimated to be released the second week of January, 2023), then proceed to Step 2.

    • If ncn-m001 has internet access, then use the following commands.

      NEXUS_USERNAME="$(kubectl -n nexus get secret nexus-admin-credential --template {{.data.username}} | base64 -d)"
      NEXUS_PASSWORD="$(kubectl -n nexus get secret nexus-admin-credential --template {{.data.password}} | base64 -d)"
      podman run --rm --network host quay.io/skopeo/stable copy --dest-tls-verify=false --dest-creds "${NEXUS_USERNAME}:${NEXUS_PASSWORD}" \
          docker://artifactory.algol60.net/csm-docker/stable/cray-postgres-db-backup:0.2.3 \
          docker://registry.local/artifactory.algol60.net/csm-docker/stable/cray-postgres-db-backup:0.2.3
      
    • Otherwise, use the following procedure.

      1. Save the image to a tar file from a system that does have access to the internet.

        podman pull docker://artifactory.algol60.net/csm-docker/stable/cray-postgres-db-backup:0.2.3
        podman save -o cray-postgres-db-backup.tar artifactory.algol60.net/csm-docker/stable/cray-postgres-db-backup:0.2.3
        
      2. Copy cray-postgres-db-backup.tar to /root on the target system.

      3. Copy the image from the tar file into the local registry on the target system.

        NEXUS_USERNAME="$(kubectl -n nexus get secret nexus-admin-credential --template {{.data.username}} | base64 -d)"
        NEXUS_PASSWORD="$(kubectl -n nexus get secret nexus-admin-credential --template {{.data.password}} | base64 -d)"
        podman run --rm --network host -v /root:/mnt quay.io/skopeo/stable copy --dest-tls-verify=false --dest-creds "${NEXUS_USERNAME}:${NEXUS_PASSWORD}" \
            docker-archive:/mnt/cray-postgres-db-backup.tar docker://registry.local/artifactory.algol60.net/csm-docker/stable/cray-postgres-db-backup:0.2.3
        
  2. (ncn-m001#) Regenerate the Postgres backups.

    /usr/share/doc/csm/upgrade/scripts/k8s/create_new_postgres_backups.sh
    

    Successful output should end with the following line:

    Postgres backup(s) have been successfully regenerated.
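
Whether the image-loading step applies depends only on the target version, and the destination image reference is the source reference prefixed with registry.local/. The following minimal sketch expresses both rules. The function names are hypothetical and not part of CSM, and the version test encodes only what the NOTE in that step states about CSM 1.3.0.

```shell
# Hypothetical helpers (not part of CSM).

# Succeed if the image-loading step applies to the given target CSM version.
# Per the NOTE in that step, only CSM 1.3.0 requires it.
needs_backup_image_load() {
    [ "$1" = "1.3.0" ]
}

# Print the local registry reference for a source image reference, following
# the source and destination pair shown in the skopeo copy commands.
local_registry_ref() {
    printf 'registry.local/%s\n' "$1"
}

# Example: print the destination reference used in this procedure.
local_registry_ref artifactory.algol60.net/csm-docker/stable/cray-postgres-db-backup:0.2.3
```

For example, the check could be used as:

    needs_backup_image_load "${CSM_RELEASE}" && echo "Load the image before regenerating backups"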
    

Stop typescript

For any typescripts that were started during this stage, stop them with the exit command.

Stage completed

This stage is completed. Continue to Stage 1 - Ceph image upgrade.