Reminders:
- CSM 1.3.0 or higher is required in order to upgrade to CSM 1.4.
- If any problems are encountered and the procedure or command output does not provide relevant guidance, see Relevant troubleshooting links for upgrade-related issues.
Stage 0 has several critical procedures which prepare the environment and verify if the environment is ready for the upgrade.
(ncn-m001#) If a typescript session is already running in the shell, then first stop it with the exit command.
(ncn-m001#) Start a typescript.
script -af /root/csm_upgrade.$(date +%Y%m%d_%H%M%S).stage_0.txt
export PS1='\u@\H \D{%Y-%m-%d} \t \w # '
If additional shells are opened during this procedure, then record those with typescripts as well. When resuming a procedure after a break, always be sure that a typescript is running before proceeding.
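For example, a typescript for an additional shell opened later in the procedure could be started the same way, using a distinguishing file name (the name below is illustrative only):
script -af /root/csm_upgrade.$(date +%Y%m%d_%H%M%S).stage_0.extra_shell.txt
export PS1='\u@\H \D{%Y-%m-%d} \t \w # '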
(ncn-m001#) Set the CSM_RELEASE variable to the target CSM version of this upgrade.
If upgrading to a patch version of CSM, be sure to specify the correct patch version number when setting this variable.
export CSM_RELEASE=1.4.0
(ncn-m001#) Install the latest docs-csm and libcsm RPMs. These should be for the target CSM version of the upgrade, not
the currently installed CSM version. See the short procedure in
Check for latest documentation.
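For illustration only, if the updated RPMs have already been downloaded to ncn-m001, installing them might look like the following; the file names are placeholders, and the linked procedure is the authoritative source for obtaining and installing the correct packages:
rpm -Uvh --force docs-csm-latest.noarch.rpm libcsm-latest.noarch.rpm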
(ncn-m001#) Run the script to create a cephfs file share at /etc/cray/upgrade/csm.
This script creates a new cephfs file share and, if one is detected, unmounts the rbd device that may have been used in a previous version of CSM.
Running this script is a one-time step, needed only on the master node from which the upgrade is being initiated (ncn-m001).
If a previous rbd mount is detected at /etc/cray/upgrade/csm, that content will be remounted and available at /mnt/csm-1.3-rbd.
/usr/share/doc/csm/scripts/mount-cephfs-share.sh
Expected output looks similar to the following:
Found previous CSM release rbd mount, moving to /mnt/csm-1.3-rbd...
Unmounting /etc/cray/upgrade/csm...
Replacing /etc/cray/upgrade/csm with /mnt/csm-1.3-rbd in /etc/fstab...
Mounting /mnt/csm-1.3-rbd to preserve previous upgrade content...
Found s3fs mount at /var/lib/admin-tools, removing...
Unmounting /var/lib/admin-tools...
Removing /var/lib/admin-tools from /etc/fstab...
Creating admin-tools ceph fs share...
Sleeping for five seconds waiting for 3 running mds.admin-tools daemons...
Sleeping for five seconds waiting for 3 running mds.admin-tools daemons...
Sleeping for five seconds waiting for 3 running mds.admin-tools daemons...
Found 3 running mds.admin-tools daemons -- continuing...
Creating admin-tools keyring...
[client.admin-tools]
key = <REDACTED>
export auth(key=<REDACTED>
Adding fstab entry for cephfs share...
Done! /etc/cray/upgrade/csm is mounted as a cephfs share!
NOTE: The following steps are not part of the upgrade procedure; they describe how to access data from previous upgrades stored on an rbd device:
After completing the CSM upgrade, all master nodes will automatically mount the new cephfs file share at /etc/cray/upgrade/csm.
The content from a previous rbd device is still available, and can be accessed by executing the following steps:
mkdir -pv /mnt/csm-1.3-rbd
rbd map csm_admin_pool/csm_scratch_img
mount /dev/rbd0 /mnt/csm-1.3-rbd
If at some point the previous upgrade's artifacts stored on the rbd mount are no longer needed, then the following steps can be followed to remove the rbd:
ceph config set mon mon_allow_pool_delete true
ceph osd pool rm csm_admin_pool csm_admin_pool --yes-i-really-really-mean-it
ceph config set mon mon_allow_pool_delete false
Follow either the Direct download or Manual copy procedure.
If there is a URL for the CSM release tar file that is accessible from ncn-m001, then the Direct download procedure may be used. Otherwise, use the Manual copy procedure, which involves copying the CSM release tar file to ncn-m001.
(ncn-m001#) Set the ENDPOINT variable to the URL of the directory containing the CSM release tar file.
In other words, the full URL to the CSM release tar file must be ${ENDPOINT}/csm-${CSM_RELEASE}.tar.gz
NOTE This step is optional for Cray/HPE internal installs, if ncn-m001 can reach the internet.
ENDPOINT=https://put.the/url/here/
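Optionally, confirm that the tar file URL is reachable from ncn-m001 before running the download script. This check is not part of the documented procedure; if a proxy is required, first export the proxy variables described below.
curl -sSI "${ENDPOINT}/csm-${CSM_RELEASE}.tar.gz" | head -n 1
A response beginning with HTTP/1.1 200 or HTTP/2 200 indicates the file is accessible.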
This step should ONLY be performed if an http proxy is required to access a public endpoint on the internet for the purpose of downloading artifacts.
CSM does NOT support the use of proxy servers for anything other than downloading artifacts from external endpoints.
The http proxy variables must be unset after the desired artifacts are downloaded. Failure to unset the http proxy variables after downloading artifacts will cause many failures in subsequent steps.
Secured:
export https_proxy=https://example.proxy.net:443
Unsecured:
export http_proxy=http://example.proxy.net:80
(ncn-m001#) Run the script.
NOTE For Cray/HPE internal installs, if ncn-m001 can reach the internet, then the --endpoint argument may be omitted.
The prepare-assets.sh script will delete the CSM tarball (after expanding it) in order to free up space. This behavior can be overridden by appending the --no-delete-tarball-file argument to the prepare-assets.sh command below.
/usr/share/doc/csm/upgrade/scripts/upgrade/prepare-assets.sh --csm-version ${CSM_RELEASE} --endpoint "${ENDPOINT}"
This step must be performed if an http proxy was set previously.
unset https_proxy
unset http_proxy
Skip the Manual copy subsection and proceed to Stage 0.2 - Prerequisites.
Copy the CSM release tar file to ncn-m001.
(ncn-m001#) Set the CSM_TAR_PATH variable to the full path to the CSM tar file on ncn-m001.
CSM_TAR_PATH=/path/to/csm-${CSM_RELEASE}.tar.gz
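Optionally, confirm that the file exists and is readable before running the script (an informal check, not part of the documented procedure):
ls -lh "${CSM_TAR_PATH}"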
(ncn-m001#) Run the script.
The prepare-assets.sh script will delete the CSM tarball (after expanding it) in order to free up space. This behavior can be overridden by appending the --no-delete-tarball-file argument to the prepare-assets.sh command below.
/usr/share/doc/csm/upgrade/scripts/upgrade/prepare-assets.sh --csm-version ${CSM_RELEASE} --tarball-file "${CSM_TAR_PATH}"
(ncn-m001#) Set the SW_ADMIN_PASSWORD environment variable.
Set it to the password for the admin user on the switches. This is needed for preflight tests within the check script.
NOTE: read -s is used to prevent the password from being written to the screen or the shell history.
read -s SW_ADMIN_PASSWORD
export SW_ADMIN_PASSWORD
(ncn-m001#) Set the NEXUS_PASSWORD variable only if needed.
IMPORTANT: If the password for the local Nexus admin account has been changed from the password set in the nexus-admin-credential secret (not typical), then set the NEXUS_PASSWORD environment variable to the correct admin password and export it before running prerequisites.sh. For example:
NOTE: read -s is used to prevent the password from being written to the screen or the shell history.
read -s NEXUS_PASSWORD
export NEXUS_PASSWORD
Otherwise, the upgrade will try to use the password in the nexus-admin-credential secret and fail to upgrade Nexus.
(ncn-m001#) Run the script.
/usr/share/doc/csm/upgrade/scripts/upgrade/prerequisites.sh --csm-version ${CSM_RELEASE}
If the script ran correctly, it should end with the following output:
[OK] - Successfully completed
If the script does not end with this output, then try rerunning it. If it still fails, see Upgrade Troubleshooting. If the failure persists, then open a support ticket for guidance before proceeding.
(ncn-m001#) Unset the NEXUS_PASSWORD variable, if it was set in the earlier step.
unset NEXUS_PASSWORD
(Optional) (ncn-m001#) Commit changes to customizations.yaml.
customizations.yaml has been updated in this procedure. If using an external Git repository
for managing customizations as recommended, then clone a local working tree and commit
appropriate changes to customizations.yaml.
For example:
git clone <URL> site-init
cd site-init
kubectl -n loftsman get secret site-init -o jsonpath='{.data.customizations\.yaml}' | base64 -d - > customizations.yaml
git add customizations.yaml
git commit -m 'CSM 1.4 upgrade - customizations.yaml'
git push
(ncn-m001#) Run the Ceph latency repair script.
Ceph can begin to exhibit latency over time when upgrading the cluster from previous versions. It is recommended to run the /usr/share/doc/csm/scripts/repair-ceph-latency.sh script, as described in Known Issue: Ceph OSD latency.
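If the linked known issue calls for running the script directly with no additional arguments, the invocation would simply be:
/usr/share/doc/csm/scripts/repair-ceph-latency.sh
Consult the known issue page for the full procedure and any prerequisites before running it.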
If performing an upgrade of CSM and additional HPE Cray EX software products using the IUF, return to the Upgrade CSM and additional products with IUF procedure. Otherwise, if performing an upgrade of only CSM, proceed to Stage 0.3.
This stage updates a CFS configuration used to perform node personalization and image customization of management nodes. It also applies that CFS configuration to the management nodes and customizes the worker node image, if necessary.
Image customization is the process of using Ansible stored in VCS in conjunction with the CFS and IMS microservices to customize an image before it is booted. Node personalization is the process of using Ansible stored in VCS in conjunction with the CFS and IMS microservices to personalize a node after it has booted.
There are several options for this stage. Use the option which applies to the current upgrade scenario.
If performing an upgrade of CSM and additional HPE Cray EX software products, this stage should not be performed. Instead, the Upgrade CSM and additional products with IUF procedure should be followed as described in the first option of the Upgrade CSM procedure, Option 1: Upgrade CSM with additional HPE Cray EX software products.
That procedure will perform the appropriate steps to create a CFS configuration for management nodes and perform management node image customization during the Image Preparation step.
Use this alternative if performing an upgrade of only CSM on a system which has additional HPE Cray EX software products installed. This upgrade scenario is uncommon in production environments. Generally, if performing an upgrade of CSM, you will also be performing an upgrade of additional HPE Cray EX software products as part of an HPC CSM software recipe upgrade. In that case, follow the scenario described above for Upgrade of CSM and additional products.
The following subsection shows how to use IUF input files to perform sat bootprep operations, in this
case to assign images and configurations to management nodes.
sat bootprep with IUF-generated input files
In order to follow this procedure, you will need to know the name of the IUF activity used to
perform the initial installation of the HPE Cray EX software products. See the
Activities section of the IUF documentation for more
information on IUF activities. See list-activities
for information about listing the IUF activities on the system. The first step provides an
example showing how to find the IUF activity.
(ncn-m001#) Find the IUF activity used for the most recent install of the system.
iuf list-activities
This will output a list of IUF activity names. For example, if only a single install has been performed on this system of the 24.01 recipe, the output may show a single line like this:
24.01-recipe-install
(ncn-m001#) Record the most recent IUF activity name and directory in environment variables.
export ACTIVITY_NAME=
export ACTIVITY_DIR="/etc/cray/upgrade/csm/iuf/${ACTIVITY_NAME}"
(ncn-m001#) Record the media directory used for this activity in an environment variable.
export MEDIA_DIR="$(yq r "${ACTIVITY_DIR}/state/stage_hist.yaml" 'summary.media_dir')"
echo "${MEDIA_DIR}"
This should display a path to a media directory. For example:
/etc/cray/upgrade/csm/media/24.01-recipe-install
(ncn-m001#) Create a directory for the sat bootprep input files and the session_vars.yaml file.
This example uses a directory under the RBD mount used by the IUF:
export BOOTPREP_DIR="/etc/cray/upgrade/csm/admin/bootprep-csm-${CSM_RELEASE}"
mkdir -pv "${BOOTPREP_DIR}"
(ncn-m001#) Copy the sat bootprep input file for management nodes into the directory.
It is possible that the file name will differ from management-bootprep.yaml if a different
file was used during the IUF activity.
cp -pv "${MEDIA_DIR}/.bootprep-${ACTIVITY_NAME}/management-bootprep.yaml" "${BOOTPREP_DIR}"
(ncn-m001#) Copy the session_vars.yaml file into the directory.
cp -pv "${ACTIVITY_DIR}/state/session_vars.yaml" "${BOOTPREP_DIR}"
(ncn-m001#) Modify the CSM version in the copied session_vars.yaml:
yq w -i "${BOOTPREP_DIR}/session_vars.yaml" 'csm.version' "${CSM_RELEASE}"
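Optionally, confirm that the value was updated (an informal check using the same yq syntax as the surrounding steps):
yq r "${BOOTPREP_DIR}/session_vars.yaml" 'csm.version'
This should print the value of ${CSM_RELEASE}.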
(ncn-m001#) Update the working_branch if one is used for the CSM product.
By default, a working_branch is not used for the CSM product. Check if there is a
working_branch specified for CSM:
yq r "${BOOTPREP_DIR}/session_vars.yaml" 'csm.working_branch'
If this produces no output, a working_branch is not in use for the CSM product, and this step
can be skipped. Otherwise, it shows the name of the working branch. For example:
integration-1.4.0
In this case, be sure to manually update the version string in the working branch name to match the new CSM version, and then check it again. For example:
yq w -i "${BOOTPREP_DIR}/session_vars.yaml" 'csm.working_branch' "integration-${CSM_RELEASE}"
yq r "${BOOTPREP_DIR}/session_vars.yaml" 'csm.working_branch'
This should output the name of the new CSM working branch.
(ncn-m001#) Modify the default.suffix value in the copied session_vars.yaml:
As long as the sat bootprep input file uses {{default.suffix}} in the names of the CFS
configurations and IMS images, this will ensure new CFS configurations and IMS images are created
with different names from the ones created in the IUF activity.
yq w -i -- "${BOOTPREP_DIR}/session_vars.yaml" 'default.suffix' "-csm-${CSM_RELEASE}"
(ncn-m001#) Change directory to the BOOTPREP_DIR and run sat bootprep.
This will create a CFS configuration for management nodes, and it will use that CFS configuration to customize the images for the master, worker, and storage management nodes.
cd "${BOOTPREP_DIR}"
sat bootprep run --vars-file session_vars.yaml management-bootprep.yaml
(ncn-m001#) Gather the CFS configuration name, and the IMS image names from the output of sat bootprep.
sat bootprep will print a report summarizing the CFS configuration and IMS images it created.
For example:
################################################################################
CFS configurations
################################################################################
+-----------------------------+
| name |
+-----------------------------+
| management-22.4.0-csm-x.y.z |
+-----------------------------+
################################################################################
IMS images
################################################################################
+-----------------------------+--------------------------------------+--------------------------------------+-----------------------------+----------------------------+
| name | preconfigured_image_id | final_image_id | configuration | configuration_group_names |
+-----------------------------+--------------------------------------+--------------------------------------+-----------------------------+----------------------------+
| master-secure-kubernetes | c1bcaf00-109d-470f-b665-e7b37dedb62f | a22fb912-22be-449b-a51b-081af2d7aff6 | management-22.4.0-csm-x.y.z | Management_Master |
| worker-secure-kubernetes | 8b1343c4-1c39-4389-96cb-ccb2b7fb4305 | 241822c3-c7dd-44f8-98ca-0e7c7c6426d5 | management-22.4.0-csm-x.y.z | Management_Worker |
| storage-secure-storage-ceph | f3dd7492-c4e5-4bb2-9f6f-8cfc9f60526c | 79ab3d85-274d-4d01-9e2b-7c25f7e108ca | storage-22.4.0-csm-x.y.z | Management_Storage |
+-----------------------------+--------------------------------------+--------------------------------------+-----------------------------+----------------------------+
Save the names of the CFS configurations from the configuration column:
Note that the storage node configuration might be titled minimal-management- or storage- depending on the value set in the sat bootprep file.
The following uses the values from the example output above. Be sure to modify them to match the actual values.
export KUBERNETES_CFS_CONFIG_NAME="management-22.4.0-csm-x.y.z"
export STORAGE_CFS_CONFIG_NAME="storage-22.4.0-csm-x.y.z"
Save the name of the IMS images from the final_image_id column:
The following uses the values from the example output above. Be sure to modify them to match the actual values.
export MASTER_IMAGE_ID="a22fb912-22be-449b-a51b-081af2d7aff6"
export WORKER_IMAGE_ID="241822c3-c7dd-44f8-98ca-0e7c7c6426d5"
export STORAGE_IMAGE_ID="79ab3d85-274d-4d01-9e2b-7c25f7e108ca"
(ncn-m001#) Assign the images to the management nodes in BSS.
Master management nodes:
/usr/share/doc/csm/scripts/operations/node_management/assign-ncn-images.sh -m -p "$MASTER_IMAGE_ID"
Storage management nodes:
/usr/share/doc/csm/scripts/operations/node_management/assign-ncn-images.sh -s -p "$STORAGE_IMAGE_ID"
Worker management nodes:
/usr/share/doc/csm/scripts/operations/node_management/assign-ncn-images.sh -w -p "$WORKER_IMAGE_ID"
(ncn-m001#) Assign the CFS configuration to the management nodes.
This deliberately only sets the desired configuration of the components in CFS. It disables the components and does not clear their configuration states or error counts. When the nodes are rebooted to their new images later in the CSM upgrade, they will automatically be enabled in CFS, and node personalization will occur.
Get the xnames of the master and worker management nodes.
WORKER_XNAMES=$(cray hsm state components list --role Management --subrole Worker --type Node --format json |
jq -r '.Components | map(.ID) | join(",")')
MASTER_XNAMES=$(cray hsm state components list --role Management --subrole Master --type Node --format json |
jq -r '.Components | map(.ID) | join(",")')
echo "${MASTER_XNAMES},${WORKER_XNAMES}"
Apply the CFS configuration to master nodes and worker nodes using the xnames and CFS configuration name found in the previous steps.
/usr/share/doc/csm/scripts/operations/configuration/apply_csm_configuration.sh \
--no-config-change --config-name "${KUBERNETES_CFS_CONFIG_NAME}" --no-enable --no-clear-err \
--xnames ${MASTER_XNAMES},${WORKER_XNAMES}
Successful output will end with the following:
All components updated successfully.
Get the xnames of the storage management nodes.
STORAGE_XNAMES=$(cray hsm state components list --role Management --subrole Storage --type Node --format json |
jq -r '.Components | map(.ID) | join(",")')
echo $STORAGE_XNAMES
Apply the CFS configuration to storage nodes using the xnames and CFS configuration name found in the previous steps.
/usr/share/doc/csm/scripts/operations/configuration/apply_csm_configuration.sh \
--no-config-change --config-name "${STORAGE_CFS_CONFIG_NAME}" --no-enable --no-clear-err \
--xnames ${STORAGE_XNAMES}
Successful output will end with the following:
All components updated successfully.
Continue on to Stage 0.4.
Use this alternative if performing an upgrade of CSM on a CSM-only system with no other HPE Cray EX software products installed. This upgrade scenario is extremely uncommon in production environments.
(ncn-m001#) Generate a new CFS configuration for the management nodes.
This script creates a new CFS configuration that includes the CSM version in its name and applies it to the management nodes. This leaves the management node components in CFS disabled. They will be automatically enabled when they are rebooted at a later stage in the upgrade.
/usr/share/doc/csm/scripts/operations/configuration/apply_csm_configuration.sh \
--no-enable --config-name management-${CSM_RELEASE}
Successful output should end with the following line:
All components updated successfully.
Continue on to Stage 0.4.
To prevent any possibility of losing workload manager configuration data or files, a backup is required. Execute all backup procedures (for the workload manager in use) located in
the Troubleshooting and Administrative Tasks sub-section of the Install a Workload Manager section of the
HPE Cray Programming Environment Installation Guide: CSM on HPE Cray EX. The resulting backup data should be stored in a safe location off of the system.
If performing an upgrade of CSM and additional HPE Cray EX software products using the IUF, return to the Upgrade CSM and additional products with IUF procedure. Otherwise, if performing an upgrade of only CSM, proceed to the next step.
CSM v1.4.x -> CSM v1.4.4 Patch: If you arrived here by following the CSM v1.4.x -> CSM v1.4.4 patch directions, then move on to Storage nodes in-place update. If you arrived here while upgrading from CSM 1.3.x or earlier, continue on to Stage 0.5.
IMPORTANT If performing an upgrade to CSM 1.4.0 or 1.4.1, then skip this step. This step should only be done during an upgrade to CSM 1.4 patch version 1.4.2 or later.
Note: This step may not be necessary if it was already completed by the CSM v1.3.5 patch.
If it was already run, the following steps can be re-executed to verify that Ceph daemons are using images
in Nexus and the local Docker registries have been stopped.
These steps will upgrade Ceph to v16.2.13. Then the Ceph monitoring daemons’ images will be pushed to Nexus and the monitoring daemons will be redeployed so that they use these images in Nexus.
Once this is complete, all Ceph daemons should be using images in Nexus and not images hosted in the local Docker registry on storage nodes.
The third step stops the local Docker registry on all storage nodes.
(ncn-m001#) Run Ceph upgrade to v16.2.13.
/usr/share/doc/csm/upgrade/scripts/ceph/ceph-upgrade-tool.py --version "v16.2.13"
(ncn-m001#) Redeploy Ceph monitoring daemons so they are using images in Nexus.
scp /usr/share/doc/csm/scripts/operations/ceph/redeploy_monitoring_stack_to_nexus.sh ncn-s001:/srv/cray/scripts/common/redeploy_monitoring_stack_to_nexus.sh
ssh ncn-s001 "/srv/cray/scripts/common/redeploy_monitoring_stack_to_nexus.sh"
(ncn-m001#) Stop the local Docker registries on all storage nodes.
scp /usr/share/doc/csm/scripts/operations/ceph/disable_local_registry.sh ncn-s001:/srv/cray/scripts/common/disable_local_registry.sh
ssh ncn-s001 "/srv/cray/scripts/common/disable_local_registry.sh"
Smartmon Metrics on Storage NCNs
IMPORTANT If performing an upgrade to CSM 1.4.0 or 1.4.1, then skip this step. This step should only be done during an upgrade to CSM 1.4 patch version 1.4.2 or later.
This step will install the smart-mon rpm on storage nodes, and reconfigure the node-exporter to provide smartmon metrics.
(ncn-m001#) Execute the following script.
/usr/share/doc/csm/scripts/operations/ceph/enable-smart-mon-storage-nodes.sh
If the default boot timeout (10 minutes) needs to be adjusted, then add REBOOT_TIMEOUT_IN_SECONDS to /etc/cray/upgrade/csm/myenv.
For example:
export CSM_ARTI_DIR=/etc/cray/upgrade/csm/csm-1.4.1/tarball/csm-1.4.1
export CSM_RELEASE=1.4.1
export CSM_REL_NAME=csm-1.4.1
...
REBOOT_TIMEOUT_IN_SECONDS=999
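For example, the variable could be appended to the file as follows (the timeout value shown is illustrative only):
echo "REBOOT_TIMEOUT_IN_SECONDS=999" >> /etc/cray/upgrade/csm/myenv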
For any typescripts that were started during this stage, stop them with the exit command.
This stage is completed. Continue to Stage 1 - Kubernetes Upgrade.