Reminders:
- CSM 1.3.0 or higher is required in order to upgrade to CSM 1.4.
- If any problems are encountered and the procedure or command output does not provide relevant guidance, see Relevant troubleshooting links for upgrade-related issues.
Stage 0 has several critical procedures which prepare the environment and verify that the environment is ready for the upgrade.
(`ncn-m001#`) If a typescript session is already running in the shell, then first stop it with the `exit` command.
(`ncn-m001#`) Start a typescript.

```bash
script -af /root/csm_upgrade.$(date +%Y%m%d_%H%M%S).stage_0.txt
export PS1='\u@\H \D{%Y-%m-%d} \t \w # '
```
If additional shells are opened during this procedure, then record those with typescripts as well. When resuming a procedure after a break, always be sure that a typescript is running before proceeding.
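If it is unclear whether the current shell is already being recorded, one quick check (a minimal sketch, assuming the `pstree` utility is available on the NCN) is to look for a `script` process in the shell's ancestry:

```bash
# Print the ancestry of the current shell; "script" appears in the chain when a typescript is active.
pstree -s $$ | grep -qw script && echo "typescript is running" || echo "no typescript detected"
```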
(`ncn-m001#`) Set the `CSM_RELEASE` variable to the target CSM version of this upgrade.

If upgrading to a patch version of CSM, be sure to specify the correct patch version number when setting this variable.

```bash
export CSM_RELEASE=1.4.0
```
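For example, if the target of this upgrade were the CSM 1.4.4 patch release rather than 1.4.0 (adjust to the actual target version of this upgrade), the variable would be set as follows:

```bash
# Hypothetical patch-release example; substitute the real target version.
export CSM_RELEASE=1.4.4
echo "${CSM_RELEASE}"
```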
(`ncn-m001#`) Install the latest `docs-csm` and `libcsm` RPMs. These should be for the target CSM version of the upgrade, not the currently installed CSM version. See the short procedure in Check for latest documentation.
(`ncn-m001#`) Run the script to create a `cephfs` file share at `/etc/cray/upgrade/csm`.

This script creates a new `cephfs` file share, and it will unmount the `rbd` device that may have been used in a previous version of CSM (if detected). Running this script is a one-time step needed only on the master node from which the upgrade is being initiated (`ncn-m001`). If a previous `rbd` mount is detected at `/etc/cray/upgrade/csm`, that content will be remounted and available at `/mnt/csm-1.3-rbd`.

```bash
/usr/share/doc/csm/scripts/mount-cephfs-share.sh
```
Expected output looks similar to the following:

```text
Found previous CSM release rbd mount, moving to /mnt/csm-1.3-rbd...
Unmounting /etc/cray/upgrade/csm...
Replacing /etc/cray/upgrade/csm with /mnt/csm-1.3-rbd in /etc/fstab...
Mounting /mnt/csm-1.3-rbd to preserve previous upgrade content...
Found s3fs mount at /var/lib/admin-tools, removing...
Unmounting /var/lib/admin-tools...
Removing /var/lib/admin-tools from /etc/fstab...
Creating admin-tools ceph fs share...
Sleeping for five seconds waiting for 3 running mds.admin-tools daemons...
Sleeping for five seconds waiting for 3 running mds.admin-tools daemons...
Sleeping for five seconds waiting for 3 running mds.admin-tools daemons...
Found 3 running mds.admin-tools daemons -- continuing...
Creating admin-tools keyring...
[client.admin-tools]
key = <REDACTED>
export auth(key=<REDACTED>
Adding fstab entry for cephfs share...
Done! /etc/cray/upgrade/csm is mounted as a cephfs share!
```
NOTE: The following steps are not part of the upgrade procedure; they describe how to access data from previous upgrades stored on an `rbd` device.

After completing the CSM upgrade, all master nodes will automatically mount the new `cephfs` file share at `/etc/cray/upgrade/csm`. The content from a previous `rbd` device is still available and can be accessed by executing the following steps:

```bash
mkdir -pv /mnt/csm-1.3-rbd
rbd map csm_admin_pool/csm_scratch_img
mount /dev/rbd0 /mnt/csm-1.3-rbd
```
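When finished with that content, the device can be detached again; a minimal sketch, assuming the mount point and mapping created by the steps above:

```bash
# Unmount the preserved content and release the rbd device mapping.
umount /mnt/csm-1.3-rbd
rbd unmap /dev/rbd0
```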
If at some point the previous upgrade's artifacts stored in an `rbd` mount are no longer needed, the following steps can be used to remove the `rbd` pool:

```bash
ceph config set mon mon_allow_pool_delete true
ceph osd pool rm csm_admin_pool csm_admin_pool --yes-i-really-really-mean-it
ceph config set mon mon_allow_pool_delete false
```
Follow either the Direct download or Manual copy procedure.

- Direct download: If there is a URL endpoint hosting the CSM release `tar` file that is accessible from `ncn-m001`, then the Direct download procedure may be used.
- Manual copy: Otherwise, use the Manual copy procedure to copy the CSM release `tar` file to `ncn-m001`.
(`ncn-m001#`) Set the `ENDPOINT` variable to the URL of the directory containing the CSM release `tar` file.

In other words, the full URL to the CSM release `tar` file must be `${ENDPOINT}/csm-${CSM_RELEASE}.tar.gz`.

NOTE: This step is optional for Cray/HPE internal installs, if `ncn-m001` can reach the internet.

```bash
ENDPOINT=https://put.the/url/here/
```
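As an optional sanity check before running the download script, confirm that the release `tar` file URL is reachable from `ncn-m001`; a minimal sketch, assuming `curl` and outbound network access (set any required proxy variables from the next step first):

```bash
# A successful HEAD request (HTTP 2xx) indicates the tar file can be downloaded from this endpoint.
curl -sSfI "${ENDPOINT}/csm-${CSM_RELEASE}.tar.gz"
```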
This step should ONLY be performed if an http proxy is required to access a public endpoint on the internet for the purpose of downloading artifacts. CSM does NOT support the use of proxy servers for anything other than downloading artifacts from external endpoints. The http proxy variables must be `unset` after the desired artifacts are downloaded. Failure to unset the http proxy variables after downloading artifacts will cause many failures in subsequent steps.

Secured:

```bash
export https_proxy=https://example.proxy.net:443
```

Unsecured:

```bash
export http_proxy=http://example.proxy.net:80
```
(`ncn-m001#`) Run the script.

NOTE: For Cray/HPE internal installs, if `ncn-m001` can reach the internet, then the `--endpoint` argument may be omitted.

The `prepare-assets.sh` script will delete the CSM tarball (after expanding it) in order to free up space. This behavior can be overridden by appending the `--no-delete-tarball-file` argument to the `prepare-assets.sh` command below.

```bash
/usr/share/doc/csm/upgrade/scripts/upgrade/prepare-assets.sh --csm-version ${CSM_RELEASE} --endpoint "${ENDPOINT}"
```
This step must be performed if an http proxy was set previously.

```bash
unset https_proxy
unset http_proxy
```

Skip the Manual copy subsection and proceed to Stage 0.2 - Prerequisites.
Copy the CSM release `tar` file to `ncn-m001`.

(`ncn-m001#`) Set the `CSM_TAR_PATH` variable to the full path to the CSM `tar` file on `ncn-m001`.

```bash
CSM_TAR_PATH=/path/to/csm-${CSM_RELEASE}.tar.gz
```
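As a quick optional check, verify that the variable points at an existing file before continuing:

```bash
# The tar file should be listed with a plausible size; an error here means the path is wrong.
ls -lh "${CSM_TAR_PATH}"
```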
(`ncn-m001#`) Run the script.

The `prepare-assets.sh` script will delete the CSM tarball (after expanding it) in order to free up space. This behavior can be overridden by appending the `--no-delete-tarball-file` argument to the `prepare-assets.sh` command below.

```bash
/usr/share/doc/csm/upgrade/scripts/upgrade/prepare-assets.sh --csm-version ${CSM_RELEASE} --tarball-file "${CSM_TAR_PATH}"
```
(`ncn-m001#`) Set the `SW_ADMIN_PASSWORD` environment variable.

Set it to the password for the `admin` user on the switches. This is needed for preflight tests within the check script.

NOTE: `read -s` is used to prevent the password from being written to the screen or the shell history.

```bash
read -s SW_ADMIN_PASSWORD
export SW_ADMIN_PASSWORD
```
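To confirm the variable was captured without ever echoing the password, a small optional check such as the following can be used:

```bash
# Reports only whether the variable is non-empty; the password itself is never displayed.
[[ -n "${SW_ADMIN_PASSWORD}" ]] && echo "SW_ADMIN_PASSWORD is set" || echo "SW_ADMIN_PASSWORD is empty"
```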
(`ncn-m001#`) Set the `NEXUS_PASSWORD` variable only if needed.

IMPORTANT: If the password for the local Nexus `admin` account has been changed from the password set in the `nexus-admin-credential` secret (not typical), then set the `NEXUS_PASSWORD` environment variable to the correct `admin` password and export it before running `prerequisites.sh`. For example:

NOTE: `read -s` is used to prevent the password from being written to the screen or the shell history.

```bash
read -s NEXUS_PASSWORD
export NEXUS_PASSWORD
```

Otherwise, the upgrade will try to use the password in the `nexus-admin-credential` secret and fail to upgrade Nexus.
(`ncn-m001#`) Run the script.

```bash
/usr/share/doc/csm/upgrade/scripts/upgrade/prerequisites.sh --csm-version ${CSM_RELEASE}
```

If the script ran correctly, it should end with the following output:

```text
[OK] - Successfully completed
```

If the script does not end with this output, then try rerunning it. If it still fails, see Upgrade Troubleshooting. If the failure persists, then open a support ticket for guidance before proceeding.
(`ncn-m001#`) Unset the `NEXUS_PASSWORD` variable, if it was set in the earlier step.

```bash
unset NEXUS_PASSWORD
```
(Optional) (`ncn-m001#`) Commit changes to `customizations.yaml`.

`customizations.yaml` has been updated in this procedure. If using an external Git repository for managing customizations as recommended, then clone a local working tree and commit the appropriate changes to `customizations.yaml`.

For example:

```bash
git clone <URL> site-init
cd site-init
kubectl -n loftsman get secret site-init -o jsonpath='{.data.customizations\.yaml}' | base64 -d - > customizations.yaml
git add customizations.yaml
git commit -m 'CSM 1.3 upgrade - customizations.yaml'
git push
```
(`ncn-m001#`) Run the Ceph latency repair script.

Ceph can begin to exhibit latency over time when upgrading the cluster from previous versions. It is recommended to run the `/usr/share/doc/csm/scripts/repair-ceph-latency.sh` script as described in Known Issue: Ceph OSD latency.
If performing an upgrade of CSM and additional HPE Cray EX software products using the IUF, return to the Upgrade CSM and additional products with IUF procedure. Otherwise, if performing an upgrade of only CSM, proceed to Stage 0.3.
This stage updates a CFS configuration used to perform node personalization and image customization of management nodes. It also applies that CFS configuration to the management nodes and customizes the worker node image, if necessary.
Image customization is the process of using Ansible stored in VCS in conjunction with the CFS and IMS microservices to customize an image before it is booted. Node personalization is the process of using Ansible stored in VCS in conjunction with the CFS and IMS microservices to personalize a node after it has booted.
There are several options for this stage. Use the option which applies to the current upgrade scenario.
If performing an upgrade of CSM and additional HPE Cray EX software products, this stage should not be performed. Instead, the Upgrade CSM and additional products with IUF procedure should be followed, as described in the first option of the Upgrade CSM procedure, Option 1: Upgrade CSM with additional HPE Cray EX software products.
That procedure will perform the appropriate steps to create a CFS configuration for management nodes and perform management node image customization during the Image Preparation step.
Use this alternative if performing an upgrade of only CSM on a system which has additional HPE Cray EX software products installed. This upgrade scenario is uncommon in production environments. Generally, if performing an upgrade of CSM, you will also be performing an upgrade of additional HPE Cray EX software products as part of an HPC CSM software recipe upgrade. In that case, follow the scenario described above for Upgrade of CSM and additional products.
The following subsection shows how to use IUF input files to perform `sat bootprep` operations, in this case to assign images and configurations to management nodes.

### `sat bootprep` with IUF generated input files

In order to follow this procedure, you will need to know the name of the IUF activity used to perform the initial installation of the HPE Cray EX software products. See the Activities section of the IUF documentation for more information on IUF activities. See `list-activities` for information about listing the IUF activities on the system. The first step provides an example showing how to find the IUF activity.
(`ncn-m001#`) Find the IUF activity used for the most recent install of the system.

```bash
iuf list-activities
```

This will output a list of IUF activity names. For example, if only a single install of the 24.01 recipe has been performed on this system, the output may show a single line like this:

```text
24.01-recipe-install
```
(`ncn-m001#`) Record the most recent IUF activity name and directory in environment variables.

```bash
export ACTIVITY_NAME=
export ACTIVITY_DIR="/etc/cray/upgrade/csm/iuf/${ACTIVITY_NAME}"
```
(`ncn-m001#`) Record the media directory used for this activity in an environment variable.

```bash
export MEDIA_DIR="$(yq r "${ACTIVITY_DIR}/state/stage_hist.yaml" 'summary.media_dir')"
echo "${MEDIA_DIR}"
```

This should display a path to a media directory. For example:

```text
/etc/cray/upgrade/csm/media/24.01-recipe-install
```
(`ncn-m001#`) Create a directory for the `sat bootprep` input files and the `session_vars.yaml` file.

This example uses a directory under the RBD mount used by the IUF:

```bash
export BOOTPREP_DIR="/etc/cray/upgrade/csm/admin/bootprep-csm-${CSM_RELEASE}"
mkdir -pv "${BOOTPREP_DIR}"
```
(`ncn-m001#`) Copy the `sat bootprep` input file for management nodes into the directory.

It is possible that the file name will differ from `management-bootprep.yaml` if a different file was used during the IUF activity.

```bash
cp -pv "${MEDIA_DIR}/.bootprep-${ACTIVITY_NAME}/management-bootprep.yaml" "${BOOTPREP_DIR}"
```
(`ncn-m001#`) Copy the `session_vars.yaml` file into the directory.

```bash
cp -pv "${ACTIVITY_DIR}/state/session_vars.yaml" "${BOOTPREP_DIR}"
```
(`ncn-m001#`) Modify the CSM version in the copied `session_vars.yaml`:

```bash
yq w -i "${BOOTPREP_DIR}/session_vars.yaml" 'csm.version' "${CSM_RELEASE}"
```
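Optionally verify the change, mirroring the `yq r` usage elsewhere in this procedure; the output should match the value of `CSM_RELEASE`:

```bash
yq r "${BOOTPREP_DIR}/session_vars.yaml" 'csm.version'
```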
(`ncn-m001#`) Update the `working_branch` if one is used for the CSM product.

By default, a `working_branch` is not used for the CSM product. Check if there is a `working_branch` specified for CSM:

```bash
yq r "${BOOTPREP_DIR}/session_vars.yaml" 'csm.working_branch'
```

If this produces no output, a `working_branch` is not in use for the CSM product, and this step can be skipped. Otherwise, it shows the name of the working branch. For example:

```text
integration-1.4.0
```

In this case, be sure to manually update the version string in the working branch name to match the new CSM version, and then check it again. For example:

```bash
yq w -i "${BOOTPREP_DIR}/session_vars.yaml" 'csm.working_branch' "integration-${CSM_RELEASE}"
yq r "${BOOTPREP_DIR}/session_vars.yaml" 'csm.working_branch'
```

This should output the name of the new CSM working branch.
(`ncn-m001#`) Modify the `default.suffix` value in the copied `session_vars.yaml`.

As long as the `sat bootprep` input file uses `{{default.suffix}}` in the names of the CFS configurations and IMS images, this will ensure new CFS configurations and IMS images are created with different names from the ones created in the IUF activity.

```bash
yq w -i -- "${BOOTPREP_DIR}/session_vars.yaml" 'default.suffix' "-csm-${CSM_RELEASE}"
```
(`ncn-m001#`) Change directory to the `BOOTPREP_DIR` and run `sat bootprep`.

This will create a CFS configuration for management nodes, and it will use that CFS configuration to customize the images for the master, worker, and storage management nodes.

```bash
cd "${BOOTPREP_DIR}"
sat bootprep run --vars-file session_vars.yaml management-bootprep.yaml
```
(`ncn-m001#`) Gather the CFS configuration name and the IMS image names from the output of `sat bootprep`.

`sat bootprep` will print a report summarizing the CFS configuration and IMS images it created. For example:

```text
################################################################################
CFS configurations
################################################################################
+-----------------------------+
| name                        |
+-----------------------------+
| management-22.4.0-csm-x.y.z |
+-----------------------------+
################################################################################
IMS images
################################################################################
+-----------------------------+--------------------------------------+--------------------------------------+-----------------------------+----------------------------+
| name                        | preconfigured_image_id               | final_image_id                       | configuration               | configuration_group_names  |
+-----------------------------+--------------------------------------+--------------------------------------+-----------------------------+----------------------------+
| master-secure-kubernetes    | c1bcaf00-109d-470f-b665-e7b37dedb62f | a22fb912-22be-449b-a51b-081af2d7aff6 | management-22.4.0-csm-x.y.z | Management_Master          |
| worker-secure-kubernetes    | 8b1343c4-1c39-4389-96cb-ccb2b7fb4305 | 241822c3-c7dd-44f8-98ca-0e7c7c6426d5 | management-22.4.0-csm-x.y.z | Management_Worker          |
| storage-secure-storage-ceph | f3dd7492-c4e5-4bb2-9f6f-8cfc9f60526c | 79ab3d85-274d-4d01-9e2b-7c25f7e108ca | storage-22.4.0-csm-x.y.z    | Management_Storage         |
+-----------------------------+--------------------------------------+--------------------------------------+-----------------------------+----------------------------+
```
Save the names of the CFS configurations from the `configuration` column.

Note that the storage node configuration might be titled `minimal-management-` or `storage-` depending on the value set in the `sat bootprep` file.

The following uses the values from the example output above. Be sure to modify them to match the actual values.

```bash
export KUBERNETES_CFS_CONFIG_NAME="management-22.4.0-csm-x.y.z"
export STORAGE_CFS_CONFIG_NAME="storage-22.4.0-csm-x.y.z"
```
Save the names of the IMS images from the `final_image_id` column.

The following uses the values from the example output above. Be sure to modify them to match the actual values.

```bash
export MASTER_IMAGE_ID="a22fb912-22be-449b-a51b-081af2d7aff6"
export WORKER_IMAGE_ID="241822c3-c7dd-44f8-98ca-0e7c7c6426d5"
export STORAGE_IMAGE_ID="79ab3d85-274d-4d01-9e2b-7c25f7e108ca"
```
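Optionally confirm that the recorded IDs refer to existing IMS images before assigning them; a hedged check using the Cray CLI (assuming the CLI is initialized in this shell):

```bash
# Each command should return the IMS image record for the corresponding ID.
cray ims images describe "${MASTER_IMAGE_ID}"
cray ims images describe "${WORKER_IMAGE_ID}"
cray ims images describe "${STORAGE_IMAGE_ID}"
```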
(`ncn-m001#`) Assign the images to the management nodes in BSS.

Master management nodes:

```bash
/usr/share/doc/csm/scripts/operations/node_management/assign-ncn-images.sh -m -p "$MASTER_IMAGE_ID"
```

Storage management nodes:

```bash
/usr/share/doc/csm/scripts/operations/node_management/assign-ncn-images.sh -s -p "$STORAGE_IMAGE_ID"
```

Worker management nodes:

```bash
/usr/share/doc/csm/scripts/operations/node_management/assign-ncn-images.sh -w -p "$WORKER_IMAGE_ID"
```
(`ncn-m001#`) Assign the CFS configuration to the management nodes.

This deliberately only sets the desired configuration of the components in CFS. It disables the components and does not clear their configuration states or error counts. When the nodes are rebooted to their new images later in the CSM upgrade, they will automatically be enabled in CFS, and node personalization will occur.

Get the xnames of the master and worker management nodes.

```bash
WORKER_XNAMES=$(cray hsm state components list --role Management --subrole Worker --type Node --format json |
    jq -r '.Components | map(.ID) | join(",")')
MASTER_XNAMES=$(cray hsm state components list --role Management --subrole Master --type Node --format json |
    jq -r '.Components | map(.ID) | join(",")')
echo "${MASTER_XNAMES},${WORKER_XNAMES}"
```

Apply the CFS configuration to master nodes and worker nodes using the xnames and CFS configuration name found in the previous steps.

```bash
/usr/share/doc/csm/scripts/operations/configuration/apply_csm_configuration.sh \
    --no-config-change --config-name "${KUBERNETES_CFS_CONFIG_NAME}" --no-enable --no-clear-err \
    --xnames ${MASTER_XNAMES},${WORKER_XNAMES}
```

Successful output will end with the following:

```text
All components updated successfully.
```
Get the xnames of the storage management nodes.

```bash
STORAGE_XNAMES=$(cray hsm state components list --role Management --subrole Storage --type Node --format json |
    jq -r '.Components | map(.ID) | join(",")')
echo "${STORAGE_XNAMES}"
```

Apply the CFS configuration to storage nodes using the xnames and CFS configuration name found in the previous steps.

```bash
/usr/share/doc/csm/scripts/operations/configuration/apply_csm_configuration.sh \
    --no-config-change --config-name "${STORAGE_CFS_CONFIG_NAME}" --no-enable --no-clear-err \
    --xnames ${STORAGE_XNAMES}
```

Successful output will end with the following:

```text
All components updated successfully.
```

Continue on to Stage 0.4.
Use this alternative if performing an upgrade of CSM on a CSM-only system with no other HPE Cray EX software products installed. This upgrade scenario is extremely uncommon in production environments.
(`ncn-m001#`) Generate a new CFS configuration for the management nodes.

This script creates a new CFS configuration that includes the CSM version in its name and applies it to the management nodes. This leaves the management node components in CFS disabled. They will be automatically enabled when they are rebooted at a later stage in the upgrade.

```bash
/usr/share/doc/csm/scripts/operations/configuration/apply_csm_configuration.sh \
    --no-enable --config-name management-${CSM_RELEASE}
```

Successful output should end with the following line:

```text
All components updated successfully.
```

Continue on to Stage 0.4.
To prevent any possibility of losing workload manager configuration data or files, a backup is required. Execute all backup procedures (for the workload manager in use) located in the Troubleshooting and Administrative Tasks sub-section of the Install a Workload Manager section of the *HPE Cray Programming Environment Installation Guide: CSM on HPE Cray EX*. The resulting backup data should be stored in a safe location off of the system.
If performing an upgrade of CSM and additional HPE Cray EX software products using the IUF, return to the Upgrade CSM and additional products with IUF procedure. Otherwise, if performing an upgrade of only CSM, proceed to the next step.
CSM v1.4.x -> CSM v1.4.4 patch: If you arrived here by following the CSM v1.4.x -> CSM v1.4.4 patch directions, then move on to Storage nodes in-place update. Users that arrived here while upgrading from CSM 1.3.x or earlier should continue on to Stage 0.5.

IMPORTANT: If performing an upgrade to CSM 1.4.0 or 1.4.1, then skip this step. This step should only be done during an upgrade to CSM 1.4 patch version 1.4.2 or later.
Note: This step may not be necessary if it was already completed by the CSM v1.3.5 patch. If it was already run, the following steps can be re-executed to verify that Ceph daemons are using images in Nexus and that the local Docker registries have been stopped.

These steps will upgrade Ceph to v16.2.13. Then the Ceph monitoring daemons' images will be pushed to Nexus, and the monitoring daemons will be redeployed so that they use these images in Nexus. Once this is complete, all Ceph daemons should be using images in Nexus and not images hosted in the local Docker registry on storage nodes. The third step stops the local Docker registry on all storage nodes.
(`ncn-m001#`) Run the Ceph upgrade to v16.2.13.

```bash
/usr/share/doc/csm/upgrade/scripts/ceph/ceph-upgrade-tool.py --version "v16.2.13"
```
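After the upgrade tool completes, one hedged way to confirm that all Ceph daemons report the expected release (assuming the Ceph client is configured on this node, as it normally is on master nodes) is to inspect the daemon version summary:

```bash
# All daemon entries should report 16.2.13 once the upgrade has converged.
ceph versions
```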
(`ncn-m001#`) Redeploy the Ceph monitoring daemons so that they use images in Nexus.

```bash
scp /usr/share/doc/csm/scripts/operations/ceph/redeploy_monitoring_stack_to_nexus.sh ncn-s001:/srv/cray/scripts/common/redeploy_monitoring_stack_to_nexus.sh
ssh ncn-s001 "/srv/cray/scripts/common/redeploy_monitoring_stack_to_nexus.sh"
```
(`ncn-m001#`) Stop the local Docker registries on all storage nodes.

```bash
scp /usr/share/doc/csm/scripts/operations/ceph/disable_local_registry.sh ncn-s001:/srv/cray/scripts/common/disable_local_registry.sh
ssh ncn-s001 "/srv/cray/scripts/common/disable_local_registry.sh"
```
### Smartmon Metrics on Storage NCNs

IMPORTANT: If performing an upgrade to CSM 1.4.0 or 1.4.1, then skip this step. This step should only be done during an upgrade to CSM 1.4 patch version 1.4.2 or later.

This step will install the `smart-mon` RPM on storage nodes and reconfigure `node-exporter` to provide `smartmon` metrics.

(`ncn-m001#`) Execute the following script.

```bash
/usr/share/doc/csm/scripts/operations/ceph/enable-smart-mon-storage-nodes.sh
```
If there is a need to adjust the default boot timeout (10 minutes), then `REBOOT_TIMEOUT_IN_SECONDS` can be added to `/etc/cray/upgrade/csm/myenv`. For example:

```bash
export CSM_ARTI_DIR=/etc/cray/upgrade/csm/csm-1.4.1/tarball/csm-1.4.1
export CSM_RELEASE=1.4.1
export CSM_REL_NAME=csm-1.4.1
...
REBOOT_TIMEOUT_IN_SECONDS=999
```
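One way to append the setting without opening an editor (a minimal sketch, assuming the `myenv` file already exists and adjusting the timeout value as needed):

```bash
echo "REBOOT_TIMEOUT_IN_SECONDS=999" >> /etc/cray/upgrade/csm/myenv
```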
For any typescripts that were started during this stage, stop them with the `exit` command.

This stage is complete. Continue to Stage 1 - Kubernetes Upgrade.