Reminders:
- CSM 1.2.0 or higher is required in order to upgrade to CSM 1.3.0.
- If any problems are encountered and the procedure or command output does not provide relevant guidance, see Relevant troubleshooting links for upgrade-related issues.
Stage 0 has several critical procedures which prepare the environment and verify that the environment is ready for the upgrade.
(ncn-m001#) If a typescript session is already running in the shell, then first stop it with the exit command.
(ncn-m001#) Start a typescript.
script -af /root/csm_upgrade.$(date +%Y%m%d_%H%M%S).stage_0.txt
export PS1='\u@\H \D{%Y-%m-%d} \t \w # '
If additional shells are opened during this procedure, then record those with typescripts as well. When resuming a procedure after a break, always be sure that a typescript is running before proceeding.
(ncn-m001#) Set the CSM_RELEASE variable to the target CSM version of this upgrade.
CSM_RELEASE=1.3.0
CSM_REL_NAME=csm-${CSM_RELEASE}
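As an optional sanity check before continuing, echo the variables to confirm they are set as intended:
echo "CSM_RELEASE=${CSM_RELEASE} CSM_REL_NAME=${CSM_REL_NAME}"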
Acquire the latest documentation and library RPMs for the target version of the CSM upgrade.
These may include updates, corrections, and enhancements that were not available until after the software release.
NOTE: CSM does NOT support the use of proxy servers for anything other than downloading artifacts from external endpoints. Using http proxies in any way other than the following examples will cause many failures in subsequent steps.
Check the version of the currently installed CSM documentation and CSM library.
rpm -q docs-csm
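The output has the form docs-csm-<version>-<release>.noarch; the exact version and release values depend on what is currently installed.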
Download and upgrade the latest documentation RPM and CSM library.
Without proxy:
wget "https://release.algol60.net/$(awk -F. '{print "csm-"$1"."$2}' <<< ${CSM_RELEASE})/docs-csm/docs-csm-latest.noarch.rpm" -O /root/docs-csm-latest.noarch.rpm
With https proxy:
https_proxy=https://example.proxy.net:443 wget "https://release.algol60.net/$(awk -F. '{print "csm-"$1"."$2}' <<< ${CSM_RELEASE})/docs-csm/docs-csm-latest.noarch.rpm" \
-O /root/docs-csm-latest.noarch.rpm
If this machine does not have direct internet access, then this RPM will need to be externally downloaded and copied to the system.
curl -O "https://release.algol60.net/$(awk -F. '{print "csm-"$1"."$2}' <<< ${CSM_RELEASE})/docs-csm/docs-csm-latest.noarch.rpm"
scp docs-csm-latest.noarch.rpm ncn-m001:/root
ssh ncn-m001
Install the documentation RPM.
rpm -Uvh --force /root/docs-csm-latest.noarch.rpm
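To confirm that the newer documentation RPM is now installed, the earlier version query can be repeated; the reported version should be newer than before.
rpm -q docs-csm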
(ncn-m001#) Create and mount an rbd device where the CSM release tarball can be stored.
This mounts the rbd device at /etc/cray/upgrade/csm on ncn-m001. This mount is available to stage content for the install/upgrade process.
For more information about the tool used in this procedure, including troubleshooting information, see CSM RBD Tool Usage.
Initialize the Python virtual environment.
tar xvf /usr/share/doc/csm/scripts/csm_rbd_tool.tar.gz -C /opt/cray/csm/scripts/
Check if the rbd device already exists.
source /opt/cray/csm/scripts/csm_rbd_tool/bin/activate
/usr/share/doc/csm/scripts/csm_rbd_tool.py --status
Expected output if the rbd device does not exist:
Pool csm_admin_pool does not exist
Pool csm_admin_pool exists: False
RBD device exists None
Example output if the rbd device already exists and is mounted on ncn-m002:
[{"id":"0","pool":"csm_admin_pool","namespace":"","name":"csm_scratch_img","snap":"-","device":"/dev/rbd0"}]
Pool csm_admin_pool exists: True
RBD device exists True
RBD device mounted at - ncn-m002.nmn:/etc/cray/upgrade/csm
Perform one of the following options, based on the output of the status check, and then verify the mount as shown after the options.
The rbd device does not exist.
Create and map the rbd device.
/usr/share/doc/csm/scripts/csm_rbd_tool.py --pool_action create --rbd_action create --target_host ncn-m001
deactivate
The rbd device exists.
Move the device to ncn-m001, if necessary.
This step is not necessary if the status output indicated that the device is already mounted on ncn-m001.
/usr/share/doc/csm/scripts/csm_rbd_tool.py --rbd_action move --target_host ncn-m001
deactivate
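Regardless of which option was used, a quick way to confirm that the rbd device is now mounted on ncn-m001 is to check the mount point (the exact device name in the output may vary):
df -h /etc/cray/upgrade/csm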
Remove leftover state file from a previous CSM upgrade, if necessary.
IMPORTANT: If upgrading from a CSM version that had previously mounted this rbd device, then the /etc/cray/upgrade/csm/myenv file must be removed before proceeding with this upgrade, because it contains information from the previous upgrade.
[[ -f /etc/cray/upgrade/csm/myenv ]] && rm -f /etc/cray/upgrade/csm/myenv
Follow either the Direct download or Manual copy procedure.
If the CSM release tar file is available at a URL endpoint that is accessible from ncn-m001, then the Direct download procedure may be used.
Otherwise, use the Manual copy procedure to copy the CSM release tar file to ncn-m001.
(ncn-m001#) Set the ENDPOINT variable to the URL of the directory containing the CSM release tar file.
In other words, the full URL to the CSM release tar file must be ${ENDPOINT}${CSM_REL_NAME}.tar.gz.
NOTE: This step is optional for Cray/HPE internal installs, if ncn-m001 can reach the internet.
ENDPOINT=https://put.the/url/here/
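Optionally, confirm that the full tarball URL is reachable from ncn-m001 before proceeding. This check assumes the web server responds to HEAD requests; if an http proxy is required, prefix the command with the proxy variable as shown in the next step.
curl -sSIf "${ENDPOINT}${CSM_REL_NAME}.tar.gz" -o /dev/null && echo "CSM release tarball URL is reachable"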
This step should ONLY be performed if an http proxy is required to access a public endpoint on the internet for the purpose of downloading artifacts.
CSM does NOT support the use of proxy servers for anything other than downloading artifacts from external endpoints.
The http proxy variables must be unset after the desired artifacts are downloaded. Failure to unset the http proxy variables after downloading artifacts will cause many failures in subsequent steps.
export https_proxy=https://example.proxy.net:443
export http_proxy=http://example.proxy.net:80
(ncn-m001#) Run the script.
NOTE: For Cray/HPE internal installs, if ncn-m001 can reach the internet, then the --endpoint argument may be omitted.
The prepare-assets.sh script will delete the CSM tarball (after expanding it) in order to free up space. This behavior can be overridden by appending the --no-delete-tarball-file argument to the prepare-assets.sh command below.
/usr/share/doc/csm/upgrade/scripts/upgrade/prepare-assets.sh --csm-version ${CSM_RELEASE} --endpoint "${ENDPOINT}"
This step must be performed if an http proxy was set previously.
unset https_proxy
unset http_proxy
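To confirm that no proxy variables remain set in the current shell, check the environment; the following command should produce no output:
env | grep -i proxy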
Skip the Manual copy subsection and proceed to Stage 0.2 - Prerequisites.
Copy the CSM release tar file to ncn-m001.
(ncn-m001#) Set the CSM_TAR_PATH variable to the full path to the CSM tar file on ncn-m001.
CSM_TAR_PATH=/path/to/${CSM_REL_NAME}.tar.gz
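Optionally, confirm that the tar file exists at that path and is readable before running the script:
ls -lh "${CSM_TAR_PATH}"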
(ncn-m001#) Run the script.
The prepare-assets.sh script will delete the CSM tarball (after expanding it) in order to free up space. This behavior can be overridden by appending the --no-delete-tarball-file argument to the prepare-assets.sh command below.
/usr/share/doc/csm/upgrade/scripts/upgrade/prepare-assets.sh --csm-version ${CSM_RELEASE} --tarball-file "${CSM_TAR_PATH}"
(ncn-m001#) Set the SW_ADMIN_PASSWORD environment variable.
Set it to the password for the admin user on the switches. This is needed for preflight tests within the check script.
NOTE: read -s is used to prevent the password from being written to the screen or the shell history.
read -s SW_ADMIN_PASSWORD
export SW_ADMIN_PASSWORD
(ncn-m001#) Set the NEXUS_PASSWORD variable only if needed.
IMPORTANT: If the password for the local Nexus admin account has been changed from the password set in the nexus-admin-credential secret (not typical), then set the NEXUS_PASSWORD environment variable to the correct admin password and export it, before running prerequisites.sh.
For example:
read -s is used to prevent the password from being written to the screen or the shell history.
read -s NEXUS_PASSWORD
export NEXUS_PASSWORD
Otherwise, the upgrade will try to use the password in the nexus-admin-credential secret and fail to upgrade Nexus.
(ncn-m001#) Run the script.
/usr/share/doc/csm/upgrade/scripts/upgrade/prerequisites.sh --csm-version ${CSM_RELEASE}
If the script ran correctly, it should end with the following output:
[OK] - Successfully completed
If the script does not end with this output, then try rerunning it. If it still fails, see Upgrade Troubleshooting. If the failure persists, then open a support ticket for guidance before proceeding.
(ncn-m001#) Unset the NEXUS_PASSWORD variable, if it was set in the earlier step.
unset NEXUS_PASSWORD
(Optional) (ncn-m001#) Commit changes to customizations.yaml.
customizations.yaml has been updated in this procedure. If using an external Git repository for managing customizations as recommended, then clone a local working tree and commit appropriate changes to customizations.yaml.
For example:
git clone <URL> site-init
cd site-init
kubectl -n loftsman get secret site-init -o jsonpath='{.data.customizations\.yaml}' | base64 -d - > customizations.yaml
git add customizations.yaml
git commit -m 'CSM 1.3 upgrade - customizations.yaml'
git push
There are two possible scenarios. Follow the procedure for the scenario that is applicable to the upgrade being performed.
While the names are similar, image customization is different from node personalization. Image customization is the process of using Ansible stored in VCS, in conjunction with the CFS and IMS microservices, to customize an image before it is booted. Node personalization is the process of using Ansible stored in VCS, in conjunction with the CFS and IMS microservices, to personalize a node after it has booted.
In most cases, administrators will be performing a standard upgrade and not a CSM-only system upgrade. In the standard upgrade, the new worker NCN images must be customized, and all NCNs must have their personalization configurations updated in CFS.
NOTE: For the standard upgrade, it will not be possible to rebuild NCNs on the current, pre-upgraded CSM version after performing these steps. Rebuilding NCNs will become the same thing as upgrading them.
Prepare the pre-boot worker NCN image customizations.
This will ensure that the CFS configuration layers are applied to perform image customization for the worker NCNs. See Worker Image Customization.
Prepare the post-boot NCN personalizations.
This will ensure that the appropriate CFS configuration layers are applied when performing post-boot node personalization of the master, storage, and worker NCNs. See NCN Node Personalization.
Continue on to Stage 0.4, skipping the CSM-only system upgrade subsection below.
This upgrade scenario is extremely uncommon in production environments.
(ncn-m001#) Generate a new CFS configuration for the NCNs.
This script will also leave CFS disabled for the NCNs. CFS will automatically be re-enabled on them as they are rebooted during the upgrade.
/usr/share/doc/csm/scripts/operations/configuration/apply_csm_configuration.sh --no-enable
Successful output should end with the following line:
All components updated successfully.
To prevent any possibility of losing workload manager configuration data or files, a backup is required. Execute all backup procedures (for the workload manager in use) located in the Troubleshooting and Administrative Tasks sub-section of the Install a Workload Manager section of the HPE Cray Programming Environment Installation Guide: CSM on HPE Cray EX. The resulting backup data should be stored in a safe location off of the system.
The current Postgres opt-in backups need to be re-generated to fix a known issue.
(ncn-m001#) Load the updated cray-postgres-db-backup image into the Nexus local registry.
NOTE: This step is only necessary if upgrading to CSM 1.3.0. If upgrading to CSM 1.3.1 (estimated to be released the second week of January, 2023), proceed to Step 2.
If ncn-m001 has internet access, then use the following commands.
NEXUS_USERNAME="$(kubectl -n nexus get secret nexus-admin-credential --template {{.data.username}} | base64 -d)"
NEXUS_PASSWORD="$(kubectl -n nexus get secret nexus-admin-credential --template {{.data.password}} | base64 -d)"
podman run --rm --network host quay.io/skopeo/stable copy --dest-tls-verify=false --dest-creds "${NEXUS_USERNAME}:${NEXUS_PASSWORD}" \
docker://artifactory.algol60.net/csm-docker/stable/cray-postgres-db-backup:0.2.3 \
docker://registry.local/artifactory.algol60.net/csm-docker/stable/cray-postgres-db-backup:0.2.3
Otherwise, use the following procedure.
Save the image to a tar file from a system that does have access to the internet.
podman pull docker://artifactory.algol60.net/csm-docker/stable/cray-postgres-db-backup:0.2.3
podman save -o cray-postgres-db-backup.tar artifactory.algol60.net/csm-docker/stable/cray-postgres-db-backup:0.2.3
Copy the cray-postgres-db-backup.tar to the target system under /root.
Copy the tar file into the local registry on the target system:
NEXUS_USERNAME="$(kubectl -n nexus get secret nexus-admin-credential --template {{.data.username}} | base64 -d)"
NEXUS_PASSWORD="$(kubectl -n nexus get secret nexus-admin-credential --template {{.data.password}} | base64 -d)"
podman run --rm --network host -v /root:/mnt quay.io/skopeo/stable copy --dest-tls-verify=false --dest-creds "${NEXUS_USERNAME}:${NEXUS_PASSWORD}" \
docker-archive:/mnt/cray-postgres-db-backup.tar docker://registry.local/artifactory.algol60.net/csm-docker/stable/cray-postgres-db-backup:0.2.3
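In either case, the upload can be verified by inspecting the image in the local registry with skopeo, using the same Nexus credentials as above:
podman run --rm --network host quay.io/skopeo/stable inspect --tls-verify=false --creds "${NEXUS_USERNAME}:${NEXUS_PASSWORD}" \
docker://registry.local/artifactory.algol60.net/csm-docker/stable/cray-postgres-db-backup:0.2.3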
(ncn-m001#) Regenerate the Postgres backups.
/usr/share/doc/csm/upgrade/scripts/k8s/create_new_postgres_backups.sh
Successful output should end with the following line:
Postgres backup(s) have been successfully regenerated.
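One way to check that the regenerated backup objects exist is to list the S3 bucket that holds them; this sketch assumes the cray CLI is initialized and that the backups are stored in a bucket named postgres-backup:
cray artifacts list postgres-backup --format json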
For any typescripts that were started during this stage, stop them with the exit command.
This stage is completed. Continue to Stage 1 - Ceph image upgrade.