This document guides an administrator through the patch update to Cray Systems Management v1.4.4 from an earlier patch version of CSM v1.4. If upgrading from CSM v1.3.x, follow the procedures described in CSM major/minor version upgrade instead.
In the unusual situation of upgrading from a pre-release version of CSM v1.4.0, also follow the procedures described in CSM major/minor version upgrade instead.
If more recent CSM v1.4 patch versions are available, there is no need to perform intermediate CSM v1.4 patch upgrades. Instead, consider upgrading to the latest CSM v1.4 patch release. See CSM patch version upgrade for the full list of patch versions.
CVE-2023-48795 for SAT
(qedr) for increased NCN stability when using Marvell/QLogic FastLinQ Ethernet adapters
(metal-ipxe)
cray-dns-unbound fix for leaving existing configuration in place if new configuration fails to load
CAST-34705: Fixed an issue where cray-hms-hmcollector-ingress would fail parsing data from FabricHealth
CAST-35268: Fixed an issue where BOS warnings would be displayed in SAT status output
smartmon metrics on storage NCNs
Validate CSM health.
See Validate CSM Health.
Run the CSM health checks to ensure that everything is working properly before the upgrade starts. After the upgrade is completed, another health check is performed. It is important to know if any problems observed at that time existed prior to the upgrade.
IMPORTANT: See the CSM Install Validation and Health Checks procedures in the documentation for the CURRENT CSM version on the system. The validation procedures in the CSM documentation are only intended to work with that specific version of CSM.
(ncn-m001#) Start a typescript on ncn-m001 to capture the commands and output from this procedure.
script -af csm-update.$(date +%Y-%m-%d).txt
export PS1='\u@\H \D{%Y-%m-%d} \t \w # '
Download and install/upgrade the latest documentation on ncn-m001.
(ncn-m001#) Run the script to create a cephfs file share at /etc/cray/upgrade/csm.
This script creates a new cephfs file share and unmounts the rbd device that may have been used in a previous version of CSM (if detected). Running this script is a one-time step needed only on the master node where the upgrade is being initiated (ncn-m001).
If a previous rbd mount is detected at /etc/cray/upgrade/csm, that content will be remounted and available at /mnt/csm-1.3-rbd.
/usr/share/doc/csm/scripts/mount-cephfs-share.sh
Download and extract the CSM v1.4.4 release to ncn-m001.
Change into the cephfs file share.
cd /etc/cray/upgrade/csm/
Follow the directions for Download and Extract CSM Product Release.
(ncn-m001#) Set CSM_DISTDIR to the directory of the extracted files.
IMPORTANT If necessary, change this command to match the actual location of the extracted files.
NOTE CSM_RELEASE is set during the Download and Extract CSM Product Release guide.
export CSM_RELEASE_VERSION="$CSM_RELEASE"
export CSM_DISTDIR="$(pwd)/csm-${CSM_RELEASE}"
echo "${CSM_DISTDIR}"
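Because later steps run scripts relative to CSM_DISTDIR, it can help to confirm that the directory really is an extracted release before continuing. The following is a minimal sketch, not part of the CSM tooling: the function name check_distdir is hypothetical, and the list of files it checks is an assumption based only on the scripts this procedure invokes from that directory.

```shell
# Sketch: confirm that a candidate CSM_DISTDIR looks like an extracted release.
# The file list is an assumption taken from the scripts this procedure runs.
check_distdir() {
    local dir="$1"
    local f
    local missing=0
    if [ ! -d "$dir" ]; then
        echo "ERROR: $dir is not a directory" >&2
        return 1
    fi
    for f in lib/setup-nexus.sh upgrade.sh update-mgmt-ncn-cfs-config.sh; do
        if [ ! -f "$dir/$f" ]; then
            echo "ERROR: missing $dir/$f" >&2
            missing=1
        fi
    done
    # Returns 0 only when every expected file was found.
    return "$missing"
}

# Example use on ncn-m001 after setting CSM_DISTDIR:
#   check_distdir "${CSM_DISTDIR}" && echo "CSM_DISTDIR looks complete"
```

If the check fails, adjust CSM_DISTDIR as described in the IMPORTANT note above rather than proceeding.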
(ncn-m001#) Run lib/setup-nexus.sh to configure Nexus and upload new CSM RPM repositories, container images, and Helm charts:
cd "$CSM_DISTDIR" && ./lib/setup-nexus.sh
echo "RC=$?"
On success, the output should end with the following:
+ Nexus setup complete
setup-nexus.sh: OK
RC=0
In the event of an error, consult Troubleshoot Nexus to resolve potential problems and then try running setup-nexus.sh again. Note that subsequent runs of setup-nexus.sh may report FAIL when uploading duplicate assets. This is okay as long as setup-nexus.sh outputs setup-nexus.sh: OK and exits with status code 0.
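Since FAIL lines are expected on reruns, the success criteria are the final OK line and the exit status, not the absence of FAIL. As an illustration only, the sketch below applies exactly those two criteria to a saved log. The function name nexus_setup_ok and the log file name are hypothetical.

```shell
# Sketch: decide whether a setup-nexus.sh run succeeded, given its saved
# output and its exit status. FAIL lines from duplicate uploads are ignored.
nexus_setup_ok() {
    local logfile="$1"
    local rc="$2"
    # The script must have exited with status code 0 ...
    [ "$rc" -eq 0 ] || return 1
    # ... and must have printed the OK line.
    grep -q '^setup-nexus\.sh: OK$' "$logfile"
}

# Example use in bash (the log file name is a placeholder):
#   ./lib/setup-nexus.sh 2>&1 | tee setup-nexus.log
#   nexus_setup_ok setup-nexus.log "${PIPESTATUS[0]}" && echo "Nexus setup succeeded"
```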
(ncn-m001#) Run the following script in preparation for the 1.4.4 patch upgrade:
function run_cmd {
    "$@" && return 0 || echo "ERROR: Command failed with rc $?: $*" >&2 ; return 1
}
for c in $(kubectl get crd | grep argo | cut -d' ' -f1); do
    run_cmd kubectl label --overwrite crd "$c" app.kubernetes.io/managed-by="Helm" || break
    run_cmd kubectl annotate --overwrite crd "$c" meta.helm.sh/release-name="cray-nls" || break
    run_cmd kubectl annotate --overwrite crd "$c" meta.helm.sh/release-namespace="argo" || break
done
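The loop above labels and annotates every CRD whose name contains argo so that it carries Helm ownership metadata for the cray-nls release in the argo namespace. To preview which CRDs would be touched before running it, a dry-run helper along the following lines could be used. This is a sketch only: the function name preview_argo_crd_commands is hypothetical, and it only prints commands without running them.

```shell
# Sketch: print (without running) the kubectl commands that the preparation
# loop would issue for each Argo CRD. Input is `kubectl get crd` output on
# stdin; only the first column (the CRD name) is used.
preview_argo_crd_commands() {
    local crd
    local rest
    while read -r crd rest; do
        case "$crd" in
            *argo*) ;;          # keep names containing "argo"
            *) continue ;;      # skip the header line and unrelated CRDs
        esac
        echo "kubectl label --overwrite crd $crd app.kubernetes.io/managed-by=Helm"
        echo "kubectl annotate --overwrite crd $crd meta.helm.sh/release-name=cray-nls"
        echo "kubectl annotate --overwrite crd $crd meta.helm.sh/release-namespace=argo"
    done
}

# Example use on ncn-m001:
#   kubectl get crd | preview_argo_crd_commands
```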
(ncn-m001#) Run upgrade.sh to deploy upgraded CSM applications and services:
cd "$CSM_DISTDIR" && ./upgrade.sh
It is important to upload NCN images to IMS and to edit the cray-product-catalog. This is necessary when updating products with IUF. If this step is skipped, IUF will fail when updating or upgrading products in the future.
(ncn-m001#) Execute the script to upload CSM NCN images and update the cray-product-catalog.
/usr/share/doc/csm/upgrade/scripts/upgrade/upload-ncn-images.sh
Note: This step is not necessary if it was already completed by the CSM v1.3.5 patch, CSM v1.4.2 patch, or CSM v1.4.3 patch. If it was already run, the following steps can be re-executed to verify that Ceph daemons are using images in Nexus and the local Docker registries have been stopped.
These steps will upgrade Ceph to v16.2.13. Then the Ceph monitoring daemons’ images will be pushed to Nexus and the monitoring daemons will be redeployed so that they use these images in Nexus. Once this is complete, all Ceph daemons should be using images in Nexus and not images hosted in the local Docker registry on storage nodes. The third step stops the local Docker registry on all storage nodes.
(ncn-m001#) Run Ceph upgrade to v16.2.13.
/usr/share/doc/csm/upgrade/scripts/ceph/ceph-upgrade-tool.py --version "v16.2.13"
(ncn-m001#) Redeploy Ceph monitoring daemons so they are using images in Nexus.
scp /usr/share/doc/csm/scripts/operations/ceph/redeploy_monitoring_stack_to_nexus.sh ncn-s001:/srv/cray/scripts/common/redeploy_monitoring_stack_to_nexus.sh
ssh ncn-s001 /srv/cray/scripts/common/redeploy_monitoring_stack_to_nexus.sh
(ncn-m001#) Stop the local Docker registries on all storage nodes.
scp /usr/share/doc/csm/scripts/operations/ceph/disable_local_registry.sh ncn-s001:/srv/cray/scripts/common/disable_local_registry.sh
ssh ncn-s001 /srv/cray/scripts/common/disable_local_registry.sh
smartmon metrics on storage NCNs
This step will install the smart-mon RPM on storage nodes and reconfigure the node-exporter to provide smartmon metrics.
(ncn-m001#) Execute the following script.
/usr/share/doc/csm/scripts/operations/ceph/enable-smart-mon-storage-nodes.sh
This step updates the CFS configuration which is set as the desired configuration for the management nodes (NCNs). It ensures that the CFS configuration layers reference the correct commit hash for the version of CSM being installed. It then waits for the components to reach a configured state in CFS.
(ncn-m001#) Update CFS configuration.
cd "$CSM_DISTDIR"
./update-mgmt-ncn-cfs-config.sh --base-query role=management \
--save --create-backups --clear-error
The output will look similar to the truncated output shown below.
INFO: Querying CFS configurations for the following NCNs: x3000c0s5b0n0, ...
INFO: Found configuration "management-csm-1.4.0" for component x3000c0s5b0n0
...
INFO: Updating existing layer with repo path /vcs/cray/csm-config-management.git and playbook site.yml
INFO: Property "commit" of layer with repo path /vcs/cray/csm-config-management.git and playbook site.yml updated ...
INFO: Property "name" of layer with repo path /vcs/cray/csm-config-management.git and playbook site.yml updated ...
INFO: No layer with repo path /vcs/cray/csm-config-management.git and playbook ncn-initrd.yml found.
INFO: Adding a layer with repo path /vcs/cray/csm-config-management.git and playbook ncn-initrd.yml to the end.
INFO: Successfully saved CFS configuration "management-csm-1.4.0-backup-20230918T205149"
INFO: Successfully saved CFS configuration "management-csm-1.4.0"
INFO: Successfully saved 1 changed CFS configuration(s) to CFS.
INFO: Updated 9 CFS components.
INFO: Waiting for 9 component(s) to finish configuration
INFO: Summary of number of components in each status: pending: 9
INFO: Waiting for 9 pending component(s)
INFO: Sleeping for 30 seconds before checking status of 9 pending component(s).
...
INFO: Sleeping for 30 seconds before checking status of 9 pending component(s).
INFO: 9 pending components transitioned to status configured: x3000c0s5b0n0, ...
INFO: Finished waiting for 9 component(s) to finish configuration.
INFO: Summary of number of components in each status: configured: 9
====> Completed update of CFS configuration(s)
====> Cleaning up install dependencies
When configuration of all components is successful, the summary line will show all components with status “configured”.
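If the output of the update script was captured to a file, the final summary line can be checked mechanically. The sketch below assumes the summary line has the form shown in the example output above. The function name cfs_all_configured and the log file name are hypothetical, and any summary that lists a status other than configured is treated as not yet successful.

```shell
# Sketch: succeed only if the last status summary line in the saved output
# lists "configured" as the sole status.
cfs_all_configured() {
    local logfile="$1"
    local summary
    summary=$(grep 'Summary of number of components in each status:' "$logfile" | tail -n 1)
    [ -n "$summary" ] || return 1
    # Strip everything up to and including the fixed prefix.
    summary=${summary#*in each status: }
    # Anything other than exactly "configured: <count>" counts as a failure.
    printf '%s\n' "$summary" | grep -Eq '^configured: [0-9]+$'
}

# Example use (the log file name is a placeholder):
#   cfs_all_configured cfs-update.log && echo "All components configured"
```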
NCN images must be rebuilt at this time in order to acquire an important kernel panic mitigation. The mitigation entails a new kernel and networking drivers for SP4 images (Kubernetes), as well as the blacklisting of the QLogic RDMA driver for SP4 and SP3 (all NCNs).
Although both the Kubernetes and storage (Ceph) images are rebuilt, only Kubernetes nodes will undergo a rolling rebuild. Storage (Ceph) nodes will receive an in-place modification and do not need to be rebuilt at this time.
IMPORTANT Unlike previous version bumps of this kind, this one includes a rolling rebuild. Any system administration data living on master and worker nodes will be wiped during the rebuild. Administrators are advised to take backups of their local site files.
Examples:
~/.config/sat/sat
/etc/motd
/etc/sudoers
/home
/root/.ssh/config
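One way to take such a backup is to archive the listed paths with tar before the rebuild starts. The following is a minimal sketch rather than a supported CSM tool: the function name backup_site_files and the archive location are hypothetical, and the archive must be stored somewhere that survives the rebuild of the node.

```shell
# Sketch: archive whichever of the given paths exist into a gzip-compressed
# tarball, warning about (and skipping) paths that are missing.
backup_site_files() {
    local archive="$1"
    local p
    local n
    shift
    n=$#
    # Rotate through the arguments, re-appending only the paths that exist.
    while [ "$n" -gt 0 ]; do
        p="$1"
        shift
        if [ -e "$p" ]; then
            set -- "$@" "$p"
        else
            echo "WARNING: skipping missing path $p" >&2
        fi
        n=$((n - 1))
    done
    if [ "$#" -eq 0 ]; then
        echo "ERROR: nothing to back up" >&2
        return 1
    fi
    tar -czf "$archive" "$@"
}

# Example use on a master or worker node (the destination is a placeholder):
#   backup_site_files /path/to/safe/location/site-files.tgz \
#       ~/.config/sat/sat /etc/motd /etc/sudoers /home /root/.ssh/config
```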
Print the product catalog ConfigMap.
kubectl -n services get cm cray-product-catalog -o jsonpath='{.data}' | jq '. | keys'
Example outputs:
CSM running with additional products:
[
"HFP-firmware",
"analytics",
"cos",
"cos-base",
"cpe",
"cpe-aarch64",
"cray-sdu-rda",
"csm",
"csm-diags",
"hfp",
"hpc-csm-software-recipe",
"pbs",
"sat",
"sle-os-backports-15-sp3",
"sle-os-backports-15-sp4",
"sle-os-backports-sle-15-sp3-x86_64",
"sle-os-backports-sle-15-sp4-x86_64",
"sle-os-backports-sle-15-sp5-aarch64",
"sle-os-backports-sle-15-sp5-x86_64",
"sle-os-products-15-sp3",
"sle-os-products-15-sp3-x86_64",
"sle-os-products-15-sp4",
"sle-os-products-15-sp4-x86_64",
"sle-os-products-15-sp5-aarch64",
"sle-os-products-15-sp5-x86_64",
"sle-os-updates-15-sp3",
"sle-os-updates-15-sp3-x86_64",
"sle-os-updates-15-sp4",
"sle-os-updates-15-sp4-x86_64",
"sle-os-updates-15-sp5-aarch64",
"sle-os-updates-15-sp5-x86_64",
"slingshot",
"slingshot-host-software",
"slurm",
"sma",
"uan",
"uss"
]
CSM on a CSM-only system:
[
"csm"
]
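The distinction between the two outputs can also be made by a script. The sketch below reads product names one per line and reports whether csm is the only entry. The function name is_csm_only is hypothetical, and the jq filter in the usage comment is an assumption about how to turn the ConfigMap data into one key per line.

```shell
# Sketch: read product catalog keys (one per line) from stdin and succeed
# only if "csm" is the single product present.
is_csm_only() {
    local name
    local count=0
    local has_csm=0
    while read -r name; do
        [ -n "$name" ] || continue      # ignore blank lines
        count=$((count + 1))
        if [ "$name" = "csm" ]; then
            has_csm=1
        fi
    done
    [ "$count" -eq 1 ] && [ "$has_csm" -eq 1 ]
}

# Example use on ncn-m001:
#   kubectl -n services get cm cray-product-catalog -o jsonpath='{.data}' \
#       | jq -r 'keys[]' | is_csm_only && echo "CSM-only system"
```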
Choose one of the following options based on the output from the previous step.
NOTE For CSM-only systems, skip this step and continue on to Storage nodes in-place update.
Follow the directions in Stage 0.4.
In lieu of rebuilding the storage nodes, they will be live patched.
(ncn-m001#) Unload and blacklist the QLogic RDMA qedr driver.
/usr/share/doc/csm/upgrade/1.4.4/scripts/storage-in-place-patch.sh
(ncn-m001#) Verify that qedr is no longer loaded.
pdsh -b -w $(grep -oP 'ncn-s\d+' /etc/hosts | sort -u | tr -t '\n' ',') \
'lsmod | grep -Eo '\''^qedr'\'' || echo OK' \
| dshbak -c
Expected output:
----------------
ncn-s[001-003]
----------------
OK
(ncn-m001#) Set environment variables.
NOTE This relies on variables set during preparation.
export CSM_REL_NAME="csm-${CSM_RELEASE}"
export CSM_ARTI_DIR="${CSM_DISTDIR}"
(ncn-m001#) Set/update re-usable environment variables.
sed -i '/^export CSM_ARTI_DIR=.*/d' /etc/cray/upgrade/csm/myenv
echo "export CSM_ARTI_DIR=$CSM_ARTI_DIR" >>/etc/cray/upgrade/csm/myenv
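The two commands above follow a delete-then-append pattern: any existing export line for the variable is removed before the new one is written, so repeating the step never leaves duplicate entries in myenv. Generalized as a sketch, with the hypothetical function name set_persistent_var and assuming GNU sed for the -i option:

```shell
# Sketch: idempotently record "export NAME=VALUE" in an environment file.
# Any earlier export line for the same variable is deleted first.
set_persistent_var() {
    local file="$1"
    local name="$2"
    local value="$3"
    touch "$file"
    sed -i "/^export ${name}=.*/d" "$file"
    echo "export ${name}=${value}" >> "$file"
}

# Example use, equivalent to the two commands above:
#   set_persistent_var /etc/cray/upgrade/csm/myenv CSM_ARTI_DIR "$CSM_ARTI_DIR"
```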
(ncn-m001#) Ensure cray-site-init is installed, using the latest version provided by the CSM tarball.
zypper install -y cray-site-init
Proceed with the following sections from Stage 1:
(ncn-m001#) Verify the booted images match the expected output.
pdsh -b -w $(grep -oP 'ncn-\w\d+' /etc/hosts | sort -u | tr -t '\n' ',') '
rpm -q kernel-default
rpm -q qlgc-fastlinq-kmp-default
grep -q qedr /etc/modprobe.d/disabled-modules.conf 2>/dev/null && echo "OK - rootfs blacklist" || echo "NOT OK - rootfs blacklist"
grep -q qedr /etc/dracut.conf.d/99-csm-ansible.conf 2>/dev/null && echo "OK - initrd blacklist" || echo "NOT OK - initrd blacklist"
lsmod | grep -qoE '\''^qedr'\'' && echo "NOT OK - qedr loaded" || echo "OK - no qedr"
lsinitrd /metal/recovery/boot/initrd.img.xz | grep -q '\''qedr'\'' && echo "NOT OK - initrd has qedr" || echo "OK - initrd no qedr"
' | dshbak -c
Expected output:
----------------
ncn-m[001-003],ncn-w[001-005]
----------------
kernel-default-5.14.21-150400.24.100.2.27359.1.PTF.1215587.x86_64
qlgc-fastlinq-kmp-default-8.74.1.0_k5.14.21_150400.22-1.sles15sp4.x86_64
OK - rootfs blacklist
OK - initrd blacklist
OK - no qedr
OK - initrd no qedr
----------------
ncn-s[001-003]
----------------
kernel-default-5.3.18-150300.59.87.1.x86_64
package qlgc-fastlinq-kmp-default is not installed
OK - rootfs blacklist
OK - initrd blacklist
OK - no qedr
OK - initrd no qedr
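Because every failed check in the command above prints a line containing NOT OK, a saved copy of the report can be scanned for that string instead of being compared by eye. A minimal sketch follows. The function name check_ncn_report and the report file name are hypothetical, and the kernel and driver package versions still need to be compared against the expected output manually.

```shell
# Sketch: scan a saved verification report for failed checks.
# Prints each offending line with its line number and returns non-zero
# if any "NOT OK" line is present.
check_ncn_report() {
    local report="$1"
    if grep -n 'NOT OK' "$report"; then
        echo "ERROR: one or more NCN checks failed" >&2
        return 1
    fi
    echo "All NCN checks reported OK"
}

# Example use: append "| tee ncn-verify.txt" to the pdsh pipeline above, then
#   check_ncn_report ncn-verify.txt
```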
(ncn-m001#) Update select RPMs on the NCNs.
/usr/share/doc/csm/upgrade/scripts/upgrade/util/upgrade-test-rpms.sh
Verify that the new CSM version is in the product catalog.
(ncn-m001#) Verify that the new CSM version is listed in the output of the following command:
kubectl get cm cray-product-catalog -n services -o jsonpath='{.data.csm}' | yq r -j - | jq -r 'to_entries[] | .key' | sort -V
Example output that includes the new CSM version (1.4.4):
0.9.2
0.9.3
0.9.4
0.9.5
0.9.6
1.0.1
1.0.10
1.2.0
1.2.1
1.2.2
1.3.0
1.3.1
1.4.0
1.4.1
1.4.2
1.4.3
1.4.4
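On systems with a long upgrade history the list can be lengthy, so an exact-match test may be easier than reading it. The sketch below checks whether a given version appears as a whole line in the list. The function name version_listed is hypothetical.

```shell
# Sketch: succeed if the wanted version appears as an exact, whole-line
# match in the version list read from stdin (so 1.4.40 does not match 1.4.4).
version_listed() {
    local want="$1"
    grep -Fxq -- "$want"
}

# Example use, appended to the command shown above:
#   ... | sort -V | version_listed 1.4.4 && echo "CSM 1.4.4 is in the product catalog"
```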
Confirm that the product catalog has an accurate timestamp for the CSM upgrade.
(ncn-m001#) Confirm that the import_date reflects the timestamp of the upgrade.
kubectl get cm cray-product-catalog -n services -o jsonpath='{.data.csm}' | yq r - '"1.4.4".configuration.import_date'
(ncn-m001#) Execute the following script to take a manual backup of the Etcd clusters.
/usr/share/doc/csm/scripts/operations/etcd/take-etcd-manual-backups.sh post_patch
These clusters are automatically backed up every 24 hours, but taking a manual backup at this stage in the upgrade enables restoring from backup later in this process if needed.
(ncn-m001#) Remember to exit the typescript that was started at the beginning of the upgrade.
exit
NOTE It is recommended to save the typescript file for later reference.