This procedure covers patching CVE-2021-22555 and CVE-2021-33909 on Shasta V1.4.X (and upgrades CSM to v0.9.5).
These special directions are only for Linux dependencies, such as the kernel and internal packages compiled against the kernel.
A high-level overview of the procedure is as follows:
Procedures:
Start a typescript to capture the commands and output from this procedure.
ncn-m001# script -af csm-update.$(date +%Y-%m-%d).txt
ncn-m001# export PS1='\u@\H \D{%Y-%m-%d} \t \w # '
NOTE: Installed CSM versions may be listed from the product catalog using:
ncn-m001# kubectl get cm cray-product-catalog -n services -o jsonpath='{.data.csm}' | yq r -j - | jq -r 'to_entries[] | .key' | sort -V
0.9.2
0.9.3
0.9.4
Set CSM_DISTDIR
to the directory of the extracted release distribution for CSM 0.9.5:
NOTE: Use the --no-same-owner and --no-same-permissions options to tar when extracting a CSM release distribution as root so that the current umask value is applied.
If using a release distribution:
ncn-m001# tar --no-same-owner --no-same-permissions -zxvf csm-0.9.5.tar.gz
ncn-m001# export CSM_DISTDIR="$(pwd)/csm-0.9.5"
Set CSM_RELEASE_VERSION to the version reported by ${CSM_DISTDIR}/lib/version.sh:
ncn-m001# CSM_RELEASE_VERSION="$(${CSM_DISTDIR}/lib/version.sh --version)"
ncn-m001# echo $CSM_RELEASE_VERSION
Install/upgrade CSI.
linux# rpm -Uvh --force ${CSM_DISTDIR}/rpm/cray/csm/sle-15sp2/x86_64/cray-site-init-*.x86_64.rpm
When installing or upgrading CSI, the following error may appear. It can be safely ignored.
rm: cannot remove ‘/usr/bin/sic’: No such file or directory
Download and install/upgrade the latest documentation RPM. If this machine does not have direct internet access, the RPM must be downloaded externally and copied to this machine before it can be installed.
ncn-m001# rpm -Uvh https://storage.googleapis.com/csm-release-public/shasta-1.4/docs-csm/docs-csm-latest.noarch.rpm
Set CSM_SCRIPTDIR
to the scripts directory included in the docs-csm RPM for the CSM 0.9.5 patch:
ncn-m001# export CSM_SCRIPTDIR=/usr/share/doc/metal/upgrade/0.9/csm-0.9.5/scripts
It is important to first verify a healthy starting state. To do this, run the CSM validation checks. If any problems are found, correct them and verify the appropriate validation checks before proceeding.
Run lib/remove-service-repos.sh
to remove repositories that are external to the system.
ncn-m001# ${CSM_SCRIPTDIR}/remove-service-repos.sh
The run-patch.sh script expects the TOKEN environment variable to be set. Either set it to a valid token of your choosing or get a new one using the following:
ncn-m001# export TOKEN=$(curl -k -s -S -d grant_type=client_credentials \
-d client_id=admin-client \
-d client_secret=`kubectl get secrets admin-client-auth -o jsonpath='{.data.client-secret}' | base64 -d` \
https://api-gw-service-nmn.local/keycloak/realms/shasta/protocol/openid-connect/token | jq -r '.access_token')
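Before running the patch script, it may be worth confirming that the token request actually returned something. The following is a minimal sketch, not part of the documented procedure: the function name token_looks_valid is hypothetical, and it only checks that the value is non-empty and is not the literal string "null", which is what jq -r prints when the .access_token field is absent from the response.

```shell
# Hypothetical helper (not shipped with CSM): sanity-check a token string.
# It does NOT contact Keycloak or verify that the token is accepted.
token_looks_valid() {
    t="$1"
    # jq -r prints the literal string "null" when .access_token is missing,
    # so treat both an empty value and "null" as failures.
    [ -n "$t" ] && [ "$t" != "null" ]
}

# Usage sketch:
#   token_looks_valid "$TOKEN" || echo "TOKEN was not obtained; check the curl output" >&2
```

This only catches the common case where the curl request failed and TOKEN was silently set to an unusable value.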
The script also expects that the Cray CLI is configured and authenticated. Please see Initialize cray CLI for more information on how to do this.
Run the run-patch.sh script. It performs the following actions:
Updates all the NCNs via zypper to have the latest patched packages.
Patches the kernel/initrd/squash image to have the correctly patched assets.
Applies a pod priority to essential deployments to ensure that they are scheduled when rebooting the NCNs.
This step requires that the latest SUSE updates tarball has been extracted and installed (i.e., synced with Nexus). If this has not been done yet, see the section “Install SLE for V1.4.2A-security0821 Patch” in the main patch README.
ncn-m001# "${CSM_SCRIPTDIR}/run-patch.sh"
DO NOT REBOOT: The zypper commands issued by the run-patch.sh script may indicate at several points during the script run that a reboot is needed. Do not reboot the NCNs yet; the reboot happens in a later step.
Reference the Reboot NCNs procedure.
Start a typescript to capture the commands and output from this procedure.
ncn-m001# script -af csm-update-post-reboot.$(date +%Y-%m-%d).txt
ncn-m001# export PS1='\u@\H \D{%Y-%m-%d} \t \w # '
Set CSM_DISTDIR to the location of the extracted csm-0.9.5 tarball. If the tarball was not extracted to ~/csm-0.9.5, provide the alternative path instead.
ncn-m001# CSM_DISTDIR=~/csm-0.9.5
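Because the remaining steps run scripts relative to this directory, a quick check that it points at a real extraction can save a confusing failure later. The sketch below is an assumption-laden illustration: check_distdir is a hypothetical helper, and the file list is simply the set of scripts that this procedure refers to (lib/list-ncns.sh, lib/setup-nexus.sh, lib/version.sh, upgrade.sh), not an authoritative manifest of the release.

```shell
# Hypothetical helper: confirm that the scripts this procedure refers to
# exist under the given distribution directory.
check_distdir() {
    dir="$1"
    missing=0
    for f in lib/list-ncns.sh lib/setup-nexus.sh lib/version.sh upgrade.sh; do
        if [ ! -f "$dir/$f" ]; then
            # Report each missing file on stderr and remember the failure
            echo "missing: $dir/$f" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Usage sketch:
#   check_distdir "$CSM_DISTDIR" || echo "CSM_DISTDIR does not look like an extracted release" >&2
```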
Once the system has booted, verify that the new kernel is running on each NCN. The reported version should be 5.3.18-24.75-default, which is the version of the kernel that addresses the CVEs.
ncn-m001# cd "$CSM_DISTDIR"
ncn-m001# pdsh -w $(./lib/list-ncns.sh| paste -sd,) "uname -r"
+ Getting admin-client-auth secret
+ Obtaining access token
+ Querying SLS
ncn-s003: 5.3.18-24.75-default
ncn-s002: 5.3.18-24.75-default
ncn-s001: 5.3.18-24.75-default
ncn-m001: 5.3.18-24.75-default
ncn-m002: 5.3.18-24.75-default
ncn-m003: 5.3.18-24.75-default
ncn-w003: 5.3.18-24.75-default
ncn-w001: 5.3.18-24.75-default
ncn-w002: 5.3.18-24.75-default
Alternatively, log in to each NCN and run the following command to get the currently running kernel version.
ncn# uname -r
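On larger systems it can be easier to let a script compare the kernel versions than to read the pdsh output by eye. The following is a minimal sketch under stated assumptions: check_kernels is a hypothetical helper, and it assumes the "hostname: version" line format shown in the pdsh output above. Lines that do not start with an NCN hostname (such as the "+ Querying SLS" progress lines) are ignored.

```shell
# Hypothetical helper: read pdsh-style "host: kernel" lines on stdin and
# report every host whose kernel differs from the expected version.
# Exit status: 0 = all match, 1 = at least one mismatch, 2 = no NCN lines seen.
check_kernels() {
    awk -v want="$1" '
        # Only consider lines of the form "ncn-<letter><digits>: <version>"
        /^ncn-[a-z][0-9]+: / {
            seen++
            host = $1
            sub(/:$/, "", host)          # strip the trailing colon from the host
            if ($2 != want) {
                printf "MISMATCH %s %s\n", host, $2
                bad++
            }
        }
        END {
            if (seen == 0) { print "no NCN lines found"; exit 2 }
            exit (bad > 0) ? 1 : 0
        }'
}

# Usage sketch (run from "$CSM_DISTDIR"):
#   pdsh -w $(./lib/list-ncns.sh | paste -sd,) "uname -r" | check_kernels 5.3.18-24.75-default
```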
Run lib/setup-nexus.sh
to configure Nexus and upload new CSM RPM
repositories, container images, and Helm charts:
ncn-m001# cd "$CSM_DISTDIR"
ncn-m001# ./lib/setup-nexus.sh
On success, setup-nexus.sh will output OK on stderr and exit with status code 0, e.g.:
ncn-m001# ./lib/setup-nexus.sh
...
+ Nexus setup complete
setup-nexus.sh: OK
ncn-m001# echo $?
0
In the event of an error, consult the known issues from the install documentation to resolve potential problems and then try running setup-nexus.sh again. Note that subsequent runs of setup-nexus.sh may report FAIL when uploading duplicate assets. This is OK as long as setup-nexus.sh outputs setup-nexus.sh: OK and exits with status code 0.
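The two success conditions described above (exit status 0 and a "setup-nexus.sh: OK" line) can be checked together. The following is a minimal sketch, not part of the release: confirm_nexus_ok is a hypothetical wrapper that runs whatever command it is given, captures stdout and stderr into a temporary file, and succeeds only if both conditions hold. Intermediate FAIL lines from duplicate uploads do not affect the result.

```shell
# Hypothetical wrapper: run a command and succeed only if it exits 0 AND
# prints the line "setup-nexus.sh: OK" (on stdout or stderr).
confirm_nexus_ok() {
    log=$(mktemp)
    rc=0
    # Capture both streams, since the OK marker is written to stderr
    "$@" >"$log" 2>&1 || rc=$?
    cat "$log"
    result=1
    if [ "$rc" -eq 0 ] && grep -qxF 'setup-nexus.sh: OK' "$log"; then
        result=0
    fi
    rm -f "$log"
    return "$result"
}

# Usage sketch (run from "$CSM_DISTDIR"):
#   confirm_nexus_ok ./lib/setup-nexus.sh || echo "setup-nexus.sh did not complete cleanly" >&2
```

One trade-off of this approach is that output is shown only after the command finishes, rather than streamed while it runs.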
Run upgrade.sh
to deploy upgraded CSM applications and services:
ncn-m001# cd "$CSM_DISTDIR"
ncn-m001# ./upgrade.sh
This update includes a new basic UAI image and a new Broker UAI image. The HPE-supplied basic UAI image, cray-uai-sles15sp1:latest, simply needs to be updated by pulling it to the NCN worker nodes and restarting the UAI Kubernetes pods that are using it. The following commands pull the updated images for non-Broker and Broker UAIs onto the worker nodes:
ncn-m001:~ # pdsh -w ncn-w[000-999] crictl pull dtr.dev.cray.com/cray/cray-uai-sles15sp1:latest 2>&1 | grep -v -e "Could not resolve hostname" -e "ssh exited with exit code 255"
ncn-m001:~ # pdsh -w ncn-w[000-999] crictl pull dtr.dev.cray.com/cray/cray-uai-broker:latest 2>&1 | grep -v -e "Could not resolve hostname" -e "ssh exited with exit code 255"
Any running UAIs must be restarted so that they use the new images. To check whether any UAIs are running, list them:
cray uas admin uais list
A non-empty list means that UAIs are running. If you are using Broker UAIs, the list will contain a mix of Broker and non-Broker UAIs. If not, it will contain only non-Broker UAIs.
The following steps will interrupt any users who are working on UAIs (either through a broker or in legacy mode). To minimize surprise, make sure users are notified that you will be restarting UAIs before proceeding.
To refresh non-Broker UAIs (if you have them):
ncn-m001:~ # kubectl delete po -n user $(kubectl get po -n user | grep "^uai-" | awk '{ print $1 }')
To refresh Broker UAIs (if you have them):
ncn-m001:~ # kubectl delete po -n uas $(kubectl get po -n uas | grep "^uai-" | awk '{ print $1 }')
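If a namespace contains no UAI pods, the command substitution in the commands above expands to nothing and kubectl delete is invoked without any pod names. The following is a hedged sketch of one way to guard against that: uai_pod_names is a hypothetical helper that extracts the pod names, and the usage line relies on the -r option of GNU xargs (assumed to be the xargs on the NCNs) to skip the delete when the list is empty.

```shell
# Hypothetical helper: read `kubectl get po` output on stdin and print the
# names of pods that start with "uai-", one per line.
uai_pod_names() {
    awk '$1 ~ /^uai-/ { print $1 }'
}

# Usage sketch (not run here); xargs -r (GNU) does nothing on empty input:
#   kubectl get po -n user --no-headers | uai_pod_names | xargs -r kubectl delete po -n user
#   kubectl get po -n uas  --no-headers | uai_pod_names | xargs -r kubectl delete po -n uas
```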
Finally, this update provides new Compute Node images. If your site uses UAI images built from the Compute Node Image, you will need to build new images and register the new images with UAS, then delete and recreate your running UAIs (if any).
IMPORTANT: Wait at least 15 minutes after upgrade.sh completes to let the various Kubernetes resources get initialized and started.
Run the following validation checks to ensure that everything is still working properly after the upgrade:
Other health checks may be run as desired.
Verify the CSM version has been updated in the product catalog. Confirm that the output of the following command includes version 0.9.5:
ncn-m001# kubectl get cm cray-product-catalog -n services -o jsonpath='{.data.csm}' | yq r -j - | jq -r 'to_entries[] | .key' | sort -V
0.9.2
0.9.3
0.9.4
0.9.5
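The same check can be scripted so that it returns an exit status instead of requiring a visual scan of the list. This is a minimal sketch: has_version is a hypothetical helper that reads the version list on stdin and matches whole lines only, so that a version such as 0.9.50 would not be mistaken for 0.9.5.

```shell
# Hypothetical helper: succeed if the exact version given as $1 appears as a
# whole line on stdin.
has_version() {
    # -x matches whole lines only; -F treats the version as a literal string
    # (so the dots are not regex wildcards); -q suppresses output.
    grep -qxF -- "$1"
}

# Usage sketch:
#   kubectl get cm cray-product-catalog -n services -o jsonpath='{.data.csm}' \
#     | yq r -j - | jq -r 'to_entries[] | .key' | has_version 0.9.5 \
#     && echo "CSM 0.9.5 is in the product catalog"
```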
Confirm the import_date
reflects the timestamp of the upgrade:
ncn-m001# kubectl get cm cray-product-catalog -n services -o jsonpath='{.data.csm}' | yq r - '"0.9.5".configuration.import_date'
Remember to exit your typescript.
ncn-m001# exit
It is recommended to save the typescript file for later reference.