Copyright 2021 Hewlett Packard Enterprise Development LP
This guide contains procedures for upgrading systems running CSM 0.9.2 to CSM 0.9.3. It is intended for system installers, system administrators, and network administrators. It assumes some familiarity with standard Linux and associated tooling.
See CHANGELOG.md in the root of a CSM release distribution for a summary of changes in each CSM release.
Procedures:
For convenience, these procedures make use of environment variables. This section sets the expected environment variables to the appropriate values.
Start a typescript to capture the commands and output from this procedure.
```bash
ncn-m001# script -af csm-update.$(date +%Y-%m-%d).txt
ncn-m001# export PS1='\u@\H \D{%Y-%m-%d} \t \w # '
```
Set `CSM_DISTDIR` to the directory of the extracted release distribution for CSM 0.9.3:

**NOTE:** Use the `--no-same-owner` and `--no-same-permissions` options to `tar` when extracting a CSM release distribution as `root` to ensure the extracted files are owned by `root` and have permissions based on the current `umask` value.
```bash
ncn-m001# tar --no-same-owner --no-same-permissions -zxvf csm-0.9.3.tar.gz
ncn-m001# CSM_DISTDIR="$(pwd)/csm-0.9.3"
```
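As a quick sanity check (an optional step, not part of the original procedure), confirm that `CSM_DISTDIR` points at the extracted distribution:

```bash
ncn-m001# echo "$CSM_DISTDIR"
ncn-m001# ls "${CSM_DISTDIR}/lib/version.sh"
```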
Download and install/upgrade the workaround and documentation RPMs. If this machine does not have direct internet access, these RPMs will need to be downloaded externally and then copied to the system before being installed.
```bash
ncn-m001# rpm -Uvh https://storage.googleapis.com/csm-release-public/shasta-1.4/docs-csm/docs-csm-latest.noarch.rpm
ncn-m001# rpm -Uvh https://storage.googleapis.com/csm-release-public/shasta-1.4/csm-install-workarounds/csm-install-workarounds-latest.noarch.rpm
```
Set `CSM_RELEASE_VERSION` to the version reported by `${CSM_DISTDIR}/lib/version.sh`:
```bash
ncn-m001# CSM_RELEASE_VERSION="$(${CSM_DISTDIR}/lib/version.sh --version)"
```
Set `CSM_SYSTEM_VERSION` to `0.9.2`:
```bash
ncn-m001# CSM_SYSTEM_VERSION="0.9.2"
```
**NOTE:** Installed CSM versions may be listed from the product catalog using:

```bash
ncn-m001# kubectl -n services get cm cray-product-catalog -o jsonpath='{.data.csm}' | yq r -j - | jq -r 'keys[]' | sed '/-/!{s/$/_/}' | sort -V | sed 's/_$//'
```
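As an optional sanity check, the same tooling from the note above can confirm that the expected starting version is installed; this sketch (not part of the original procedure) uses `grep -x` to match the whole line:

```bash
ncn-m001# kubectl -n services get cm cray-product-catalog -o jsonpath='{.data.csm}' | yq r -j - | jq -r 'keys[]' | grep -x "$CSM_SYSTEM_VERSION"
```

The command should print `0.9.2` if that version is installed.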
It is important to first verify a healthy starting state. To do this, run the CSM validation checks. If any problems are found, correct them and re-run the appropriate validation checks before proceeding.
Run `lib/setup-nexus.sh` to configure Nexus and upload new CSM RPM repositories, container images, and Helm charts:
```bash
ncn-m001# cd "$CSM_DISTDIR"
ncn-m001# ./lib/setup-nexus.sh
```
On success, `setup-nexus.sh` will output `setup-nexus.sh: OK` on stderr and exit with status code `0`, e.g.:
```bash
ncn-m001# ./lib/setup-nexus.sh
...
+ Nexus setup complete
setup-nexus.sh: OK
```
In the event of an error, consult the known issues from the install documentation to resolve potential problems and then try running `setup-nexus.sh` again. Note that subsequent runs of `setup-nexus.sh` may report `FAIL` when uploading duplicate assets. This is okay as long as `setup-nexus.sh` outputs `setup-nexus.sh: OK` and exits with status code `0`.
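Because success is defined by the exit status rather than by the presence or absence of individual messages, it can help to check the status code explicitly after a re-run; for example:

```bash
ncn-m001# ./lib/setup-nexus.sh
ncn-m001# echo "setup-nexus.sh exit status: $?"
```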
Update the `coredns` and `kube-multus` resources.

Run `lib/0.9.3/coredns-bump-resources.sh`:
```bash
ncn-m001# ./lib/0.9.3/coredns-bump-resources.sh
```
Expected output looks similar to:
```
Applying new resource limits to coredns pods
Warning: kubectl apply should be used on resource created by either kubectl create --save-config or kubectl apply
deployment.apps/coredns configured
```
Verify that the pods restart with status `Running`:

```bash
ncn-m001# watch "kubectl get pods -n kube-system -l k8s-app=kube-dns"
```
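Alternatively, `kubectl rollout status` (standard kubectl, not part of the original procedure) blocks until the `coredns` deployment has finished rolling out:

```bash
ncn-m001# kubectl rollout status deployment/coredns -n kube-system
```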
Run `lib/0.9.3/multus-bump-resources.sh`:

```bash
ncn-m001# ./lib/0.9.3/multus-bump-resources.sh
```
Expected output looks similar to:
```
Applying new resource limits to kube-multus pods
daemonset.apps/kube-multus-ds-amd64 configured
```
Verify that the pods restart with status `Running`:

```bash
ncn-m001# watch "kubectl get pods -n kube-system -l app=multus"
```
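As with `coredns`, the rollout can be followed to completion; for this daemonset the equivalent would be:

```bash
ncn-m001# kubectl rollout status daemonset/kube-multus-ds-amd64 -n kube-system
```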
On success, the `coredns` and `kube-multus` pods should restart with a status of `Running`.

If any `kube-multus` pods remain in `Terminating` status, force delete them so that the daemonset can restart them successfully.
```bash
ncn-m001# kubectl delete pod <pod-name> -n kube-system --force
```
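If it is unclear which pods are stuck, a small sketch like the following (not part of the original procedure) lists any `kube-multus` pods whose status column reads `Terminating`:

```bash
ncn-m001# kubectl get pods -n kube-system -l app=multus --no-headers | awk '$3 == "Terminating" {print $1}'
```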
Increase the maximum number of pseudo-terminals (`kernel.pty.max`) on the worker NCNs:

```bash
ncn-m001# pdsh -w $(./lib/list-ncns.sh | grep ncn-w | paste -sd,) "echo kernel.pty.max=8196 > /etc/sysctl.d/991-maxpty.conf && sysctl -p /etc/sysctl.d/991-maxpty.conf"
```
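To verify the new setting took effect on the worker NCNs:

```bash
ncn-m001# pdsh -w $(./lib/list-ncns.sh | grep ncn-w | paste -sd,) 'sysctl kernel.pty.max'
```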
Before deploying the manifests, the `cray-product-catalog` role in Kubernetes needs to be updated.
a. Display the role before changing it:
```bash
ncn-m001# kubectl get role -n services cray-product-catalog -o json | jq '.rules[0]'
```
Expected output looks like:
```json
{
  "apiGroups": [
    ""
  ],
  "resources": [
    "configmaps"
  ],
  "verbs": [
    "get",
    "list",
    "update",
    "patch"
  ]
}
```
b. Patch the role:
```bash
ncn-m001# kubectl patch role -n services cray-product-catalog --patch \
'{"rules": [{"apiGroups": [""],"resources": ["configmaps"],"verbs": ["create","get","list","update","patch","delete"]}]}'
```
On success, expected output looks like:
```
role.rbac.authorization.k8s.io/cray-product-catalog patched
```
c. Display the role after the patch:
```bash
ncn-m001# kubectl get role -n services cray-product-catalog -o json | jq '.rules[0]'
```
Expected output looks like:
```json
{
  "apiGroups": [
    ""
  ],
  "resources": [
    "configmaps"
  ],
  "verbs": [
    "create",
    "get",
    "list",
    "update",
    "patch",
    "delete"
  ]
}
```
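A scripted version of the same check (a convenience sketch, not part of the original procedure) confirms that the newly added verbs are present; it should print `true`:

```bash
ncn-m001# kubectl get role -n services cray-product-catalog -o json | jq '.rules[0].verbs | contains(["create","delete"])'
```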
Add a ClusterRoleBinding for the `cray-unbound-coredns` PodSecurityPolicies.

a. Create `cray-unbound-coredns-psp.yaml` with the following contents:
```yaml
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: cray-unbound-coredns-psp
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: restricted-transition-net-raw-psp
subjects:
- kind: ServiceAccount
  name: cray-dns-unbound-manager
  namespace: services
- kind: ServiceAccount
  name: cray-dns-unbound-coredns
  namespace: services
```
b. Run `kubectl apply -f` on `cray-unbound-coredns-psp.yaml`:
```bash
ncn-m001# kubectl apply -f cray-unbound-coredns-psp.yaml
```
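To confirm the binding was created:

```bash
ncn-m001# kubectl get clusterrolebinding cray-unbound-coredns-psp
```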
Add a ClusterRoleBinding to update the PodSecurityPolicies used by the `cray-hms-rts-init` job.

a. Create `cray-hms-rts-init-psp.yaml` with the following contents:
```yaml
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: cray-rts-vault-watcher-psp
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: restricted-transition-net-raw-psp
subjects:
- kind: ServiceAccount
  name: cray-rts-vault-watcher
  namespace: services
```
b. Run `kubectl apply -f` on `cray-hms-rts-init-psp.yaml`:

```bash
ncn-m001# kubectl apply -f cray-hms-rts-init-psp.yaml
```
Run `kubectl delete -n spire job spire-update-bss` to allow the spire chart to be updated properly:

```bash
ncn-m001# kubectl delete -n spire job spire-update-bss
```
Run `upgrade.sh` to deploy upgraded CSM applications and services:

```bash
ncn-m001# ./upgrade.sh
```
**Note:** If you have not already installed the workload manager product, including Slurm and munge, then the `cray-crus` pod is expected to be in the `Init` state. After running `upgrade.sh`, you may observe there are now two copies of the `cray-crus` pod in the `Init` state. This situation is benign and should resolve itself once the workload manager product is installed.
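To check on the `cray-crus` pods directly, a simple grep-based sketch:

```bash
ncn-m001# kubectl get pods -n services | grep cray-crus
```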
Upgrade CSM packages on NCNs.
Get the list of NCNs:
```bash
ncn-m001# ncns="$(./lib/list-ncns.sh | paste -sd,)"
```
Use `zypper ms -d` to disable the following zypper RIS services that configure repositories external to the system:

- `Basesystem_Module_15_SP2_x86_64`
- `Public_Cloud_Module_15_SP2_x86_64`
- `SUSE_Linux_Enterprise_Server_15_SP2_x86_64`
- `Server_Applications_Module_15_SP2_x86_64`

```bash
ncn-m001# pdsh -w "$ncns" 'zypper ms -d Basesystem_Module_15_SP2_x86_64'
ncn-m001# pdsh -w "$ncns" 'zypper ms -d Public_Cloud_Module_15_SP2_x86_64'
ncn-m001# pdsh -w "$ncns" 'zypper ms -d SUSE_Linux_Enterprise_Server_15_SP2_x86_64'
ncn-m001# pdsh -w "$ncns" 'zypper ms -d Server_Applications_Module_15_SP2_x86_64'
```
**NOTE:** Field notice FN #6615a - Shasta V1.4 and V1.4.1 Install Issue with NCN Personalization for SMA included similar guidance. If these zypper services have been previously disabled, verify that they are in fact disabled:

```bash
ncn-m001# pdsh -w "$ncns" 'zypper ls -u'
```
Ensure the `csm-sle-15sp2` repository is configured on every NCN:

```bash
ncn-m001# pdsh -w "$ncns" 'zypper ar -fG https://packages.local/repository/csm-sle-15sp2/ csm-sle-15sp2'
```
**WARNING:** If the `csm-sle-15sp2` repository is already configured on a node, `zypper ar` will error with, e.g.:

```
Adding repository 'csm-sle-15sp2' [...error]
Repository named 'csm-sle-15sp2' already exists. Please use another alias.
```

These errors may be ignored.
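If re-running this step should stay quiet on nodes where the repository already exists, one hedged alternative is to guard `zypper ar` with a lookup; `zypper lr <alias>` exits non-zero when the alias is unknown:

```bash
ncn-m001# pdsh -w "$ncns" 'zypper lr csm-sle-15sp2 >/dev/null 2>&1 || zypper ar -fG https://packages.local/repository/csm-sle-15sp2/ csm-sle-15sp2'
```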
Use `zypper up` on each NCN to upgrade installed packages:

```bash
ncn-m001# pdsh -w "$ncns" 'zypper up -y'
```
Run `./lib/0.9.3/enable-psp.sh` to enable PodSecurityPolicy:

```bash
ncn-m001# ./lib/0.9.3/enable-psp.sh
```
Apply the workaround for the following CVEs: CVE-2021-27365, CVE-2021-27364, CVE-2021-27363.

The affected kernel modules are not typically loaded on Shasta NCNs. The following prevents them from ever being loaded:

```bash
ncn-m001# pdsh -w $(./lib/list-ncns.sh | paste -sd,) "echo 'install libiscsi /bin/true' >> /etc/modprobe.d/disabled-modules.conf"
```
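To verify the entry is in place on every NCN:

```bash
ncn-m001# pdsh -w $(./lib/list-ncns.sh | paste -sd,) 'grep libiscsi /etc/modprobe.d/disabled-modules.conf'
```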
**CRITICAL:** Only perform the following procedure if `$CSM_RELEASE_VERSION >= 0.9.3`.
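One way to express this check in the shell (a sketch using `sort -V` for version-aware comparison; not part of the original procedure):

```bash
# Succeeds when CSM_RELEASE_VERSION is 0.9.3 or newer: after a version-aware
# sort, the smaller of the two version strings must be 0.9.3 itself.
if [ "$(printf '%s\n' 0.9.3 "$CSM_RELEASE_VERSION" | sort -V | head -n1)" = "0.9.3" ]; then
    echo "CSM_RELEASE_VERSION ($CSM_RELEASE_VERSION) is >= 0.9.3; proceed"
fi
```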
**IMPORTANT:** This procedure applies to systems with CDU switches. If your Shasta system is using CDU switches, you will need to update the configuration going to the CMMs. See the v1.4 Admin Guide for details on updating CMM firmware.

Aruba CDU switch configuration. This configuration is identical across CDU VSX pairs. The VLANs used here are generated from CSI.
```
sw-cdu-001(config)# int lag 2 multi-chassis static
sw-cdu-001(config-lag-if)# no shutdown
sw-cdu-001(config-lag-if)# description CMM_CAB_1000
sw-cdu-001(config-lag-if)# no routing
sw-cdu-001(config-lag-if)# vlan trunk native 2000
sw-cdu-001(config-lag-if)# vlan trunk allowed 2000,3000,4091
sw-cdu-001(config-lag-if)# exit
sw-cdu-001(config)# int 1/1/2
sw-cdu-001(config-if)# no shutdown
sw-cdu-001(config-if)# lag 2
sw-cdu-001(config-if)# exit
```
Dell CDU switch configuration. This configuration is identical across CDU VLT pairs. The VLANs used here are generated from CSI.
```
interface port-channel1
 description CMM_CAB_1000
 no shutdown
 switchport mode trunk
 switchport access vlan 2000
 switchport trunk allowed vlan 3000,4091
 mtu 9216
 vlt-port-channel 1

interface ethernet1/1/1
 description CMM_CAB_1000
 no shutdown
 channel-group 1 mode on
 no switchport
 mtu 9216
 flowcontrol receive on
 flowcontrol transmit on
```
**IMPORTANT:** Wait at least 15 minutes after `upgrade.sh` completes to let the various Kubernetes resources get initialized and started.
Run the following validation checks to ensure that everything is still working properly after the upgrade:
Other health checks may be run as desired.
**CAUTION:** The following HMS functional tests may fail because of locked components in HSM:

- `test_bss_bootscript_ncn-functional_remote-functional.tavern.yaml`
- `test_smd_components_ncn-functional_remote-functional.tavern.yaml`

```
Traceback (most recent call last):
  File "/usr/lib/python3.8/site-packages/tavern/schemas/files.py", line 106, in verify_generic
    verifier.validate()
  File "/usr/lib/python3.8/site-packages/pykwalify/core.py", line 166, in validate
    raise SchemaError(u"Schema validation failed:\n - {error_msg}.".format(
pykwalify.errors.SchemaError: <SchemaError: error code 2: Schema validation failed:
 - Key 'Locked' was not defined. Path: '/Components/0'.
 - Key 'Locked' was not defined. Path: '/Components/5'.
 - Key 'Locked' was not defined. Path: '/Components/6'.
 - Key 'Locked' was not defined. Path: '/Components/7'.
 - Key 'Locked' was not defined. Path: '/Components/8'.
 - Key 'Locked' was not defined. Path: '/Components/9'.
 - Key 'Locked' was not defined. Path: '/Components/10'.
 - Key 'Locked' was not defined. Path: '/Components/11'.
 - Key 'Locked' was not defined. Path: '/Components/12'.: Path: '/'>
```
Failures of these tests because of locked components as shown above can be safely ignored.
Remember to exit your typescript:

```bash
ncn-m001# exit
```

It is recommended to save the typescript file for later reference.