Copyright 2021 Hewlett Packard Enterprise Development LP
This guide contains procedures for upgrading systems running CSM 0.9.0 to CSM 0.9.2. It is intended for system installers, system administrators, and network administrators. It assumes some familiarity with standard Linux and associated tooling.
NOTE:
CSM 0.9.1 was not officially released so these procedures start with CSM 0.9.0.
See CHANGELOG.md in the root of a CSM release distribution for a summary of changes in each CSM release.
Procedures:
For convenience, these procedures make use of environment variables. This section sets the expected environment variables to the appropriate values.
Start a typescript to capture the commands and output from this procedure.
ncn-m001# script -af csm-update.$(date +%Y-%m-%d).txt
ncn-m001# export PS1='\u@\H \D{%Y-%m-%d} \t \w # '
Set CSM_DISTDIR to the directory of the extracted release distribution for CSM 0.9.2:
NOTE:
Use the --no-same-owner and --no-same-permissions options to tar when extracting a CSM release distribution as root to ensure the extracted files are owned by root and have permissions based on the current umask value.
ncn-m001# tar --no-same-owner --no-same-permissions -zxvf csm-0.9.2.tar.gz
ncn-m001# CSM_DISTDIR="$(pwd)/csm-0.9.2"
Set CSM_RELEASE_VERSION to the version reported by ${CSM_DISTDIR}/lib/version.sh:
ncn-m001# CSM_RELEASE_VERSION="$(${CSM_DISTDIR}/lib/version.sh --version)"
Set CSM_SYSTEM_VERSION to 0.9.0:
ncn-m001# CSM_SYSTEM_VERSION="0.9.0"
NOTE:
Installed CSM versions may be listed from the product catalog using:
ncn-m001# kubectl -n services get cm cray-product-catalog -o jsonpath='{.data.csm}' | yq r -j - | jq -r 'keys[]' | sed '/-/!{s/$/_/}' | sort -V | sed 's/_$//'
It is important to first verify a healthy starting state. To do this, run the CSM validation checks. If any problems are found, correct them and verify the appropriate validation checks before proceeding.
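For example (a hedged illustration; this assumes the platform-utils validation scripts described in the CSM install validation documentation are present at these paths on ncn-m001):
ncn-m001# /opt/cray/platform-utils/ncnHealthChecks.sh
ncn-m001# /opt/cray/platform-utils/ncnPostgresHealthChecks.sh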
Before deploying upgraded manifests, customizations.yaml in the site-init secret in the loftsman namespace must be updated.
If the site-init repository is available as a remote repository, then clone it on the host orchestrating the upgrade:
ncn-m001# git clone "$SITE_INIT_REPO_URL" site-init
Otherwise, create a new site-init working tree:
ncn-m001# git init site-init
Download customizations.yaml:
ncn-m001# kubectl get secrets -n loftsman site-init -o jsonpath='{.data.customizations\.yaml}' | base64 -d > site-init/customizations.yaml
Review, add, and commit customizations.yaml to the local site-init repository as appropriate.
NOTE:
If site-init was cloned from a remote repository in step 1, there may not be any differences and hence nothing to commit. This is okay. If there are differences between what is in the repository and what was stored in the site-init secret, then it suggests settings were improperly changed at some point. If that is the case, be cautious; there may be dragons ahead.
ncn-m001# cd site-init
ncn-m001# git diff
ncn-m001# git add customizations.yaml
ncn-m001# git commit -m 'Add customizations.yaml from site-init secret'
Update customizations.yaml.
ncn-m001# yq d -i customizations.yaml spec.kubernetes.services.cray-sysmgmt-health.prometheus-operator.prometheus.prometheusSpec.resources
Review the changes to customizations.yaml and verify baseline system customizations and any customer-specific settings are correct.
ncn-m001# git diff
Add and commit customizations.yaml if there are any changes:
ncn-m001# git add customizations.yaml
ncn-m001# git commit -m "Update customizations.yaml consistent with CSM $CSM_RELEASE_VERSION"
Update the site-init secret in the loftsman namespace:
ncn-m001# kubectl delete secret -n loftsman site-init
ncn-m001# kubectl create secret -n loftsman generic site-init --from-file=customizations.yaml
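If desired, confirm the secret was recreated before proceeding (an optional sanity check):
ncn-m001# kubectl get secret -n loftsman site-init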
Push to the remote repository as appropriate:
ncn-m001# git push
Run lib/setup-nexus.sh to configure Nexus and upload new CSM RPM repositories, container images, and Helm charts:
ncn-m001# cd "$CSM_DISTDIR"
ncn-m001# ./lib/setup-nexus.sh
On success, setup-nexus.sh will output OK on stderr and exit with status code 0, e.g.:
ncn-m001# ./lib/setup-nexus.sh
...
+ Nexus setup complete
setup-nexus.sh: OK
In the event of an error, consult the known issues from the install documentation to resolve potential problems and then try running setup-nexus.sh again. Note that subsequent runs of setup-nexus.sh may report FAIL when uploading duplicate assets. This is okay as long as setup-nexus.sh outputs setup-nexus.sh: OK and exits with status code 0.
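If it is unclear whether Nexus itself is reachable and healthy, its status endpoint can be queried directly (a minimal sketch, assuming Nexus is served at packages.local as elsewhere in this guide; an HTTP 200 indicates Nexus is able to handle requests):
ncn-m001# curl -sk -o /dev/null -w '%{http_code}\n' https://packages.local/service/rest/v1/status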
Run kubectl delete -n spire job spire-update-bss to allow the spire chart to be updated properly:
ncn-m001# kubectl delete -n spire job spire-update-bss
Run upgrade.sh to deploy upgraded CSM applications and services:
ncn-m001# ./upgrade.sh
Upgrade packages on NCNs.
Get the list of NCNs:
ncn-m001# ncns="$(./lib/list-ncns.sh | paste -sd,)"
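Optionally verify the list and confirm pdsh can reach every NCN before making changes:
ncn-m001# echo "$ncns"
ncn-m001# pdsh -w "$ncns" 'hostname'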
Use zypper ms -d to disable the following zypper RIS services that configure repositories external to the system:
Basesystem_Module_15_SP2_x86_64
Public_Cloud_Module_15_SP2_x86_64
SUSE_Linux_Enterprise_Server_15_SP2_x86_64
Server_Applications_Module_15_SP2_x86_64
ncn-m001# pdsh -w "$ncns" 'zypper ms -d Basesystem_Module_15_SP2_x86_64'
ncn-m001# pdsh -w "$ncns" 'zypper ms -d Public_Cloud_Module_15_SP2_x86_64'
ncn-m001# pdsh -w "$ncns" 'zypper ms -d SUSE_Linux_Enterprise_Server_15_SP2_x86_64'
ncn-m001# pdsh -w "$ncns" 'zypper ms -d Server_Applications_Module_15_SP2_x86_64'
NOTE:
Field notice FN #6615a - Shasta V1.4 and V1.4.1 Install Issue with NCN Personalization for SMA included similar guidance to the above. If these zypper services have been previously disabled, verify that they are in fact disabled:
ncn-m001# pdsh -w "$ncns" 'zypper ls -u'
Ensure the csm-sle-15sp2 repository is configured on every NCN:
ncn-m001# pdsh -w "$ncns" 'zypper ar -fG https://packages.local/repository/csm-sle-15sp2/ csm-sle-15sp2'
WARNING:
If the csm-sle-15sp2 repository is already configured on a node, zypper ar will error with, e.g.:
Adding repository 'csm-sle-15sp2' [...error] Repository named 'csm-sle-15sp2' already exists. Please use another alias.
These errors may be ignored.
Install the hpe-csm-scripts package on each NCN:
ncn-m001# pdsh -w "$ncns" 'zypper in -y hpe-csm-scripts'
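Optionally confirm the package is now installed on every NCN (the reported version will vary by release):
ncn-m001# pdsh -w "$ncns" 'rpm -q hpe-csm-scripts'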
Previous installs of CSM and other Cray products created git repositories in the VCS service which were set to be publicly visible. To enhance security, please follow the instructions in the Admin guide, chapter 12, “Version Control Service (VCS)” section, to switch the visibility of all *-config-management repositories to private.
Future installations of configuration content into Gitea by CSM and other Cray products will create or patch repositories to private visibility automatically.
As a result of this change, git clone operations will now require credentials. CSM services that clone repositories have been upgraded to use the crayvcs user to clone repositories.
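For example, a manual clone of a configuration repository now requires the crayvcs password; the following is a hedged sketch that assumes the password is stored in the vcs-user-credentials secret in the services namespace and that csm-config-management is the repository of interest:
ncn-m001# VCS_PASSWORD="$(kubectl get secret -n services vcs-user-credentials -o jsonpath='{.data.vcs_password}' | base64 -d)"
ncn-m001# git clone "https://crayvcs:${VCS_PASSWORD}@api-gw-service-nmn.local/vcs/cray/csm-config-management.git"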
Three new Prometheus alert definitions have been added in CSM 0.9.1 for monitoring replication across Postgres instances, which are used by some system management services. The new alerts are PostgresqlReplicationLagSMA (for Postgres instances in the sma namespace), PostgresqlReplicationLagServices (for Postgres instances in all other namespaces), and PostgresqlInactiveReplicationSlot.
In the event that a state of broken Postgres replication persists to the extent that the space allocated for its WAL files fills up, the affected database will likely shut down and create a state where it cannot be brought up again. This can impact the reliability of the related service and can require that it be redeployed with data re-population procedures.
To avoid this unexpected, but possible, event, it is recommended that all administrators configure Prometheus alert notifications for the early detection of Postgres replication lag and, if notified, swiftly follow the suggested remediation actions (to avoid service downtime).
Please access the relevant sections of the 1.4 HPE Cray EX System Administration Guide for information about how to configure Prometheus Alert Notifications (“System Management Health Checks and Alerts” sub-section under “Monitor the System”) and how to re-initialize a Postgres cluster encountering signs of replication lag (“About Postgres” sub-section under “Kubernetes Architecture”).
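Between alert notifications, replication health can also be spot-checked directly; a hedged example, assuming the Zalando postgres operator CRDs used by CSM are present (a Running status with the expected number of instances suggests replication is intact):
ncn-m001# kubectl get postgresql -A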
IMPORTANT:
Wait at least 15 minutes after upgrade.sh completes to let the various Kubernetes resources get initialized and started.
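One way to gauge readiness is to watch for pods that are not yet Running or Completed (an optional check; some transient restarts shortly after the upgrade are normal):
ncn-m001# kubectl get pods -A | grep -v -e Running -e Completed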
Run the following validation checks to ensure that everything is still working properly after the upgrade:
Other health checks may be run as desired.
CAUTION:
The following HMS functional tests may fail because of locked components in HSM:
test_bss_bootscript_ncn-functional_remote-functional.tavern.yaml
test_smd_components_ncn-functional_remote-functional.tavern.yaml
Traceback (most recent call last):
  File "/usr/lib/python3.8/site-packages/tavern/schemas/files.py", line 106, in verify_generic
    verifier.validate()
  File "/usr/lib/python3.8/site-packages/pykwalify/core.py", line 166, in validate
    raise SchemaError(u"Schema validation failed:\n - {error_msg}.".format(
pykwalify.errors.SchemaError: <SchemaError: error code 2: Schema validation failed:
 - Key 'Locked' was not defined. Path: '/Components/0'.
 - Key 'Locked' was not defined. Path: '/Components/5'.
 - Key 'Locked' was not defined. Path: '/Components/6'.
 - Key 'Locked' was not defined. Path: '/Components/7'.
 - Key 'Locked' was not defined. Path: '/Components/8'.
 - Key 'Locked' was not defined. Path: '/Components/9'.
 - Key 'Locked' was not defined. Path: '/Components/10'.
 - Key 'Locked' was not defined. Path: '/Components/11'.
 - Key 'Locked' was not defined. Path: '/Components/12'.: Path: '/'>
Failures of these tests because of locked components as shown above can be safely ignored.
IMPORTANT:
This procedure applies to systems with Aruba management switches.
If your Shasta system is using Aruba management switches, run the updated BGP script /opt/cray/csm/scripts/networking/BGP/Aruba_BGP_Peers.py.
Set the SWITCH_IPS variable to an array containing the IP addresses of the switches.
EXAMPLE: The following can be used to determine the IP addresses of the switches running BGP:
ncn-m001# kubectl get cm config -n metallb-system -o yaml | head -12
apiVersion: v1
data:
  config: |
    peers:
    - peer-address: 10.252.0.2
      peer-asn: 65533
      my-asn: 65533
    - peer-address: 10.252.0.3
      peer-asn: 65533
      my-asn: 65533
    address-pools:
    - name: customer-access
In the above output, 10.252.0.2 and 10.252.0.3 are the switches running BGP. Set SWITCH_IPS as follows:
ncn-m001# SWITCH_IPS=( 10.252.0.2 10.252.0.3 )
Run:
ncn-m001# /opt/cray/csm/scripts/networking/BGP/Aruba_BGP_Peers.py "${SWITCH_IPS[@]}"
Remove the static routes configured in LAYER3-CONFIG. Log into the switches running BGP (Spines/Aggs) and remove them:
Note: To view the current static routes set up on the switch, run the following:
sw-spine01# show ip route static
Displaying ipv4 routes selected for forwarding
'[x/y]' denotes [distance/metric]

0.0.0.0/0, vrf default
        via 10.103.15.161, [1/0], static
10.92.100.60/32, vrf default
        via 10.252.1.10, [1/0], static
10.94.100.60/32, vrf default
        via 10.252.1.10, [1/0], static
In the above example, the static routes that need to be removed point to 10.252.1.10:
sw-spine-001(config)# no ip route 10.92.100.60/32 10.252.1.10
sw-spine-001(config)# no ip route 10.94.100.60/32 10.252.1.10
Verify the BGP configuration.
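For example, on Aruba AOS-CX switches the BGP neighbor state can be checked from the switch CLI (a hedged example; exact output varies by firmware, but the NCN neighbors should reach the Established state):
sw-spine01# show bgp ipv4 unicast summary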
Check to see if firmware is loaded into FAS:
ncn-m001# cray fas images list | grep cc.1.4.19
If the firmware is not installed, rerun the FAS loader:
ncn-w001# kubectl -n services get jobs | grep fas-loader
cray-fas-loader-1 1/1 8m57s 7d15h
NOTE:
In the above example, the returned job name is cray-fas-loader-1, hence that is the job to rerun.
ncn-m001# kubectl -n services get job cray-fas-loader-1 -o json | jq 'del(.spec.selector)' | jq 'del(.spec.template.metadata.labels."controller-uid")' | kubectl replace --force -f -
When completed, verify the firmware was loaded into FAS:
ncn-m001# cray fas images list | grep cc.1.4.19
Update the Chassis Controller BMC Firmware:
Power off the chassis slots.
Disable the hms-discovery job:
ncn-m001# kubectl -n services patch cronjobs hms-discovery -p '{"spec":{"suspend":true}}'
Power off all the components: for example, chassis 0-7 in cabinets 1000-1003:
ncn-m001# cray capmc xname_off create --xnames x[1000-1003]c[0-7] --recursive true --continue true
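Optionally confirm the components report Off before continuing (a hedged check using the same CAPMC CLI):
ncn-m001# cray capmc get_xname_status create --xnames x[1000-1003]c[0-7]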
Create an upgrade JSON file ccBMCupdate.json
:
{
"inventoryHardwareFilter": {
"manufacturer": "cray"
},
"stateComponentFilter": {
"deviceTypes": [
"chassisBMC"
]
},
"targetFilter": {
"targets": [
"BMC"
]
},
"command": {
"version": "latest",
"tag": "default",
"overrideDryrun": false,
"restoreNotPossibleOverride": true,
"timeLimit": 1000,
"description": "Dryrun upgrade of Cray Chassis Controllers"
}
}
Using the above JSON file, run a dry-run with FAS:
ncn-w001# cray fas actions create ccBMCupdate.json
Check the output from the dry-run with the command cray fas actions describe {action-id} (where action-id is the actionId returned by the cray fas actions create command).
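If preferred, the action ID can be captured when the action is created and reused in the describe command (a hedged sketch, assuming the JSON output of the create call contains an actionID field):
ncn-w001# ACTION_ID="$(cray fas actions create ccBMCupdate.json --format json | jq -r '.actionID')"
ncn-w001# cray fas actions describe "${ACTION_ID}"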
If the dry-run succeeded with updates to version 1.4.19, change "overrideDryrun" in the above JSON file to true and update the description.
Rerun FAS with the updated JSON file to do the actual updates.
After firmware update completes, restart the hms-discovery cronjob:
ncn-m001# kubectl -n services patch cronjobs hms-discovery -p '{"spec" : {"suspend" : false }}'
The hms-discovery cronjob will run within 5 minutes of being unsuspended and start powering on the chassis enclosures, switches, and compute blades. If components are not being powered back on, then power them on manually:
ncn-m001# cray capmc xname_on create --xnames x[1000-1003]c[0-7]r[0-7],x[1000-1003]c[0-7]s[0-7] --prereq true --continue true
The --prereq option ensures all required components are powered on first. The --continue option allows the command to complete in systems without fully populated hardware.
After the components have powered on, boot the nodes using the Boot Orchestration Services (BOS).
Remember to exit your typescript.
ncn-m001# exit
It is recommended to save the typescript file for later reference.