This document guides an administrator through the upgrade of Cray Systems Management from v1.0.x to v1.2.x. When upgrading a system, follow this top-level file from top to bottom. The content on this top-level page is meant to be terse. For additional reference material on the upgrade processes and scripts mentioned explicitly on this page, see resource material.
If upgrading from CSM 1.0.x to 1.2.0, then instead upgrade directly to the latest released version of CSM 1.2.x.
A major feature of CSM 1.2.x is the Bifurcated CAN (BICAN). The BICAN is designed to separate administrative network traffic from user network traffic. For more information, see the BICAN Summary. Review the BICAN Summary before continuing with the CSM 1.2.x upgrade.
For detailed BICAN documentation, see the BICAN Technical Details page.
The SMA Grafana service is temporarily inaccessible during the upgrade.
During stage 3 of the CSM 1.2.x upgrade, the SMA Grafana service will become inaccessible at its previous DNS location. It will remain inaccessible until the upgrade to SMA 1.6.x is applied. This is because of a change in DNS names for the service.
Service request adjustments are needed for small systems.
For systems with only three worker nodes (typically Testing and Development Systems (TDS)), prior to proceeding with this upgrade, CPU limits MUST be lowered on several services in order for this upgrade to succeed. This step is executed automatically as part of Stage 0.4 of the upgrade. See TDS Lower CPU Requests for more information.
Independently, for three-worker systems the customizations.yaml
file is edited automatically during the upgrade, prior to deploying new CSM services. These
settings are contained in /usr/share/doc/csm/upgrade/1.2/scripts/upgrade/tds_cpu_requests.yaml
. This file can be modified (prior to proceeding with this
upgrade), if other settings are desired in the customizations.yaml
file for this system.
For more information about modifying customizations.yaml
and tuning for specific systems, see
Post Install Customizations.
kdump
(kernel dump) may hang and fail on NCNs in CSM 1.2.x (HPE Cray EX System Software 22.07 release). During the upgrade, a workaround is applied to fix this.Prior to CSM 1.2, the single Customer Access Network (CAN) carried both the administrative network traffic and the user network traffic. CSM 1.2 introduces bifurcated CAN (BICAN), which is designed to separate administrative network traffic and user network traffic.
Plan and coordinate network upgrade shows the steps that need to be taken in order to prepare for this network upgrade. Follow these steps in order to plan and coordinate the network upgrade with your users, as well as to ensure undisrupted access to UANs during the upgrade.
Important: Take note of the below content for troubleshooting purposes, in the event that issues are encountered during the upgrade process.
General upgrade troubleshooting
If the execution of the upgrade procedure fails, it is safe to rerun the failed script. If a rerun still fails, wait for 10 seconds and then run it again. If the issue persists, then refer to the below troubleshooting procedures.
General Kubernetes troubleshooting
For general Kubernetes commands for troubleshooting, see Kubernetes Troubleshooting Information.
PXE boot troubleshooting
If execution of the upgrade procedures results in NCNs that have errors booting, then refer to the troubleshooting procedures in the PXE Booting Runbook.
NTP troubleshooting
During upgrades, clock skew may occur when rebooting nodes. If one node is rebooted and its clock differs significantly from those that have not been rebooted, it can
cause contention among the other nodes. Waiting for chronyd
to slowly adjust the clocks can resolve intermittent clock skew issues. This can take up to 15 minutes or
longer. If it does not resolve on its own, then follow the Configure NTP on NCNs procedure to troubleshoot it further.
Bare-metal Etcd recovery
During the upgrade process of the master nodes, if it is found that the bare-metal Etcd cluster (that houses values for the Kubernetes cluster) has a failure, it may be necessary to restore that cluster from backup. See Restore Bare-Metal etcd Clusters from an S3 Snapshot for that procedure.
Bare-metal Etcd certificate
After upgrading, the apiserver-etcd-client
certificate may need to been renewed. See Kubernetes and Bare Metal EtcD Certificate Renewal
for procedures to check and renew this certificate.
Back-ups for etcd-operator
Clusters
After upgrading, if health checks indicate that Etcd pods are not in a healthy/running state, recovery procedures may be needed. See
Backups for etcd-operator
Clusters for these procedures.
Recovering from Postgres database issues
After upgrading, if health checks indicate the Postgres pods are not in a healthy/running state, recovery procedures may be needed. See Troubleshoot Postgres Database for troubleshooting and recovery procedures.
Troubleshooting Spire pods not starting on NCNs
Fixing shared-kafka
kafka
cluster after upgrade
Troubleshoot SLS not working
Rerun a step
When running upgrade scripts, each script records what has been done successfully on a node. This is recorded in the
/etc/cray/upgrade/csm/{CSM_VERSION}/{NAME_OF_NODE}/state
file.
If a rerun is required, the recorded steps to be re-run must be removed from this file.
Here is an example of state file of ncn-m001
:
ncn# cat /etc/cray/upgrade/csm/{CSM_VERSION}/ncn-m001/state
Example output:
[2021-07-22 20:05:27] UNTAR_CSM_TARBALL_FILE
[2021-07-22 20:05:30] INSTALL_CSI
[2021-07-22 20:05:30] INSTALL_WAR_DOC
[2021-07-22 20:13:15] SETUP_NEXUS
[2021-07-22 20:13:16] UPGRADE_BSS <=== Remove this line if you want to rerun this step
[2021-07-22 20:16:30] CHECK_CLOUD_INIT_PREREQ
[2021-07-22 20:19:17] APPLY_POD_PRIORITY
[2021-07-22 20:19:38] UPDATE_BSS_CLOUD_INIT_RECORDS
[2021-07-22 20:19:38] UPDATE_CRAY_DHCP_KEA_TRAFFIC_POLICY
[2021-07-22 20:21:03] UPLOAD_NEW_NCN_IMAGE
[2021-07-22 20:21:03] EXPORT_GLOBAL_ENV
[2021-07-22 20:50:36] PREFLIGHT_CHECK
[2021-07-22 20:50:38] UNINSTALL_CONMAN
[2021-07-22 20:58:39] INSTALL_NEW_CONSOLE