This document guides an administrator through the process of upgrading to Cray Systems Management v1.0.1 from v0.9 (v0.9.4 or later) or v1.0.0.
When upgrading a system, this top-level README.md file should be followed from top to bottom; the content on this page is intentionally terse. See the files in the various directories under the resource_material directory for additional reference material supporting the processes and scripts mentioned explicitly on this page.
Throughout this guide, the terms “stable” and “upgrade” are used in the context of the management nodes (NCNs). The “stable” NCN is the master node from which all of these commands are run, and therefore its power state must not be affected. The “upgrade” node is the next node to be upgraded.
When doing a rolling upgrade of the entire cluster, the responsibility of the “stable” NCN will at some point need to be transferred to another master node. However, this does not need to happen until you are ready to upgrade that node.
Important:
Take note of the following content for troubleshooting purposes in case issues are encountered.
For general Kubernetes issues, please see Kubernetes Troubleshooting Information.
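Before diving into that material, a broad look at node and pod health can narrow down where a problem lies. A minimal sketch using standard kubectl queries:

```bash
# Confirm that all Kubernetes nodes report Ready.
kubectl get nodes
# List any pods that are not Running or Completed (the header line is retained).
kubectl get pods -A | grep -Ev 'Running|Completed'
```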
If execution of the upgrade procedures results in NCNs that fail to boot, please refer to these troubleshooting procedures: PXE Booting Runbook
If clock skew is observed on one or more NCNs during the upgrade, the following procedure can be used to troubleshoot NTP configuration or to sync time: Configure NTP on NCNs
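Before consulting that procedure, comparing clock readings across nodes is a quick way to confirm whether skew is actually present. A minimal sketch, assuming passwordless SSH between NCNs; the node list here is illustrative:

```bash
# Print epoch seconds from each NCN; values more than a second or two
# apart suggest clock skew. Adjust the node list to match the system.
for ncn in ncn-m001 ncn-m002 ncn-m003 ncn-w001 ncn-w002 ncn-w003; do
    echo -n "${ncn}: "
    ssh "${ncn}" date +%s
done
```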
If the bare-metal etcd cluster (which stores values for the Kubernetes cluster) fails while the master nodes are being upgraded, it may be necessary to restore that cluster from a backup. Please see Restore Bare-Metal etcd Clusters from an S3 Snapshot for that procedure.
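Before restoring from a snapshot, checking endpoint health can rule out a transient problem. A minimal sketch, assuming etcdctl is available on the master node; the certificate paths follow a kubeadm-style layout and are an assumption, so adjust them to the local etcd configuration:

```bash
# Query the health of the local bare-metal etcd endpoint.
# Certificate paths are assumed; use the paths configured on this system.
ETCDCTL_API=3 etcdctl \
    --endpoints=https://127.0.0.1:2379 \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/etcd/server.crt \
    --key=/etc/kubernetes/pki/etcd/server.key \
    endpoint health
```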
After upgrading, the apiserver-etcd-client certificate may need to be renewed. See Kubernetes and Bare Metal EtcD Certificate Renewal for procedures to check and renew this certificate.
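A quick way to see whether the certificate is close to expiring is to inspect its end date. A minimal sketch, assuming a kubeadm-style certificate layout on the master node (the path is an assumption; the renewal procedure linked above is authoritative):

```bash
# Print the expiration date of the apiserver-etcd-client certificate.
# The path assumes a kubeadm-style layout; adjust if certificates live elsewhere.
openssl x509 -noout -enddate \
    -in /etc/kubernetes/pki/apiserver-etcd-client.crt
```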
After upgrading, if health checks indicate that etcd pods are not in a healthy/running state, recovery procedures may be needed. Please see Backups for etcd-operator Clusters for these procedures.
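Listing the etcd pods by name is a reasonable first step toward identifying which clusters are unhealthy. A minimal sketch; the services namespace is an assumption about where the etcd-operator clusters run, so verify it on the system:

```bash
# List etcd cluster pods; unhealthy pods show a non-Running state or
# frequent restarts. The "services" namespace is an assumption.
kubectl get pods -n services -o wide | grep -i etcd
```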
After upgrading, if health checks indicate the Postgres pods are not in a healthy/running state, recovery procedures may be needed. Please see Troubleshoot Postgres Database for troubleshooting and recovery procedures.
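Similarly, a quick listing can show which Postgres clusters and pods need attention. A minimal sketch, assuming the postgres-operator's postgresql custom resource is installed (if it is not, the first command will simply report an unknown resource type):

```bash
# Show the status of each Postgres cluster managed by the postgres-operator.
kubectl get postgresql -A
# List individual Postgres pods; look for pods not in a Running state.
kubectl get pods -A | grep -i postgres
```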
If Spire fails to start on NCNs after upgrading, please see Troubleshoot Spire Failing to Start on NCNs.
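A first look at the Spire pods themselves can confirm the symptom before following that procedure. A minimal sketch; the spire namespace is an assumption, so adjust it to match the deployment:

```bash
# Check whether the Spire server and agent pods are running.
# The "spire" namespace is an assumption; verify with "kubectl get ns".
kubectl get pods -n spire
```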
When running the upgrade scripts, each script records which steps have completed successfully on a node. This state file is stored at /etc/cray/upgrade/csm/{CSM_VERSION}/{NAME_OF_NODE}/state. If a step needs to be rerun, remove its recorded entry from this file (see the sketch after the example below).
Here is an example of the state file for ncn-m001:
```
ncn-m001:~ # cat /etc/cray/upgrade/csm/{CSM_VERSION}/ncn-m001/state
[2021-07-22 20:05:27] UNTAR_CSM_TARBALL_FILE
[2021-07-22 20:05:30] INSTALL_CSI
[2021-07-22 20:05:30] INSTALL_WAR_DOC
[2021-07-22 20:13:15] SETUP_NEXUS
[2021-07-22 20:13:16] UPGRADE_BSS <=== Remove this line if you want to rerun this step
[2021-07-22 20:16:30] CHECK_CLOUD_INIT_PREREQ
[2021-07-22 20:19:17] APPLY_POD_PRIORITY
[2021-07-22 20:19:38] UPDATE_BSS_CLOUD_INIT_RECORDS
[2021-07-22 20:19:38] UPDATE_CRAY_DHCP_KEA_TRAFFIC_POLICY
[2021-07-22 20:21:03] UPLOAD_NEW_NCN_IMAGE
[2021-07-22 20:21:03] EXPORT_GLOBAL_ENV
[2021-07-22 20:50:36] PREFLIGHT_CHECK
[2021-07-22 20:50:38] UNINSTALL_CONMAN
[2021-07-22 20:58:39] INSTALL_NEW_CONSOLE
```
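For example, to force the UPGRADE_BSS step flagged above to run again on the next invocation of the upgrade script, delete its line from the state file. A minimal sketch; the step name, CSM version placeholder, and node name are taken from the example above:

```bash
# Remove the UPGRADE_BSS entry so the upgrade script reruns that step.
# Substitute the actual CSM version and node name for this system.
sed -i '/UPGRADE_BSS/d' /etc/cray/upgrade/csm/{CSM_VERSION}/ncn-m001/state
```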