Cray System Management (CSM) - Release Notes
CSM 1.5 contains many changes spanning bug fixes, new feature development, and
documentation improvements. This page lists some of the highlights.
New
New software support
- Resolved CASTS:
CAST-31486: cray-powerdns-manager not adding compute nodes to external DNS
CAST-33719: Missing entries in haproxy on storage nodes
CAST-33922: PXE-E18: Server response timeout
- Highly available
prometheus through integration of thanos
- Support and migration to
bitnami-etcd
- Updates to all services using
etcd cluster to migrate to use of bitnami-etcd
- Support for old and new
spire versions running simultaneously toward zero downtime upgrades
- Technical Preview Support for
spire TPM-based remote node attestation
- Power Control Service (PCS) is released
- The V3 Configuration Framework Service (CFS) API is now available,
including support for paging, external repository “sources”, and new options for debugging
- Update of all services using
postgres to support new versions of postgres and postgres-operator
- Boot Orchestration Service (BOS) V2 is the default
- Update
spire version
- Update version of
cert-manager
- Upgrade to Kubernetes version 1.22
- Bump
iuf-cli version to 1.4.5
- Upgrade Argo version to pick up bug fixes
- OPA policies for Multi-tenancy implemented in BOS
- Create DNAME records in PowerDNS
- For Bifurcated CAN update
kiali to use the new API Gateways
- For Bifurcated CAN update
cray-sysmgmt-health to use the new API Gateways
- For Bifurcated CAN update
gitea-vcs-web to use CMN only Istio Gateway
- For Bifurcated CAN update
gitea-vcs-external to use CMN only Istio Gateway
- Improved logging in NLS based on
StageOutput
- Image Management Service (IMS) - created an ARM64 version of the barebones recipe
- Support for large system ARP configuration for first boot and DHCP
- Hardware discovery process populates a nodes’s architecture
- System Layout Service (SLS): Added caching to improve performance and robustness
- Added support for specifying IMS Image and Recipe architecture in Install and Upgrade Framework (IUF)
- Upgraded node images to SLES15SP5
- Updated Spire server to work with TPM
- Added improved error logging for BOS
- Added a method to stop
CFS/Batcher and cancel configuration for in-flight customizations
- Added support for git
submodules in CFS runs
- Added support for ARM64 builds through emulation
- Improved IUF Logging
- Created networking red light / green light dashboard
- Removed clear text switch passwords from the
cray-sysmgmt-health-canu-test pod log
- Ceph nodes run user facing Docker registry that is writable anonymously
- Added support for NID allocation defragmentation
- Multi-tenancy: Vault Transit (KMS) Support for Encrypted Secrets in Version Control Service (VCS)
- Multi-tenancy: Enable Tenant ID + Tenant Admin
AuthZ Awareness for API Ingress (OPA policy)
- Multi-tenancy: Enable Tenant ID + Tenant Admin
AuthZ Awareness for API Ingress
- Multi-tenancy: BOS support for boot, reboot, and shutdown in tenant
- Transitioned from
cray-heartbeat to csm-node-heartbeat
New hardware support
- Add switches in Hardware State Manager (HSM) to
Power Control Service (PCS) and allow for power reset actions
- Updated HSM discovery process to populate a node’s architecture when making the information available via Redfish
- Update to
ilorest-4.1.0.0 for Gen11 Support
- Support for ARM64 added
- Hardware validation of the EX2500 Cabinet
- Support JL627A switches as an edge router for BICAN
- Support for Broadcom PXE boots and interface naming
Improvements
Automation improvements
- Updates to
etcd health checks due to replacement of etcd vendor to bitnami-etcd
- IUF stage for
management-nodes-rollout consumes logs from ncn-rebuild
- Add a test to check master node taints
- Augment
postgres backup goss test to also check for cronjob
- Argo-driven upgrade automation for storage nodes
- Ceph upgrade added to automated storage upgrade
| Platform Component |
Version |
| Kubernetes |
1.22.13 |
containerd |
1.5.16 |
istio |
1.11.8 |
thanos |
0.31.0 |
prometheus-operator |
0.63.0 |
grafterm |
1.0.3 |
keycloak |
21.1.1 |
bitnami-etcd on ncn-mxxx |
3.5.0 |
bitnami-etcd for clusters |
3.5.9 |
coredns |
1.8.4 |
helm |
3.11.2 |
postgresql |
14.8 |
postgres-operator |
1.8.2 |
spire |
0.12.2 |
spire-intermediate |
1.0.0 |
cray-spire |
1.5.5 |
metrics-server |
0.6.3 |
cray-certmanager |
1.5.5 |
argo-workflows |
3.3.6 |
argo-workflow-controller |
3.4.5 |
ceph |
16.2.13 |
Security improvements
- CVE Kernel 5.14.21-150400.24.46.1 for
mozilla-nss
- Removed
postgresql from NCNs to fix CVEs
- Updates to
bind-utils, curl, git-core, java-1_8_0-ibm, and less in NCN image for CVEs
- Updates to
libfreebl3-hmac, libfreebl3, tar, and wireshark in NCN image for CVEs
Metal-basecamp and cray-site-init dependency updates
- Additional of
kyverno and network policies to ensure some secure controls over mqtt namespace
cf-gitea-import: Use CSM-provided Alpine base image to resolve vulnerabilities
- Updated
metacontroller to v4.4.0 to address CVEs
- Fixed
CVE-2023-0386 in CSM 1.5 NCN Images
- Fixed
CVE-2023-32233 in CSM 1.5 NCN Images
- Developed OPA Policy to force Keycloak admin operations through CMN
- Updated
cfs-ara to 1.0.2 to address CVEs
- Updated
hms-shcd-parser to 1.8.0 to address CVEs
- Moved
istio-ingressgateway-cmn service to use the customer-admin-gateway
- Kyverno upgrade needed for
N-2 support policy
- Fixed improper certificate validation CVE in
cfs-operator
- Fixed regular expression DoS CVE in
cfs-ara
- Addressed
Zenbleed CVE on NCNs
- Addressed
CVE-2023-38545 (curl & libcurl) on NCNs
- Added default RBAC role for telemetry API
- SAT: Upgraded
paramiko to resolve CVE-2023-48795
Customer-requested enhancements
Documentation enhancements
- IUF documentation updates for
upgrade_all_products issues
- Addition of a CSM cabling page for the management and edge network
- Added system recovery procedure for
keycloak
- Updates in several places as a result of migration to
bitnami-etcd
- Update to BOS documentation to replace CAPMC references
- Update to IUF upgrade with CSM workflow diagram and documentation
- Update to
postgres backup procedures
- Update to NCN customization and personalization documentation to use
sat-bootprep
- Remove known issue about restored etcd clusters missing PVC
- Added SNMP setup for all switches to Install/Update instructions
- Updated Keycloak documentation to use CMN LB for administrative tasks
- Updated screenshots and documentation steps for LDAP in upgraded Keycloak
- Updated IUF management-nodes-rollout documentation
Bug fixes
- Fix for when QLogic adapter firmware stops responding then fails recovery causing the node to crash.
- Fixes
- QLogic/Marvel Driver Update -
RPM:
qlgc-fastlinq-kmp-default-8.74.1.0_k5.14.21_150500.53-1.sles15sp5.x86_64.rpm
- SUSE Kernel Update: Version:
5.14.21-150500.55.39.1.27360.1.PTF.1215587
- Fix for invalid
preinstall VCS check in IUF in the event of a fresh install
- Fix deployment failure due to DNS timeouts when
max_fails=0 is set in coredns
- Fixed an issue on upgrade of master NCNs due to not generating the
admin-tools keyring
- Fix for a case where
bootprep files are missing when prepare-images stage is run in IUF with one argument
- Fix IUF issue with SHS error in
update-vcs-config stage
- Ensure IUF stage for
management-node-rollout is aborted, also abort ncn-rebuild
- Fixed issue with Nexus failing to move to another NCN on upgrade
- Fixed procedure to change root password and SSH keys so it would also work on image customization
- Update
tds_lower_cpu_requests.sh script for opensearch-masters due to CPU it eats
- Removed
subPath-volumeMount in the multus-daemonset to avoid being stuck in termination
- Improve existing image check logic for
ims-python-helper
- Fix issue with
etcd_cluster_balance.sh reporting failure when 3 pods are healthy and 4th is terminating
- Fixed issue where upgrading the
cray-dns-unbound Helm chart should not wipe the DNS records
- Fixed an incorrectly written Network Policy in
cray-drydock for mqtt/spire communication
- Fixed an issue where restarting
kea on large systems wipes DNS records from configmap
- Updated Unbound to not forward
.hsn queries to the site DNS
- Fixed when
cray-externaldns-manager crashes when used with external-dns-0.13
- Fixed an issue where Weave pods were not starting after upgrading to CSM V1.4 content
- Fixed PowerDNS server TLD is missing NS delegation records for subdomains
- Fixed an issue where FRU Tracking doesn’t create a detected event after a removed event
- Fixed an issue where
kdump would freeze while writing the vmcore and other dump files
Deprecations
- The
ipv4-resolvers option has been removed for CSI as it is not used
- CAPMC
- Removed ARS from Cray CLI and Boot Script Service (BSS) API specification
- Removed deprecated BOS V1 CFS fields from session templates
For a list of all deprecated CSM features, see Deprecations.
Removals
- Remove
TRS-operator for fresh installs and on upgrades
- Remove
metal-net scripts as they are no longer used
- Remove
etcd-operator as a result of migration to use bitnami-etcd
- CRUS
- Deprecated Boot Orchestration Service (BOS) v1 session template and boot
set fields are no longer stored in BOS.
- Removed -P option from
cray-dhcp-kea startup options
- Stopped using
skopeo images < 1.13.2
For a list of all features with an announced removal target,
see Removals
Known issues
- Image build and customization jobs generally take longer than they did in previous CSM versions. For more details, see
Image Job Performance.
- Firmware Action Service (FAS) Loader / HFP script
post-deliver-product.sh
- Loading firmware from Nexus using the FAS Loader will intermittently crash with HFP release 23.12 or later. Rerunning the FAS Loader will be required.
- This affects the HFP script
post-deliver-product.sh which will hang when the FAS Loader crashes. Rerunning the script will be required.
- IUF procedure calls the
post-deliver-product.sh script and may require restarting that IUF process.
- This is expected to be fixed in CSM 1.5.1.
- Power Control Service (PCS) is unable to place power caps on Blanca Peak (
ex254n) and Parry Peak (ex255a) compute nodes
cray-tftp-upload
- The
cray-tftp-upload script errors out because of a change to the ipxe pods and the TFTP repository.
- This error affects the
cray-upload-recovery-images script.
- This is expected to be fixed in CSM 1.5.1.
- A workaround is presented in Upload BMC Recovery Firmware into TFTP Server
check_bios_firmware_versions.sh script does not report valid expected firmware versions.
- Correct firmware versions are:
- 1.53, 2.78 or 2.98 for DL325 / DL385 firmware.
v1.48, v1.50, v1.69 or v2.84 for DL325 / DL385 BIOS.
- C38 for Gigabyte BIOS.
- Update will be in CSM 1.5.1.
- On large systems, BOS v2 sessions, some CSM health checks, and SAT status may not work properly, because of a failed interaction with CFS.
- Most of this will be fixed in CSM 1.5.1. The remainder will be fixed in CSM 1.6.0.
- For more information, including a workaround, see CFS V2 Failures On Large Systems.
- The hms-discovery job may fail to finish when trying to communicate with non-existent switches. This may happen when incrementally building up a new system and running CSM prior to completion of the full hardware installation.
- The BOS v2
applystaged endpoint is broken in CSM 1.5. This endpoint is used to execute a rolling reboot.
- This is expected to be fixed in CSM 1.6.
- For more information see Rolling reboots.
- The multi-tenancy feature is broken in CSM 1.5.
- This is expected to be fixed in CSM 1.5.3.
- After updating Paradise BMC firmware, the
hmcollector-poll service will lose event subscriptions and must be restarted
- There are resource leaks in several HMS services (PCS, SMD, hmcollector, and FAS)
- This issue is partially resolved by a hotfix for the CSM 1.5.2 release and fully resolved in the CSM 1.5.3 and 1.6.1 releases
- For more information, including a workaround, see HMS Resource Leaks.
sat bootprep image customization error
- PostgreSQL clusters may fall into
SyncFailed state after upgrade from CSM 1.4 to 1.5 due to Kyverno webhook conflicts
For a full list of known issues, see Known issues.