Cray System Management (CSM) - Release Notes

CSM 1.5 contains many changes spanning bug fixes, new feature development, and documentation improvements. This page lists some of the highlights.

New

New software support

  • Resolved CASTS:
    • CAST-31486: cray-powerdns-manager not adding compute nodes to external DNS
    • CAST-33719: Missing entries in haproxy on storage nodes
    • CAST-33922: PXE-E18: Server response timeout
  • Highly available prometheus through integration of thanos
  • Support and migration to bitnami-etcd
  • Updates to all services using etcd cluster to migrate to use of bitnami-etcd
  • Support for old and new spire versions running simultaneously toward zero downtime upgrades
  • Technical Preview Support for spire TPM-based remote node attestation
  • Power Control Service (PCS) is released
  • The V3 Configuration Framework Service (CFS) API is now available, including support for paging, external repository “sources”, and new options for debugging
  • Update of all services using postgres to support new versions of postgres and postgres-operator
  • Boot Orchestration Service (BOS) V2 is the default
  • Update spire version
  • Update version of cert-manager
  • Upgrade to Kubernetes version 1.22
  • Bump iuf-cli version to 1.4.5
  • Upgrade Argo version to pick up bug fixes
  • OPA policies for Multi-tenancy implemented in BOS
  • Create DNAME records in PowerDNS
  • For Bifurcated CAN update kiali to use the new API Gateways
  • For Bifurcated CAN update cray-sysmgmt-health to use the new API Gateways
  • For Bifurcated CAN update gitea-vcs-web to use CMN only Istio Gateway
  • For Bifurcated CAN update gitea-vcs-external to use CMN only Istio Gateway
  • Improved logging in NLS based on StageOutput
  • Image Management Service (IMS) - created an ARM64 version of the barebones recipe
  • Support for large system ARP configuration for first boot and DHCP
  • Hardware discovery process populates a nodes’s architecture
  • System Layout Service (SLS): Added caching to improve performance and robustness
  • Added support for specifying IMS Image and Recipe architecture in Install and Upgrade Framework (IUF)
  • Upgraded node images to SLES15SP5
  • Updated Spire server to work with TPM
  • Added improved error logging for BOS
  • Added a method to stop CFS/Batcher and cancel configuration for in-flight customizations
  • Added support for git submodules in CFS runs
  • Added support for ARM64 builds through emulation
  • Improved IUF Logging
  • Created networking red light / green light dashboard
  • Removed clear text switch passwords from the cray-sysmgmt-health-canu-test pod log
  • Ceph nodes run user facing Docker registry that is writable anonymously
  • Added support for NID allocation defragmentation
  • Multi-tenancy: Vault Transit (KMS) Support for Encrypted Secrets in Version Control Service (VCS)
  • Multi-tenancy: Enable Tenant ID + Tenant Admin AuthZ Awareness for API Ingress (OPA policy)
  • Multi-tenancy: Enable Tenant ID + Tenant Admin AuthZ Awareness for API Ingress
  • Multi-tenancy: BOS support for boot, reboot, and shutdown in tenant
  • Transitioned from cray-heartbeat to csm-node-heartbeat

New hardware support

  • Add switches in Hardware State Manager (HSM) to Power Control Service (PCS) and allow for power reset actions
  • Updated HSM discovery process to populate a node’s architecture when making the information available via Redfish
  • Update to ilorest-4.1.0.0 for Gen11 Support
  • Support for ARM64 added
  • Hardware validation of the EX2500 Cabinet
  • Support JL627A switches as an edge router for BICAN
  • Support for Broadcom PXE boots and interface naming

Improvements

Automation improvements

  • Updates to etcd health checks due to replacement of etcd vendor to bitnami-etcd
  • IUF stage for management-nodes-rollout consumes logs from ncn-rebuild
  • Add a test to check master node taints
  • Augment postgres backup goss test to also check for cronjob
  • Argo-driven upgrade automation for storage nodes
  • Ceph upgrade added to automated storage upgrade

Base platform component upgrades

Platform Component Version
Kubernetes 1.22.13
containerd 1.5.16
istio 1.11.8
thanos 0.31.0
prometheus-operator 0.63.0
grafterm 1.0.3
keycloak 21.1.1
bitnami-etcd on ncn-mxxx 3.5.0
bitnami-etcd for clusters 3.5.9
coredns 1.8.4
helm 3.11.2
postgresql 14.8
postgres-operator 1.8.2
spire 0.12.2
spire-intermediate 1.0.0
cray-spire 1.5.5
metrics-server 0.6.3
cray-certmanager 1.5.5
argo-workflows 3.3.6
argo-workflow-controller 3.4.5
ceph 16.2.13

Security improvements

  • CVE Kernel 5.14.21-150400.24.46.1 for mozilla-nss
  • Removed postgresql from NCNs to fix CVEs
  • Updates to bind-utils, curl, git-core, java-1_8_0-ibm, and less in NCN image for CVEs
  • Updates to libfreebl3-hmac, libfreebl3, tar, and wireshark in NCN image for CVEs
  • Metal-basecamp and cray-site-init dependency updates
  • Additional of kyverno and network policies to ensure some secure controls over mqtt namespace
  • cf-gitea-import: Use CSM-provided Alpine base image to resolve vulnerabilities
  • Updated metacontroller to v4.4.0 to address CVEs
  • Fixed CVE-2023-0386 in CSM 1.5 NCN Images
  • Fixed CVE-2023-32233 in CSM 1.5 NCN Images
  • Developed OPA Policy to force Keycloak admin operations through CMN
  • Updated cfs-ara to 1.0.2 to address CVEs
  • Updated hms-shcd-parser to 1.8.0 to address CVEs
  • Moved istio-ingressgateway-cmn service to use the customer-admin-gateway
  • Kyverno upgrade needed for N-2 support policy
  • Fixed improper certificate validation CVE in cfs-operator
  • Fixed regular expression DoS CVE in cfs-ara
  • Addressed Zenbleed CVE on NCNs
  • Addressed CVE-2023-38545 (curl & libcurl) on NCNs
  • Added default RBAC role for telemetry API
  • SAT: Upgraded paramiko to resolve CVE-2023-48795

Customer-requested enhancements

Documentation enhancements

  • IUF documentation updates for upgrade_all_products issues
  • Addition of a CSM cabling page for the management and edge network
  • Added system recovery procedure for keycloak
  • Updates in several places as a result of migration to bitnami-etcd
  • Update to BOS documentation to replace CAPMC references
  • Update to IUF upgrade with CSM workflow diagram and documentation
  • Update to postgres backup procedures
  • Update to NCN customization and personalization documentation to use sat-bootprep
  • Remove known issue about restored etcd clusters missing PVC
  • Added SNMP setup for all switches to Install/Update instructions
  • Updated Keycloak documentation to use CMN LB for administrative tasks
  • Updated screenshots and documentation steps for LDAP in upgraded Keycloak
  • Updated IUF management-nodes-rollout documentation

Bug fixes

  • Fix for when QLogic adapter firmware stops responding then fails recovery causing the node to crash.
    • Fixes
      • QLogic/Marvel Driver Update - RPM: qlgc-fastlinq-kmp-default-8.74.1.0_k5.14.21_150500.53-1.sles15sp5.x86_64.rpm
      • SUSE Kernel Update: Version: 5.14.21-150500.55.39.1.27360.1.PTF.1215587
  • Fix for invalid preinstall VCS check in IUF in the event of a fresh install
  • Fix deployment failure due to DNS timeouts when max_fails=0 is set in coredns
  • Fixed an issue on upgrade of master NCNs due to not generating the admin-tools keyring
  • Fix for a case where bootprep files are missing when prepare-images stage is run in IUF with one argument
  • Fix IUF issue with SHS error in update-vcs-config stage
  • Ensure IUF stage for management-node-rollout is aborted, also abort ncn-rebuild
  • Fixed issue with Nexus failing to move to another NCN on upgrade
  • Fixed procedure to change root password and SSH keys so it would also work on image customization
  • Update tds_lower_cpu_requests.sh script for opensearch-masters due to CPU it eats
  • Removed subPath-volumeMount in the multus-daemonset to avoid being stuck in termination
  • Improve existing image check logic for ims-python-helper
  • Fix issue with etcd_cluster_balance.sh reporting failure when 3 pods are healthy and 4th is terminating
  • Fixed issue where upgrading the cray-dns-unbound Helm chart should not wipe the DNS records
  • Fixed an incorrectly written Network Policy in cray-drydock for mqtt/spire communication
  • Fixed an issue where restarting kea on large systems wipes DNS records from configmap
  • Updated Unbound to not forward .hsn queries to the site DNS
  • Fixed when cray-externaldns-manager crashes when used with external-dns-0.13
  • Fixed an issue where Weave pods were not starting after upgrading to CSM V1.4 content
  • Fixed PowerDNS server TLD is missing NS delegation records for subdomains
  • Fixed an issue where FRU Tracking doesn’t create a detected event after a removed event

Deprecations

  • The ipv4-resolvers option has been removed for CSI as it is not used
  • CAPMC
  • Removed ARS from Cray CLI and Boot Script Service (BSS) API specification
  • Removed deprecated BOS V1 CFS fields from session templates

For a list of all deprecated CSM features, see Deprecations.

Removals

  • Remove TRS-operator for fresh installs and on upgrades
  • Remove metal-net scripts as they are no longer used
  • Remove etcd-operator as a result of migration to use bitnami-etcd
  • CRUS
  • Deprecated Boot Orchestration Service (BOS) v1 session template and boot set fields are no longer stored in BOS.
  • Removed -P option from cray-dhcp-kea startup options
  • Stopped using skopeo images < 1.13.2

For a list of all features with an announced removal target, see Removals

Known issues

  • Firmware Action Service (FAS) Loader / HFP script post-deliver-product.sh
    • Loading firmware from Nexus using the FAS Loader will intermittently crash with HFP release 23.12 or later. Rerunning the FAS Loader will be required.
    • This affects the HFP script post-deliver-product.sh which will hang when the FAS Loader crashes. Rerunning the script will be required.
    • IUF procedure calls the post-deliver-product.sh script and may require restarting that IUF process.
    • This is expected to be fixed in CSM 1.5.1.
  • Power Control Service (PCS) is unable to place power caps on Blanca Peak (ex254n) and Parry Peak (ex255a) compute nodes
  • cray-tftp-upload
    • The cray-tftp-upload script errors out because of a change to the ipxe pods and the TFTP repository.
    • This error affects the cray-upload-recovery-images script.
    • This is expected to be fixed in CSM 1.5.1.
    • A workaround is presented in Upload BMC Recovery Firmware into TFTP Server
  • check_bios_firmware_versions.sh script does not report valid expected firmware versions.
    • Correct firmware versions are:
      • 1.53, 2.78 or 2.98 for DL325 / DL385 firmware.
      • v1.48, v1.50, v1.69 or v2.84 for DL325 / DL385 BIOS.
      • C38 for Gigabyte BIOS.
    • Update will be in CSM 1.5.1.
  • On large systems, BOS v2 sessions, some CSM health checks, and SAT status may not work properly, because of a failed interaction with CFS.
    • Most of this will be fixed in CSM 1.5.1. The remainder will be fixed in CSM 1.6.0.
    • For more information, including a workaround, see CFS V2 Failures On Large Systems.
  • The hms-discovery job may fail to finish when trying to communicate with non-existent switches. This may happen when incrementally building up a new system and running CSM prior to completion of the full hardware installation.
  • The BOS v2 applystaged endpoint is broken in CSM 1.5. This endpoint is used to execute a rolling reboot.
    • This is expected to be fixed in CSM 1.6.
    • For more information see Rolling reboots.
  • The multi-tenancy feature is broken in CSM 1.5.
    • This is expected to be fixed in CSM 1.5.3.
  • After updating Paradise BMC firmware, the hmcollector-poll service will lose event subscriptions and must be restarted