Cray System Management (CSM) - Release Notes
CSM 1.3 contains approximately 500 changes spanning bug fixes, new feature development, and documentation improvements. This page lists some of the highlights.
New
Monitoring
- Temperature hardware monitoring dashboard for NCNs
- Support for export of SNMP data from multiple switches for population of the SNMP Export Grafana panel
- Space monitoring improvements: volumes other than the root file system are now included
DHCP
Changed in DHCP
- Kea: Fixed a bug that could cause the auto-repair logic to fail due to an incorrect index
DNS
Changed in DNS
- ExternalDNS: Fixed a bug where cray-externaldns-manager could panic if it couldn’t connect to PowerDNS on startup
- PowerDNS: Changed powerdns-manager SLS error message to debug
- PowerDNS: Fixed bug that caused powerdns-manager and externaldns-manager to update the same record
- PowerDNS: Fixed a bug that could cause powerdns-manager to intermittently crash when performing a lookup for an existing TSIG key
- PowerDNS: powerdns-manager will now retry attempts to add a TSIG key
- PowerDNS: powerdns-manager will now create PTR records that are not created by external-dns
- Unbound: Changed cray-dns-unbound MaxUnavailable default from 0 to 1 to avoid issues when evicting pods from NCNs
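The effect of the MaxUnavailable change can be illustrated with the arithmetic Kubernetes applies to a pod disruption budget. The following is a minimal sketch, not CSM code: the function name and the replica count of 3 are assumptions chosen for illustration.

```python
def disruptions_allowed(expected_pods: int, healthy_pods: int, max_unavailable: int) -> int:
    """Voluntary evictions a pod disruption budget would permit right now."""
    # With maxUnavailable set, at least this many pods must stay healthy.
    desired_healthy = max(expected_pods - max_unavailable, 0)
    # Only healthy pods beyond that floor may be evicted.
    return max(healthy_pods - desired_healthy, 0)


# Hypothetical deployment with 3 healthy replicas.
print(disruptions_allowed(3, 3, max_unavailable=0))  # old default: no eviction is permitted
print(disruptions_allowed(3, 3, max_unavailable=1))  # new default: one pod may be evicted
```

With a value of 0 no pod can ever be evicted voluntarily, so draining an NCN that hosts one of the pods would stall; a value of 1 lets the drain proceed one pod at a time.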
Management Network
Added in Management Network
- Documentation: Added procedure to migrate from the customer access network (CAN) to the customer high-speed network (CHN), allowing user traffic over
the HSN instead of the NMN (this is an extension of the bifurcated CAN feature that shipped in CSM 1.2)
- Documentation: Added various troubleshooting guides
- Hardware: Validated the Aruba 8360 (JL705C) switch for the management network
Changed in Management Network
- CANU: Fixed bug where UAN VLANs in generated switch configurations were wrong when using the CHN
- CANU: Various other bug fixes (see the CANU GitHub page for a full changelog)
- CANU: Added an ACL on systems with Dell and Mellanox switches to prevent high-speed network (HSN) switches on the Hardware Management Network (HMN)
from communicating with the Fabric Manager service on the Node Management Network (NMN) API gateway
- CSI: Fixed bug where CSI could generate bad SLS chassis data
- Documentation: Admins are now asked to update CANU to the latest version when beginning a CSM upgrade
- Documentation: Management network switch upgrade instructions are now separate from the CSM upgrade procedure
- Documentation: Removed stale reference to Kea postgres from troubleshooting documentation
- Documentation: Fixed invalid commands in generate_switch_configs.md
Management Nodes (Ceph, Kubernetes Workers, and Kubernetes Managers)
Added in Management Nodes
- All: Initial support for NVMe drives
Changed in Management Nodes
- All: Updated kernel to kernel-default-5.3.18-150300.59.87.1
- All: Various package updates to apply latest security patches
- All: kdump is now more reliable and remains functional after IMS image customization
- All: S3 now requires authentication to retrieve Management Node OS images
- All: metal.no-wipe=0 is now more reliable when running during a net boot
- All: SSH keys must now be injected into images
- All: Time synchronization is now more reliable on initial configuration
- All: Pressure Stall Information (PSI) is now enabled by default (see related CSM 1.3 docs)
- Documentation: Various updates to installation guides
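With Pressure Stall Information enabled, the Linux kernel exposes pressure data in the files /proc/pressure/cpu, /proc/pressure/memory, and /proc/pressure/io. The following is a minimal sketch of reading that format; the function name is hypothetical and the sample numbers are made up for illustration.

```python
def parse_psi(text: str) -> dict:
    """Parse the contents of a /proc/pressure/<resource> file."""
    result = {}
    for line in text.strip().splitlines():
        kind, *fields = line.split()  # kind is "some" or "full"
        values = dict(field.split("=", 1) for field in fields)
        result[kind] = {
            "avg10": float(values["avg10"]),    # % of time stalled, last 10 s
            "avg60": float(values["avg60"]),    # % of time stalled, last 60 s
            "avg300": float(values["avg300"]),  # % of time stalled, last 300 s
            "total": int(values["total"]),      # cumulative stall time in microseconds
        }
    return result


# Sample text in the kernel's documented format; the numbers are invented.
sample = (
    "some avg10=1.50 avg60=0.75 avg300=0.10 total=123456\n"
    "full avg10=0.00 avg60=0.00 avg300=0.00 total=0\n"
)
psi = parse_psi(sample)
print(psi["some"]["avg10"])
```

On a node with PSI enabled, the same function could be given the result of reading /proc/pressure/memory instead of the sample string.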
User Application Service (UAS) and User Application Instances (UAI)
User Application Nodes (UAN)
Added in UAN
- UAN: Initial release of UAN images based on kernels without modifications (technical preview)
- Documentation: Added procedure for re-purposing compute nodes as UANs (only applicable in specific scenarios)
- Documentation: Added instructions to set/trim the boot order on UANs
Changed in UAN
- UAN: Pressure Stall Information (PSI) is now enabled by default on COS-based images
- UAN: Updated to the latest COS image
- UAN: Network changes related to the CAN and CHN may impact VLAN tagging on management network ports connected to UANs to ensure proper network traffic segregation.
Miscellaneous functionality
- Integrated Kyverno Native Policy Management engine
- Ansible has been added to NCNs
- Added support for procedures:
  - Replace/Remove/Add NCNs
  - Add River cabinets
- Integrated Argo server workflow engine for Kubernetes
- Technology Preview: BOS V2
  - Asynchronous boot state handling and CRUS replacement for rolling upgrades
- Technology Preview: Tenant and Partition Management Service (TAPMS)
- Added support for using SCSD to enable or disable TPM BIOS setting on Gigabyte and HPE hardware
- Boot NCNs using private S3 bucket
- Enable IMS recipe templating to allow for dynamic repository selection
- CSM health check performance improvements
  - HMS tests now execute in parallel using Helm Test
  - NCN and Kubernetes health checks now execute in parallel and eliminate lengthy output for tests that pass
- Included SAT CLI in CSM (see SAT in CSM)
New hardware support
- Aruba JL705C, JL706C, JL707C management network switches
- Milan-based DL325 as a Compute Node
- Olympus Antero Blade (AMD Genoa) with Slingshot 11
Automation improvements
- Support for Argo-driven upgrade of multiple Kubernetes Worker NCNs in parallel (Tech Preview)
- Support for Argo-driven rebuild of multiple Kubernetes Worker NCNs in parallel (Tech Preview)
- Support for Argo-driven upgrade of Storage NCNs, serially
- Ceph upgrade is now driven using a utility called cubs_tool.
- Re-organization of goss test execution during installs and upgrades to remove duplicated tests
- Improvement of goss test suite output to display summary of failing tests
- Removed manual prompts from upgrade of storage NCNs
- Introduced a weekly spire-intermediate cron job that checks whether the CA needs automatic renewal
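The kind of check a periodic CA renewal job performs can be sketched as a comparison between the certificate expiry date and a renewal window. This is an illustration only, not the logic of the spire-intermediate job: the function name, the dates, and the 30-day window are all assumptions.

```python
from datetime import datetime, timedelta, timezone


def needs_renewal(not_after: datetime, now: datetime,
                  window: timedelta = timedelta(days=30)) -> bool:
    """Return True when the certificate expires within `window` of `now`."""
    return not_after - now <= window


# Hypothetical dates; the 30-day window is an assumption, not a CSM setting.
now = datetime(2023, 1, 1, tzinfo=timezone.utc)
print(needs_renewal(datetime(2023, 1, 20, tzinfo=timezone.utc), now))  # True
print(needs_renewal(datetime(2023, 6, 1, tzinfo=timezone.utc), now))   # False
```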
| Platform Component | Version |
|--------------------|---------|
| Ceph | 16.2.9 |
| containerd | 1.5.12 |
| coredns | 1.8.0 |
| cray-dhcp-kea | 0.10.15 |
| cray-ipxe | 1.11.0 |
| Istio | 1.10.6 |
| Kubernetes | 1.21.12 |
| Kiali | 1.36.7 |
| Nexus | 3.38.0-1 |
| podman | 3.4.4 |
| PostgreSQL | 12.12 |
| Prometheus | 2.36.1 |
| oauth2-proxy | 7.3.0 |
| cray-opa | 0.42.1 |
| cray-velero | 1.6.3-2 |
Security improvements
- Boot Security - Randomized iPXE File Name
- Boot Security - NCN boots via pre-signed S3 URLs
- API least privileges (xname filtering)
- Role Based Access Control (RBAC) Role for monitoring
- Kubernetes Pod Runtime Security – Phase 1 (non-root)
- Kubernetes API (etcd) Encryption (opt-in)
- Tenant and Partition Management Service (TAPMS Tech Preview)
- Access allowed to heartbeat’s tunables OPA for cray-heartbeat
- NCN CVE Remediation
- Reduced high/critical CVEs in container images to near zero
- Replaced High/Critical CVE container use in Spire
- Addressed CVE remediation for postgres-operator
- Addressed Expat-15: High/Critical CVE container use in UAS/UAI
Customer-requested enhancements
- Added the ability to list all lock conditions with Cray HSM locks API
- Enabled pressure stats on all nodes with Linux 5.x kernel
- Added initial (Tech Preview) support for API-driven NCN lifecycle operations driven via Argo workflows (for worker and storage NCN upgrades)
Documentation enhancements
- Added documentation for:
  - Add/Remove/Replace NCN procedures
  - Add/Remove/Replace compute nodes using sat swap blade
  - How to troubleshoot ncn-m001 PXE loop
  - NCN image modification using IMS and CFS
  - Minimal space requirements for CSM V1.3.0
  - The new cray-externaldns-manager service
- CAN documentation updated to reflect BICAN
Bug fixes
Deprecations
The following features are now deprecated and will be removed from CSM in a future release.
- BOS v1 is now deprecated, in favor of BOS v2. BOS v1 will be removed from CSM in the CSM-1.9 release.
  - It is likely that even prior to BOS v1 being removed from CSM, the Cray CLI will change its behavior when no
    version is explicitly specified in BOS commands. Currently it defaults to BOS v1, but it may change to default to BOS v2 even before BOS v1
    is removed from CSM.
For a list of all deprecated CSM features, including those that were deprecated in previous CSM releases but have not yet been removed,
see Deprecated Features.
Removals
- SLS support for downloading and uploading credentials in the dumpstate and loadstate REST APIs
The following previously deprecated feature now has an announced CSM version when it will be removed:
- CRUS was deprecated in CSM 1.2, and will be removed in CSM 1.5.
For a list of all features with an announced removal target, see Removals.
Known issues
- UAIs use a default route that sends outbound packets over the CMN. This will be addressed in a future release so that the default route uses the CAN/CHN.
- On some systems, Ceph can begin to exhibit latency over time. If this occurs, it can eventually cause services like slurm and services that are backed by etcd clusters to exhibit slowness and possible timeouts.
  See Known Issue: Ceph OSD latency for a workaround.
Security vulnerability exceptions in CSM 1.3
Significant effort went into the tracking, elimination, and/or reduction of critical or high (and lower) security vulnerabilities of container images included in the CSM 1.3 release.
There remain, however, a small number of exceptions that are listed below. General reasons for carrying exceptions include needing to version pin certain core components,
upstream fixes not being available, or new vulnerability detection or fixes occurring after release content is frozen. A new effort to track and address security vulnerabilities
of container images spins up with each major CSM release.
| Image | Reason |
|-------|--------|
| csm-docker/stable/quay.io/ceph/ceph:v15.2.15 | This version of Ceph (Octopus) is needed in the upgrade procedure, but is not present after that. |
| csm-docker/stable/quay.io/ceph/ceph:v15.2.16 | This version of Ceph (Octopus) is needed in the upgrade procedure, but is not present after that. |
| csm-docker/stable/quay.io/ceph/ceph:v16.2.9 | This version of Ceph (Pacific) is pinned for the CSM 1.3 release. The next CSM version released as a part of a recipe will support Ceph (Quincy). |
| csm-docker/stable/quay.io/cephcsi/cephcsi:v3.6.2 | Upstream fixes became available after CSM 1.3 release content was frozen. |
| csm-docker/stable/docker.io/bitnami/external-dns:0.10.2-debian-10-r23 | Upstream fixes are needed and are not yet available. |
| csm-docker/stable/quay.io/kiali/kiali-operator:v1.36.7 | The updated Red Hat base image is available but not pulled in by upstream. See procedure to Remove Kiali if desired. |
| csm-docker/stable/k8s.gcr.io/kube-proxy:v1.20.13 | This version is needed for the upgrade procedure but will not be running after the upgrade has been completed. |
| csm-docker/stable/k8s.gcr.io/kube-proxy:v1.21.12 | Upstream fixes are needed and are not yet available for the 1.21.12 version of Kubernetes included in CSM 1.3. |
| csm-docker/stable/cray-postgres-db-backup:0.2.3 | To ensure success of postgres restore functionality, this image is pinned to psql v12. |
| csm-docker/stable/docker.io/nfvpe/multus:v3.7 | Upstream fixes are needed and are not yet available; however, we have engaged with the project to make a reduced-vulnerability version available. |
| csm-docker/stable/docker.io/sonatype/nexus3:3.38.0-1 | Upstream fixes are needed to the base image in order to address the remaining vulnerabilities. |
| csm-docker/stable/cray-uas-mgr:1.21.0 | This will be addressed in a future version of CSM. |