Cray System Management (CSM) 1.6.2 Release Notes

This page documents the changes introduced by this patch, compared to the previous patch version of CSM.

For the main CSM 1.6 release notes page, including links to other patch release notes, see CSM 1.6 release notes.

Additions and improvements
Bug fixes
Known issues

Additions and improvements

General

Configuration Framework Service (CFS): Add bulk component update option to Cray CLI
- For more information, see Managing many components

Security

Fixed CVEs in the cmsdev test tool, cray-console-node, and cray-console-operator
Fixed CVEs in oauth2 proxies by disabling TLS 1.2 support

Test

Add CFS node personalization to the Barebones Image Boot Test
Improved testing resilience in the spire_check_key_id_in_jwks goss test
Modified k8s_nodes_ready_check.sh to not fail when a node is in Ready,SchedulingDisabled state
Modified velero_backups_check.sh to not fail if a newer, successful backup exists
Modified run_hms_ct_tests.sh to handle concurrency better
Fixed intermittent failures when running check_key_id_in_jwks.sh
Added retry logic to goss-postgresql-syncfailed.yaml to prevent intermittent false positives
Added retry logic to postgres_clusters_running.sh to prevent intermittent false positives
Added fix to prevent false positives in the Hardware State Manager (SMD) CT tests when components are in the DiscoveryStarted state when the tests are launched
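
The node readiness and retry changes listed above can be illustrated with a minimal sketch. It is not the code of the CSM test scripts: the function names `node_is_ready` and `retry` are hypothetical, and the sketch only assumes that a node status is reported as a comma-separated string such as Ready,SchedulingDisabled.

```python
import time
from typing import Callable


def node_is_ready(status: str) -> bool:
    """Return True when a comma-separated node status includes Ready.

    Hypothetical helper: a cordoned node reports "Ready,SchedulingDisabled",
    which is treated as passing; "NotReady" is still a failure.
    """
    conditions = [part.strip() for part in status.split(",")]
    return "Ready" in conditions


def retry(check: Callable[[], bool], attempts: int = 3,
          delay_seconds: float = 0.0) -> bool:
    """Run a check up to `attempts` times and pass on the first success.

    Hypothetical helper: retrying keeps a transient condition from being
    reported as a failure, at the cost of a slower negative result.
    """
    for attempt in range(attempts):
        if check():
            return True
        # Wait only between attempts, not after the final one.
        if attempt < attempts - 1:
            time.sleep(delay_seconds)
    return False
```

With these assumptions, a check such as `retry(lambda: node_is_ready(status), attempts=3, delay_seconds=5)` fails only if the node never reports Ready across all attempts.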

Bug fixes

Fixes to the Boot Script Service (BSS) and cfs-trust to allow large scale parallel boots of compute nodes
Fix bug preventing CFS batcher from starting sessions on very large scale systems
Boot Orchestration Service (BOS): Gracefully handle requests to validate session templates which do not exist
Fix bug that prevented Spire xname validation from being enabled due to workloads files
Fixes for several concurrency issues in Redfish Translation Service (RTS) that will reduce the number of pod restarts

Known issues

After updating Paradise BMC firmware, the hmcollector-poll service will lose event subscriptions and must be restarted
- See Updating Foxconn Paradise Nodes with FAS for details on how to do this
cfs-api pods in CrashLoopBackOff (CLBO) state during CSM install.
- When installing CSM 1.6, cray-shared-kafka-kafka- pods in the services namespace fail to come up, which leaves cfs-api pods in CLBO state.
- A workaround is presented in CFS API pods in CLBO.
istio-proxy containers fail with too many open files.
- This may happen when any pod with istio injection enabled is started.
- A workaround is presented in Istio-Proxy failing with too many open files
Install and Upgrade Framework (IUF) does not run the next stage for an activity
- During CSM upgrade, IUF reports that multiple sessions are in progress for an activity.
- A workaround is presented in IUF does not run the next stage for an activity
iSCSI based boot content projection may fail if the image to be projected does not have an etag
- A workaround is presented in iSCSI SBPS boot failure
CSM Automatic Network Utility (CANU) 1.8.0 and later are known to cause a brief Node Management Network (NMN) outage.
- CANU 1.8.0 and later separate administrative traffic from user traffic on the management network by adding a new VRF and OSPF area. The NMN is briefly unavailable until all switches are updated and the new routes are propagated.
- IP addressing does not change, but NMN traffic will flow over a new isolated VRF channel.
- The length of the outage depends on how long it takes to apply the new switch configurations to all management network switches; OSPF propagates routes within seconds.
- Because this affects liquid-cooled Mountain cabinets, running jobs may be affected. A dedicated outage window is highly recommended for applying these changes.
System Monitoring Application (SMA) 1.10.15 and later includes an upgraded LDMS that introduces an incompatibility with configuration files used in prior versions.
- When upgrading from an older SMA version to a version with this new LDMS, the administrator must change the configuration files.
- A workaround is presented as an Action in the deliver-product stage in the IUF Stage Details for SMA section of the HPE Cray Supercomputing EX System Monitoring Application Installation Guide.
Services that use PostgreSQL may fail when a Kubernetes master node is rebooted or rebuilt.
- A PostgreSQL database may fail over without clients reconnecting to the new cluster leader.
- A workaround is presented in PostgreSQL Database is in Recovery
cray-uas-mgr may still be running on a system upgraded from CSM 1.5.
- UAI was removed in CSM 1.6.0 but systems upgraded from CSM 1.5 may still have the cray-uas-mgr service and associated etcd cluster present.
- A workaround is presented in Remove User Access Service.

For a full list of known issues, see Known issues.