Cray System Management (CSM) 1.6.2 Release Notes

This page documents the changes introduced by this patch, compared to the previous patch version of CSM.

For the main CSM 1.6 release notes page, including links to other patch release notes, see CSM 1.6 release notes.

Additions and improvements

General

Security

  • Fixed CVEs in the cmsdev test tool, cray-console-node, and cray-console-operator
  • Fixed CVEs in oauth2 proxies by disabling TLS1.2 support

Test

  • Added CFS node personalization to the Barebones Image Boot Test
  • Improved testing resilience in the spire_check_key_id_in_jwks goss test
  • Modified k8s_nodes_ready_check.sh to not fail when a node is in Ready,SchedulingDisabled state
  • Modified velero_backups_check.sh to not fail if a newer, successful backup exists
  • Modified run_hms_ct_tests.sh to handle concurrency better
  • Fixed intermittent failures when running check_key_id_in_jwks.sh
  • Added retry logic to goss-postgresql-syncfailed.yaml to prevent intermittent false positives
  • Added retry logic to postgres_clusters_running.sh to prevent intermittent false positives
  • Fixed false positives in the Hardware State Manager (SMD) CT tests that occurred when components were in the DiscoveryStarted state at the time the tests were launched
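
The two recurring ideas in the test changes above (accepting a Ready,SchedulingDisabled node as ready, and retrying a check before reporting failure) can be illustrated with a minimal sketch. This is not the code shipped in CSM: the actual checks are shell scripts and goss definitions, and the names node_is_ready and retry, the attempt count, and the delay are hypothetical, chosen only for illustration. The sketch assumes the node status is a comma-separated string, as shown in the STATUS column of kubectl get nodes.

```python
import time
from typing import Callable


def node_is_ready(status: str) -> bool:
    """Return True if 'Ready' is one of the comma-separated status fields.

    'Ready,SchedulingDisabled' (a cordoned but healthy node) counts as ready;
    'NotReady' and 'NotReady,SchedulingDisabled' do not.
    """
    fields = [field.strip() for field in status.split(",")]
    return "Ready" in fields


def retry(check: Callable[[], bool], attempts: int = 3, delay_seconds: float = 0.0) -> bool:
    """Run check up to `attempts` times; succeed as soon as one run passes.

    Hypothetical helper: a transient failure is only reported if every
    attempt fails, which reduces intermittent false positives.
    """
    for attempt in range(attempts):
        if check():
            return True
        # Sleep only between attempts, not after the final one.
        if attempt < attempts - 1:
            time.sleep(delay_seconds)
    return False
```

For example, retry(lambda: node_is_ready("Ready,SchedulingDisabled")) passes on the first attempt, while a check that fails every time is reported as a failure only after all attempts are exhausted.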

Bug fixes

Known issues

  • After updating Paradise BMC firmware, the hmcollector-poll service will lose event subscriptions and must be restarted
  • cfs-api pods in CrashLoopBackOff (CLBO) state during CSM install.
    • When installing CSM 1.6, cray-shared-kafka-kafka- pods in the services namespace fail to come up, which leaves the cfs-api pods in CLBO state.
    • A workaround is presented in CFS API pods in CLBO.
  • istio-proxy containers fail with too many open files.
  • Install and Upgrade Framework (IUF) does not run the next stage for an activity
  • iSCSI based boot content projection may fail if the image to be projected does not have an etag
  • CSM Automatic Network Utility (CANU) versions 1.8.0 and later are known to cause a brief Node Management Network (NMN) outage.
    • CANU 1.8.0 and later separate administrative traffic from user traffic on the management network by adding a new VRF and OSPF area. Until all switches are updated and the new routes are propagated, there is a brief NMN outage. IP addressing does not change, but NMN traffic will flow over a new isolated VRF channel. The length of the outage depends on the time needed to apply the new switch configurations to all management network switches; once they are applied, OSPF will propagate routes within seconds. As this affects liquid-cooled Mountain cabinets, running jobs may be affected. A dedicated outage window is highly recommended for applying these changes.
  • System Monitoring Application (SMA) 1.10.15 and later includes an upgraded LDMS that introduces an incompatibility with configuration files used in prior versions.
    • When upgrading from an older SMA version to a version with this new LDMS, the administrator must change the configuration files.
    • A workaround is presented as an Action in the deliver-product stage in the IUF Stage Details for SMA section of the HPE Cray Supercomputing EX System Monitoring Application Installation Guide.
  • Services that use PostgreSQL may fail when a Kubernetes master node is rebooted or rebuilt.
  • cray-uas-mgr may still be running on a system upgraded from CSM 1.5.
    • UAI was removed in CSM 1.6.0, but systems upgraded from CSM 1.5 may still have the cray-uas-mgr service and its associated etcd cluster present.
    • A workaround is presented in Remove User Access Service.

For a full list of known issues, see Known issues.