Cray System Management (CSM) - Release Notes
CSM 1.0.11
The following lists enumerate the improvements and enhancements since CSM 1.0.10.
New Functionality
- Backport current cabinet expansion procedure from CSM 1.2 into the CSM 1.0 docs
Bug Fixes
- SECURITY: CVE-2022-0185: Linux kernel buffer overflow/container escape
- SECURITY: CVE-2021-4034: pwnkit: Local Privilege Escalation in polkit’s pkexec
- SECURITY: Address Log4j vulnerabilities related to Kafka in the CSM 1.0.11 patch
- SECURITY: Update the Strimzi operator 0.15.0 to use patched Kafka images
- Bug Fix: csm upgrade incorrectly records CPS nodes
- Bug Fix: update_bss_metadata.sh is not executable
- Bug Fix: Upgrade waiting for boot error
- Bug Fix: During upgrade to csm-1.0.11, ncn-m002 fails to join Kubernetes
- Bug Fix: Automate reinit of cluster members found to be lagging in csm 1.0 ncn-upgrade-k8s-worker.sh
- Bug Fix: USB device wiped on ncn-m001 during csm-1.0.0 -> csm-1.0.1 upgrade
- Bug Fix: cray-dns-unbound in CLBO after restart during csm-1.0.0 -> 1.0.11 upgrade
- Bug Fix: Reboot of storage node (s003) halted for raid sync/health
- Bug Fix: Clock skew on storage node during reboot test
- Bug Fix: csm-1.0 Broker UAI Image 1.2.3 is missing openssh
- Documentation Fix: Procedure to set metal.nowipe before and after a management node rebuild is missing steps
- Documentation Fix: First curl command in update_management_network.md has incorrect output
- Documentation Fix: NCN rebuild procedure is missing a workaround to prevent duplicate IPs on NCNs
- Documentation Fix: csm-1.0.0 to csm-1.0.11 update is supported
- Documentation Fix: bootstrap_livecd_remote_iso.md - Copy of typescript log uses incorrect directory path
- Documentation Fix: CFS ncn-personalization will always fail on 1.0.11 if coming from 1.0.10
- Documentation Fix: Goss server RPM is missing from its expected location
CSM 1.0.10
The following lists enumerate the improvements and enhancements since CSM 1.0.1.
New Functionality
- Adds hardware discovery and power control for Bard Peak Olympus blades. (Power-capping not supported yet.)
Bug Fixes
- Fixes an intermittent issue where kernel dumps would not deliver because the CA cert for Spire needed to be reset.
- Fixes an intermittent issue where PXE booting of NCNs was timing out.
- Fixes an intermittent UX issue where Console was replaying output.
- Fixes an issue with FAS loader not handling the new Slingshot 1.6 firmware version scheme.
- Fixes an issue where Ceph services were not auto-starting after a reboot of a storage node.
- Fixes an issue where later ssh connections to Ceph were producing host key verification errors.
- Fixes an issue where DNS requests would briefly fail during hardware discovery or an upgrade.
- Fixes an issue preventing SCSD changing root credentials for DL325/385.
- Fixes an intermittent issue where Gigabyte firmware updates via FAS would return an error.
- Fixes a rare issue where Nexus would not be available when scaling down to two nodes.
- Fixes an issue where the boot order for Gigabyte NCNs was not persisting after a reboot or reinstall.
- Fixes an intermittent issue where storage nodes would have clock skew during fresh install.
CSM 1.0.1
The following lists enumerate major improvements since CSM v0.9.x.
What’s New
Bug Fixes
The following list enumerates the more important issues that were found and fixed in CSM v1.0.1. In total, there were more than 34 customer-reported issues and more than 350 development critical issues fixed in this release.
Critical Issues Resolved:
- Prometheus cannot scrape kubelet/kube-proxy.
- CFS can run layers out of order.
- BOS session boot parameter updates via cray-bss seem slow.
- Compute nodes fail to PXE boot and drop into the EFI shell.
- The ncn-personalization of ncn-m002 and ncn-m003 seems to be in an endless loop.
- FAS claims to have updated the CMM during a V1.4 install, but it did not.
- The cray-hbtd is reporting “Telemetry bus not accepting messages, heartbeat event not sent.”
- When talking to SMD, commands are failing with a 503 error.
- The command “cray hsm inventory ethernetInterfaces update --component-id” is rejected as invalid.
- There is a high rate of failed DNS queries being forwarded off-system by unbound.
- NCN worker node’s HSN connection is not being renamed to hsn0 or hsn1.
- During large-scale node boots, a small number of nodes are stuck with a “fetching auth token failed” error.
- Zypper install is broken because it tries to talk to suse.com.
- Node exporters fail to parse mountinfo and are not running on ncn-s0xx nodes.
- The cfs-hwsync-agent is repeatedly dying with RC=137 due to an OOM issue.
- The gitea pvc is unavailable following a full system cold reboot.
- sysmgmt-health references docker.io/jimmidyson/configmap-reload:v0.3.0, which cannot be loaded.
- Upstream NTP is not appearing in the chronyd config.
- Incorrect packaged firmware metadata is being reported by FAS for the NCN’s iLO/BIOS.
- At scale, there is a DNS slowdown when attempting to reboot all of the nodes, causing DNS lookup failures.
- cray-sysmgmt-health-kube-state-metrics uses an image of kube-state-metrics:v1.9.6, which contains a bug that causes alerts.
- CFS teardown reports swapped image results.
- The unbound DNS manager job is not functioning, so compute nodes cannot be reached.
- The Keycloak service crashed because it was out of Java heap space.
- Unbound should not forward to site DNS for Shasta zones.
- k8s pod priority support needs to be in the k8s image.
- CFS is running multiple sessions for the same nodes at the same time.
- The CFS CLI does not allow tags on session create.
- CFS should check if the configuration is valid/exists when a session is created.
- CFS does not set session start time until after job starts.
- CFS will not list pending sessions.
- MEDS should not overwrite a component's credentials when an xname becomes present again.
- The Cray HSM locks command locked more nodes than specified.
- HSM crashes when discovering Bard Peak.
- Resource limits are hit on three NCN systems.
- Conman is unable to connect to compute consoles.
- For better reliability, the orphan stratum in the chronyd config needed to be adjusted (see the chrony.conf sketch after this list).
- The UEFI boot order reverts/restores on every reboot on an HPE DL325.
and many more…
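Two of the items above concern the chronyd configuration on the NCNs: an upstream NTP server that should appear in the generated config, and the orphan stratum used when upstream sources are unreachable. The snippet below is a minimal illustrative sketch of those two settings only, not the configuration shipped with CSM; the file path, server name, and stratum value are placeholders.

    # /etc/chrony.d/example.conf (illustrative sketch; path and values are placeholders)
    # Upstream NTP server expected to appear in the chronyd config
    server ntp.example.com iburst

    # Keep serving time to local clients when upstream is unreachable,
    # acting as an orphan source at the given stratum
    local stratum 10 orphan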
Known Issues
- Incorrect output for BOS command rerun: When a Boot Orchestration Service (BOS) session fails, it may output a message in the Boot Orchestration Agent (BOA) log associated with that session. This output contains a command that instructs the user how to re-run the failed session, listing only the nodes that failed during that session. The command it suggests is faulty; this issue tracks correcting it.
- CFS session stuck in pending: Under some circumstances, Configuration Framework Service (CFS) sessions can get stuck in a pending state, never completing and potentially blocking other sessions. This issue tracks cleaning up those sessions (see the session cleanup sketch after this list).
- The branch parameter in CFS configurations may not work, and setting it may return an error. Continue to set the Git commit hash instead (see the configuration layer sketch after this list).
- After a boot or reboot, a few CFS pods may continue running even after they have finished, and never go away. For more information, see Orphaned CFS Pods After Booting or Rebooting.
- Intermittently, kernel dumps do not deliver because the CA cert for Spire needs to be reset.
- Intermittently, PXE booting of NCNs times out.
- Intermittently, Console replays output.
- The FAS loader does not handle the new Slingshot 1.6 firmware because of its new version scheme.
- Ceph services do not auto-start after a reboot of a storage node.
- Intermittently, ssh connections to Ceph show host key verification errors.
- Intermittently, DNS requests briefly fail during hardware discovery or an upgrade.
- SCSD is not able to change root credentials for DL325/385 due to a bug in the 11.2021 iLO firmware.
- Intermittently, Gigabyte firmware updates via FAS show an error.
- Rarely, Nexus is not available when scaling down NCN workers to two nodes.
- The boot order for Gigabyte NCNs does not persist after a reboot or reinstall.
- Intermittently, storage nodes have clock skew during fresh install.
- Kube-multus pods may fail to restart due to ImagePullBackOff. For more information see Kube-multus pod is in ImagePullBackOff.
- Power capping Olympus and River compute hardware via CAPMC is not supported.
- On fresh install, API calls to Gitea/VCS may return 401 errors. See Gitea/VCS 401 Errors for more information.
- Console logging may fill all available space for console log files. See Console logs filling up available storage for more information.
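For the stuck-in-pending CFS sessions noted above, one cleanup approach is to list the sessions, identify those that never left the pending state, and delete them. The commands below are a minimal sketch using the Cray CLI; the session name is a placeholder, and the exact JSON fields may differ between CFS versions.

    # List CFS sessions with their current state (field names may vary by CFS version)
    cray cfs sessions list --format json | jq -r '.[] | [.name, .status.session.status] | @tsv'

    # Delete a session that is stuck in pending (placeholder name)
    cray cfs sessions delete example-stuck-session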
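For the branch parameter limitation noted above, the workaround is to pin each CFS configuration layer to a Git commit hash rather than a branch. The JSON below is a minimal sketch of a single-layer configuration; the layer name, clone URL, commit hash, and playbook are placeholders.

    {
      "layers": [
        {
          "name": "example-layer",
          "cloneUrl": "https://api-gw-service-nmn.local/vcs/cray/example-config.git",
          "commit": "0123456789abcdef0123456789abcdef01234567",
          "playbook": "site.yml"
        }
      ]
    }

Such a file could then be applied with, for example, cray cfs configurations update example-config --file layers.json, where the configuration name and file name are also placeholders.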