Cray System Management (CSM) - Release Notes
CSM 1.0.11
The following lists enumerate the improvements and enhancements since CSM 1.0.10.
New functionality in 1.0.11
- Backport current cabinet expansion procedure from CSM 1.2 into the CSM 1.0 docs
Bug fixes in 1.0.11
- SECURITY: CVE-2022-0185: Linux kernel buffer overflow/container escape
- SECURITY: CVE-2021-4034: pwnkit: Local Privilege Escalation in polkit's pkexec
- SECURITY: Address log4j vulnerabilities with regard to kafka in the CSM 1.0.11 patch
- SECURITY: Update strimzi operator 0.15.0 to use patched kafka images
- Bug Fix: CSM upgrade incorrectly records CPS nodes
- Bug Fix: update_bss_metadata.sh is not executable
- Bug Fix: Upgrade waiting for boot error
- Bug Fix: Upgrade to CSM 1.0.11, m002 upgrade fails to join Kubernetes
- Bug Fix: Automate reinitialization of cluster members found to be lagging in CSM 1.0 ncn-upgrade-k8s-worker.sh
- Bug Fix: USB device wiped on ncn-m001 during CSM 1.0.0 -> CSM 1.0.1 upgrade
- Bug Fix: cray-dns-unbound in CrashLoopBackOff (CLBO) after restart during CSM 1.0.0 -> 1.0.11 upgrade
- Bug Fix: Reboot of storage node (s003) halted for RAID sync/health
- Bug Fix: Clock skew on storage node during reboot test
- Bug Fix: CSM 1.0 Broker UAI Image 1.2.3 is missing openssh
- Documentation Fix: Procedure to set metal.nowipe before and after a management node rebuild is missing steps
- Documentation Fix: First curl command in update_management_network.md has incorrect output
- Documentation Fix: NCN rebuild procedure missing a workaround to prevent duplicate IP addresses on NCNs
- Documentation Fix: CSM 1.0.0 to CSM 1.0.11 update is supported
- Documentation Fix: bootstrap_livecd_remote_iso.md - Copy of typescript log uses incorrect directory path
- Documentation Fix: CFS NCN personalization will always fail on 1.0.11 if coming from 1.0.10
- Documentation Fix: Goss server RPM is missing in desired location
CSM 1.0.10
The following lists enumerate the improvements and enhancements since CSM 1.0.1.
New functionality in 1.0.10
- Adds hardware discovery and power control for Bard Peak Olympus blades. (Power-capping not supported yet.)
Bug fixes in 1.0.10
- Fixes an intermittent issue where kernel dumps were not delivered because the CA certificate for Spire needed to be reset.
- Fixes an intermittent issue where PXE booting of NCNs was timing out.
- Fixes an intermittent UX issue where the console was replaying output.
- Fixes an issue with FAS loader not handling the new Slingshot 1.6 firmware version scheme.
- Fixes an issue where Ceph services were not auto-starting after a reboot of a storage node.
- Fixes an issue where later SSH connections to Ceph were producing host key verification errors.
- Fixes an issue where DNS requests would briefly fail during hardware discovery or an upgrade.
- Fixes an issue preventing SCSD from changing root credentials for DL325/385.
- Fixes an intermittent issue where Gigabyte firmware updates via FAS would return an error.
- Fixes a rare issue where Nexus would not be available when scaling down to two nodes.
- Fixes an issue where the boot order for Gigabyte NCNs was not persisting after a reboot or reinstall.
- Fixes an intermittent issue where storage nodes would have clock skew during fresh install.
CSM 1.0.1
The following lists enumerate major improvements since CSM 0.9.
What’s new in 1.0.1
New functionality in 1.0.1
- Scale up to 6000 nodes is supported.
- Conman has been updated to use a deployment model that handles a larger scale.
- Additional scaling improvements have been incorporated into several services, including Unbound, Kea, Hardware State Manager (HSM), and Spire.
- Upgrades between major versions are now allowed.
- The CSM installation process has been improved.
- Over 20 workarounds from the prior CSM release have been removed.
- A significant number of installation-related enhancements have been integrated, both functionally and through documentation.
- Installation validation testing has been improved by updating existing validation tests and adding new ones.
Enhanced documentation in 1.0.1
- CSM operational documentation has been changed to markdown format for standardized deployment.
- CSM Administration Guides and Operational Procedures are now also available online.
  - Searchable HTML
  - Source
- Improvements have been made to the documentation regarding installation, operations, and troubleshooting.
- Contains backup and restore procedures for Keycloak, Spire and SLS services.
- Provides guidance on setting the timezone for customer systems.
- Explains how to build application node xnames.
New hardware support in 1.0.1
- AMD Rome-Based HPE Apollo 6500 XL675d Gen10+ with NVIDIA 40GB A100 GPU for use as a Compute Node.
- AMD Rome-Based HPE Apollo 6500 XL645d Gen10+ with NVIDIA 40GB A100 GPU for use as a Compute Node.
- AMD Rome-Based HPE DL385 Gen10+ with NVIDIA 40GB A100 GPU for use as a User Access Node.
- AMD Rome-Based HPE DL385 Gen10+ with AMD MI100 GPU for use as a User Access Node.
- AMD Milan-Based HPE DL385 with NVIDIA 40 GB A100 GPU for use as a User Access Node.
- AMD Milan-Based HPE Apollo 6500/XL645d Gen10+ with NVIDIA 80GB A100 GPU for use as a Compute Node.
- AMD Milan-Based Windom Blade with NVIDIA 40 GB A100 GPU for use as a Compute Node.
- AMD Milan-Based Grizzly Peak Blade with NVIDIA 40 GB A100 GPU for use as a Compute Node.
- AMD Milan-Based Grizzly Peak Blade with NVIDIA 80 GB A100 GPU for use as a Compute Node.
- Aruba CX8325, 8360, and 6300M network switches
New software in 1.0.1
- Istio version 1.7.8, running in distroless mode
- containerd version 1.4.3
- Kubernetes version 1.19.9
- Weave version 2.8.0
- Etcd API version 3.4 (etcdctl version 3.4.14)
- CoreDNS version 1.7.0
- Ceph version 15.2.12-83-g528da226523 (octopus)
Security improvements in 1.0.1
- Ansible Plays have been created to update management node Operating System Passwords and SSH Keys.
- A significant number of security enhancements have been implemented to eliminate vulnerabilities and provide security hardening:
  - Removal of clear-text passwords in CSM install scripts
  - Incorporation of trusted-base operating systems in containers
  - Resolution of many critical security CVEs
Customer requested enhancements in 1.0.1
- Error logging for the BOS session template must be improved.
- IMS must provide a way to clean up sessions without the use of jq and xargs (see the sketch after this list).
- CFS batcher needs a maximum limit for session creation suspension.
- BOS session should help to identify the resulting CFS job.
- DHCP lease time should be increased. (It was increased to 3600 seconds.)
- Helm charts should have a way to be automatically patched during Shasta installation.
- HSM should add a timestamp to State Change Notifications (SCN) data before publishing to the Kafka topic cray-hmsstatechange-notifications.
- End-of-Life Alpine and nginx container images must be removed for security purposes.
- CAPMC simulates reinitialization on hardware that does not support restart; see CAPMC reinit and configuration for more information.
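For context on the IMS cleanup request above, the following is a minimal sketch of the kind of jq/xargs pipeline that administrators previously had to assemble by hand; the status value filtered on is an assumption and may differ on a given system.

```bash
# Illustrative only: the sort of jq/xargs cleanup pipeline the IMS
# enhancement is meant to replace. Assumes an initialized, authenticated
# cray CLI on the node; the "error" status value is an assumption.
cray ims jobs list --format json \
  | jq -r '.[] | select(.status == "error") | .id' \
  | xargs -r -n 1 cray ims jobs delete
```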
Bug fixes in 1.0.1
The following list enumerates the more important issues that were found and fixed in CSM 1.0.1. In total, there were more than 34 customer-reported issues and more than 350 development critical issues fixed in this release.
Critical issues resolved:
- Prometheus cannot scrape kubelet/kube-proxy.
- CFS can run layers out of order.
- The BSS - BOS session boot parameter update seems slow.
- Compute nodes fail to PXE boot and drop into the EFI shell.
- The NCN personalization of ncn-m002 and m003 seems to be in an endless loop.
- FAS is claiming to have worked on the CMM during a 1.4 install, but it did not.
- The cray-hbtd service is reporting “Telemetry bus not accepting messages, heartbeat event not sent.”
- When talking to SMD, commands are failing with an Err 503.
- The command cray hsm inventory ethernetInterfaces update --component-id is rejected as invalid.
- There is a high rate of failed DNS queries being forwarded off-system by unbound.
- NCN worker node’s HSN connection is not being renamed to hsn0 or hsn1.
- During large-scale node boots, a small number of nodes are stuck with “fetching authentication token failed” errors.
- Zypper install is broken because it tries to talk to suse.com.
- Node exporters failed to parse mountinfo and are not running on storage NCNs.
- The cfs-hwsync-agent is repeatedly dying with RC 137 due to an OOM issue.
- The Gitea PVC is unavailable following a full system cold reboot.
- sysmgmt-health references docker.io/jimmidyson/configmap-reload:v0.3.0, which cannot be loaded.
- Upstream NTP is not appearing in the chronyd configuration.
- Incorrect packaged firmware metadata is being reported by FAS for the NCN’s iLO/BIOS.
- At scale, there is a DNS slowdown when attempting to reboot all of the nodes, causing DNS lookup failures.
- cray-sysmgmt-health-kube-state-metrics uses an image of kube-state-metrics:v1.9.6, which contains a bug that causes alerts.
- CFS teardown reports swapped image results.
- The unbound DNS manager job is not functioning, so compute nodes cannot be reached.
- The Keycloak service crashed because it was out of Java heap space.
- Unbound should not forward to site DNS for Shasta zones.
- Kubernetes pod priority support needs to be in the Kubernetes image.
- CFS is running multiple sessions for the same nodes at the same time.
- The CFS CLI does not allow tags on session creation.
- CFS should check whether the configuration is valid and exists when a session is created.
- CFS does not set the session start time until after the job starts.
- CFS will not list pending sessions.
- MEDS should not overwrite a component's credentials when an xname becomes present again.
- The Cray HSM locks command locked more nodes than specified.
- HSM crashes when discovering Bard Peak.
- Resource limits are hit on three NCN systems.
- Conman is unable to connect to compute consoles.
- For better reliability, the orphan stratum in the chrony configuration needed to be adjusted (see the sketch after this list).
- The UEFI boot order reverts/restores on every reboot on an HPE DL325.
and many more…
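As background for the chrony orphan stratum fix above, the snippet below is a minimal sketch of how to inspect that setting and what such a directive typically looks like; the file paths and stratum value are illustrative assumptions, not the values CSM ships.

```bash
# Illustrative only: look for an orphan stratum directive in the chrony
# configuration. The paths and stratum value are assumptions, not CSM defaults.
grep -R "orphan" /etc/chrony.conf /etc/chrony.d/ 2>/dev/null

# A typical directive: if all upstream NTP sources are lost, the node keeps
# serving time locally at the stated stratum so clients stay in sync.
#   local stratum 10 orphan
```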
Known issues
- When a Boot Orchestration Service (BOS) session fails, it may output a message in the Boot Orchestration Agent (BOA) log associated with that session. The message contains a command that instructs the user how to re-run the failed session, limited to the nodes that failed during that session. The command is faulty; this known issue tracks correcting it.
- Under some circumstances, Configuration Framework Service (CFS) sessions can get stuck in a pending state, never completing and potentially blocking other sessions. This known issue addresses cleaning up those sessions.
- The branch parameter in CFS configurations may not work, and setting it will instead return an error. Continue setting the Git commit hash instead (see the sketch after this list).
- After a boot or reboot, a few CFS pods may continue running even after they have finished and never go away. For more information, see Orphaned CFS Pods After Booting or Rebooting.
- Intermittently, kernel dumps are not delivered because the CA certificate for Spire needs to be reset.
- Intermittently, PXE booting of NCNs times out.
- Intermittently, the console replays output.
- The FAS loader is not handling the new Slingshot 1.6 firmware because of its new version scheme.
- Ceph services do not auto-start after a reboot of a storage node.
- Intermittently, SSH connections to Ceph show host key verification errors.
- Intermittently, DNS requests briefly fail during hardware discovery or an upgrade.
- SCSD is not able to change root credentials for DL325/385 due to a bug in the 11.2021 iLO firmware.
- Intermittently, Gigabyte firmware updates via FAS show an error.
- Rarely, Nexus is not available when scaling down NCN workers to two nodes.
- The boot order for Gigabyte NCNs does not persist after a reboot or reinstall.
- Intermittently, storage nodes have clock skew during fresh install.
- kube-multus pods may fail to restart due to ImagePullBackOff. For more information, see kube-multus pod is in ImagePullBackOff.
- Power capping Olympus and River compute hardware via CAPMC is not supported.
- On fresh install, API calls to Gitea/VCS may give 401 Errors. See Gitea/VCS 401 Errors for more information.
- Console logging may fill all available space for console log files. See Console logs filling up available storage for more information.
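Relating to the branch parameter known issue above, here is a minimal sketch of pinning a CFS configuration layer to a Git commit hash rather than a branch; the configuration name, repository URL, commit hash, and playbook are placeholders, not values from this release.

```bash
# Hypothetical example: define a CFS configuration layer pinned to a commit
# hash instead of a branch. All values below are placeholders.
cat > configuration.json <<'EOF'
{
  "layers": [
    {
      "name": "example-layer",
      "cloneUrl": "https://api-gw-service-nmn.local/vcs/cray/example-config.git",
      "commit": "<git-commit-hash>",
      "playbook": "site.yml"
    }
  ]
}
EOF

# Create or update the configuration from the file.
cray cfs configurations update example-config --file configuration.json
```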
For a full list of known issues, see Known issues.