This page guides an administrator through installing Cray System Management (CSM) on an HPE Cray EX system. Fresh installations on bare metal and re-installations of CSM must follow this guide in order.
Before installing, review the Release Notes.
Introduced in CSM 1.2, a major feature of CSM is the Bifurcated CAN (BICAN). The BICAN is designed to separate administrative network traffic from user network traffic. More information can be found on the BICAN Technical Summary. Review the BICAN summary before continuing with the CSM install. For detailed BICAN documentation, see BICAN Technical Details.
Servers with NVIDIA CPUs and GPUs are not supported by CSM 1.6.0.
The January 2025 HPE HPC continuous software stack releases (CSM 1.6.0) are for HPE Cray EX systems without NVIDIA CPUs and GPUs. For HPE Cray EX systems with NVIDIA CPUs and GPUs, please use the August 2024 (CSM 1.5.x) HPE HPC continuous software stack. These software stacks were validated with NVIDIA HPC SDK 24.3.
The March 2025 HPE HPC continuous and extended software stack releases will be validated with NVIDIA HPC SDK 24.11. The March 2025 (CSM 1.6.1) software stacks will support all HPE Cray EX systems.
In the Pre-installation section of the install, information about the HPE Cray EX system and the site is used to prepare the configuration payload. An initial node called the PIT node is then set up to bootstrap the installation process. It is called the PIT node because the Pre-Install Toolkit is installed there. The management network switches are also configured in this section.
In the Installation section of the install, the other management nodes are deployed with an operating system and the software required to create a Kubernetes cluster utilizing Ceph storage. The CSM services are then deployed in the Kubernetes cluster to provide essential software infrastructure including the API gateway and many micro-services with REST APIs for managing the system. Administrative access is then configured, and the health of the system is validated before proceeding with operational tasks like checking and updating firmware on system components and preparing compute nodes.
The Post-installation section covers tasks which are performed after the main install procedure is completed.
The final section, Installation of additional HPE Cray EX software products describes how to install additional HPE Cray EX software products using the Install and Upgrade Framework (IUF).
The topics in this chapter must be completed as part of an ordered procedure, so they are shown here as numbered topics.
NOTE
If problems are encountered during the installation, see Troubleshooting installation problems and the Cray System Management (CSM) Administration Guide for assistance.
This section will guide the administrator through creating and setting up the Cray Pre-Install Toolkit (PIT).
Fresh installations may start at the Boot installation environment section. Re-installations have additional steps to complete in the Preparing for a re-installation section.
When reinstalling a system, the existing cluster must be wiped and powered down.
See Prepare Management Nodes, and then come back and proceed to the Pre-Installation guide.
These steps walk the user through properly setting up an HPE Cray supercomputer for an installation.
See Pre-installation.
See Boot installation environment.
See Download and extract the CSM tarball.
See Create system configuration.
See Validate the LiveCD.
IMPORTANT
The HMS Discovery hardware discovery process, Power Control Service (PCS)/Redfish Translation Service (RTS) management switch availability monitoring, and the Prometheus SNMP Exporter depend on SNMP. To ensure that these services function correctly, validate the SNMP settings on the system: ensure that the management network switches have SNMP enabled and that the SNMP credentials configured on the switches match the credentials stored in Vault and customizations.yaml.
If SNMP is misconfigured, then hardware discovery by HMS Discovery Kubernetes cronjob, PCS/RTS management switch availability monitoring, and the Prometheus SNMP Exporter may fail to operate correctly. For more information, see Configure SNMP.
If CSM is being installed to an environment that already has a working management network (such as during a reinstall), then validate that the SNMP credentials seeded into customizations.yaml in the previous Create Baseline System Customization step of the install match the SNMP password configured on the management network switches. If the passwords do not match, then either update customizations.yaml to match the switches, or change the switches to match customizations.yaml. For procedures for either option, see Configure SNMP.
Note: While the CSM Automatic Network Utility (CANU) will typically not overwrite SNMP settings that are manually applied to the management switches, there are certain cases where SNMP configuration can be overwritten or lost (such as when resetting and reconfiguring a switch from factory defaults). To persist the SNMP settings, see CANU Custom Configuration. CANU custom configuration files are used to persist site management network configurations that are intended to take precedence over configurations generated by CANU.
Create a CANU custom configuration that configures SNMP on the management network switches, using the same credentials that were previously used in the Create Baseline System Customization page of the installation. Use this custom configuration with CANU in the next step of the install.
Store the custom configuration in a version control repository along with other configuration assets from the CSM install.
See Configure SNMP for more information about configuring SNMP in CSM.
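The credential check described above can be sketched as a simple comparison. This is illustrative only: the YAML layout, file path, and values below are placeholders, not the exact customizations.yaml schema, and in practice the switch-side value comes from the switch running configuration.

```shell
# Illustrative sketch only: placeholder YAML layout and values, not the
# real customizations.yaml schema. The idea is to compare the SNMP password
# seeded into customizations.yaml against the password configured on the
# management network switches.
cat > /tmp/customizations-excerpt.yaml <<'EOF'
snmp:
  username: snmp-user
  password: snmp-pass
EOF

seeded=$(awk '/password:/ {print $2}' /tmp/customizations-excerpt.yaml)
switch_cred="snmp-pass"   # hypothetical value read from the switch running-config

if [ "$seeded" = "$switch_cred" ]; then
    echo "SNMP credentials match"
else
    echo "Mismatch: update customizations.yaml or reconfigure the switches"
fi
```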
At this point external connectivity has been established, and either bare-metal configurations can be installed or new/updated configurations can be applied.
Most installations will require the following three tasks, although this may vary depending on site-specific settings and procedures.
See Management Network User Guide for information on next steps for a variety of network configuration scenarios.
Note that the configuration of the management network is an advanced task that may require the help of a networking subject matter expert.
The first nodes to deploy are the management nodes. These Non-Compute Nodes (NCNs) will host CSM services that are required for deploying the rest of the supercomputer.
NOTE
The PIT node will join Kubernetes after it is rebooted later in Deploy final NCN.
Now that deployment of the management nodes is complete, with Ceph storage initialized and a Kubernetes cluster running on all worker and master nodes except the PIT node, the CSM services can be installed. The Nexus repository will be populated with artifacts, containerized CSM services will be installed, and a few other configuration steps will be performed.
See Install CSM Services.
After installing all of the CSM services, now validate the health of the management nodes and all CSM services. The reason to do it now is that if there are any problems detected with the core infrastructure or the nodes, it is easy to rewind the installation to Deploy management nodes, because the final NCN has not yet been deployed. In addition, deploying the final NCN successfully requires several CSM services to be working properly.
See Validate CSM Health.
Now that all CSM services have been installed and the CSM health checks have completed, the PIT has served its purpose and the final NCN can be deployed. The node used for the PIT will be rebooted; it will be the final NCN deployed in the CSM install.
See Deploy Final NCN.
Now that all of the CSM services have been installed and the final NCN has been deployed, administrative access can be prepared. This may include setting up the Cray CLI (cray) for administrative commands.
See Configure Administrative Access.
Now that all management nodes have joined the Kubernetes cluster, CSM services have been installed, and administrative access has been enabled, the health of the management nodes and all CSM services should be validated. There are no exceptions to running the tests – all tests should be run now.
This CSM health validation can also be run at other points during the system lifecycle, such as when replacing a management node, checking the health after a management node has rebooted because of a crash, as part of doing a full system power down or power up, or after other types of system maintenance.
See Validate CSM Health.
Now that CSM has been installed and health has been validated, if the system management health monitoring tools (specifically Prometheus) are found to be useful, then email notifications can be configured for specific alerts defined in Prometheus. Prometheus upstream documentation can be leveraged for an Alert Notification Template Reference as well as Notification Template Examples. Currently supported notification types include Slack, Pager Duty, email, or a custom integration via a generic webhook interface.
See Configure Prometheus Email Alert Notifications for an example configuration of an email alert notification for the Postgres replication alerts that are defined on the system.
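As an illustrative sketch of such a configuration (the SMTP host, addresses, credentials, and alert name below are placeholders, not values from a real system), an Alertmanager configuration routing a Postgres replication alert to email might resemble:

```yaml
# Sketch only: SMTP host, addresses, credentials, and alert name are placeholders.
global:
  smtp_smarthost: 'smtp.example.com:587'
  smtp_from: 'alertmanager@example.com'
  smtp_auth_username: 'alertmanager@example.com'
  smtp_auth_password: 'changeme'
route:
  receiver: default-receiver
  routes:
    - match:
        alertname: PostgresqlReplicationLag
      receiver: postgres-email
receivers:
  - name: default-receiver
  - name: postgres-email
    email_configs:
      - to: 'admin@example.com'
```

The routes entry matches on the alert name so that only the replication alerts are emailed, while everything else falls through to the default receiver.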
OPTIONAL: This is an optional step.
This uses the netstat collector from node-exporter and enables monitoring of all the SNMP counters in /proc/net/snmp on NCN nodes.
See Update ceph node-exporter configuration to update the ceph node-exporter configuration to monitor SNMP counters.
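To see which counter groups are involved (assuming a Linux host), the SNMP counters in question are exposed by the kernel in /proc/net/snmp:

```shell
# List the SNMP counter groups exposed by the kernel in /proc/net/snmp
# (Ip, Icmp, Tcp, Udp, and so on). These are the counters that the
# node-exporter netstat collector can scrape.
awk -F: '{print $1}' /proc/net/snmp | sort -u
```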
IMPORTANT: Before firmware can be updated, the HPC Firmware Pack (HFP) must be installed. Refer to the HPE Cray EX System Software Getting Started Guide (S-8000) on the HPE Customer Support Center for more information about how to install the HPE Cray EX HPC Firmware Pack (HFP) product.
The Olympus hardware needs to have recovery firmware loaded to the cray-tftp server in case a BMC loses its firmware. The BMCs are configured to load recovery firmware from a TFTP server.
This procedure does not modify any BMC firmware, but only stages the firmware on the TFTP server for download in the event it is needed.
See Load Olympus BMC Recovery Firmware into TFTP server.
Now that all management nodes and CSM services have been validated as healthy, the firmware on other components in the system can be checked and updated. The Firmware Action Service (FAS) communicates with many devices on the system. FAS can be used to update the firmware for all of the devices it communicates with at once, or specific devices can be targeted for a firmware update.
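FAS actions are described with a JSON payload that selects devices by filters and names the firmware to apply. The fragment below is a sketch, not the authoritative FAS schema; field names and values (device types, targets, time limit) are illustrative and should be checked against the FAS documentation for the installed release.

```json
{
  "overrideDryrun": false,
  "restoreNotPossibleOverride": true,
  "stateComponentFilter": {
    "deviceTypes": ["nodeBMC"]
  },
  "targetFilter": {
    "targets": ["BMC"]
  },
  "command": {
    "version": "latest",
    "tag": "default",
    "overrideImage": "",
    "timeLimit": 1000,
    "description": "Dry run of nodeBMC firmware update"
  }
}
```

With overrideDryrun set to false, the action is evaluated as a dry run, which is a safe way to preview which devices would be updated before committing to a live update.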
After completion of the firmware update with FAS, compute nodes can be prepared. Some compute node types have special preparation steps, but most compute nodes are ready to be used now.
These compute node types require preparation:
The installation of the CSM product requires knowledge of the various nodes and switches for the HPE Cray EX system. The procedures in this section should be referenced during the CSM install for additional information on system hardware, troubleshooting, and administrative tasks related to CSM.
See Troubleshooting Installation Problems.
As an optional post-installation task, encryption of Kubernetes secrets may be enabled. This enables at-rest encryption of data in the etcd database used by Kubernetes.
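Under the hood, secrets encryption at rest is driven by an EncryptionConfiguration consumed by the kube-apiserver. The following is a generic Kubernetes sketch, not the CSM-managed configuration:

```yaml
# Generic Kubernetes example, not the CSM-specific configuration.
apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
  - resources:
      - secrets
    providers:
      - aescbc:
          keys:
            - name: key1
              secret: <base64-encoded 32-byte key>
      - identity: {}
```

Provider order matters: new writes are encrypted with the first provider (aescbc here), while the trailing identity provider allows existing unencrypted secrets to still be read until they are rewritten.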
Warning: This process can take multiple hours, during which Nexus is unavailable.
After the install, it is recommended to take a Nexus export. This is not a required step, but it is highly recommended in order to protect the data in Nexus.
See Nexus Export and Restore Procedure for details.
Once installation of CSM has been completed, additional HPE Cray EX software products can be installed via the Install and Upgrade Framework (IUF).
See the Install or upgrade additional products with IUF procedure to continue with the installation of additional HPE Cray EX software products.
For additional information on the IUF, see Install and Upgrade Framework.