Note: If installing CSM 1.2.0, then in order to avoid performing a patch update later, instead install the latest released CSM 1.2.x version.
Installation of the CSM product stream has many steps in multiple procedures which should be done in a specific order. Information about the HPE Cray EX system and the site is used to prepare the configuration payload. The initial node used to bootstrap the installation process is called the PIT node because the Pre-Install Toolkit (PIT) is installed there.
Once the management network switches have been configured, the other management nodes can be deployed with an operating system and the software to create a Kubernetes cluster utilizing Ceph storage. The CSM services provide essential software infrastructure including the API gateway and many micro-services with REST APIs for managing the system. Once administrative access has been configured, the installation of CSM software can be validated with health checks before doing operational tasks like the checking and updating of firmware on system components or the preparation of compute nodes.
Once the CSM installation has completed, other product streams for the HPE Cray EX system can be installed.
A major feature of CSM 1.2 is the Bifurcated CAN (BICAN). The BICAN is designed to separate administrative network traffic from user network traffic. More information can be found on the BICAN Technical Summary. Review the BICAN summary before continuing with the CSM install. For detailed BICAN documentation, see BICAN Technical Details.
The installation of the Cray System Management (CSM) product requires knowledge of the various nodes and switches for the HPE Cray EX system.
For additional installation-specific troubleshooting information, see Troubleshooting Installation Problems. Some topics also have supplementary troubleshooting sections listed in the CSM Operations index.
The topics in this chapter need to be done as part of an ordered procedure so are shown here with numbered topics.
Validate SHCD
The cabling should be validated between the nodes and the management network switches. The information in the Shasta Cabling Diagram (SHCD) can be used to confirm the cables which physically connect components of the system. Having the data in the SHCD which matches the physical cabling will be needed later in both Prepare configuration payload and Configure management network switches.
See Validate SHCD.
Note: If a reinstall or fresh install of this software release is being done on this system and the management network cabling has already been validated, then skip this step and move to Prepare configuration payload.
Prepare configuration payload
Information gathered from a site survey is needed to feed into the CSM installation process, such as system name, system size, site network information for the Customer Access Network (CAN), site DNS configuration, site NTP configuration, network information for the node used to bootstrap the installation. Much of the information about the system hardware is encapsulated in the SHCD (Shasta Cabling Diagram), which is a spreadsheet prepared by HPE Cray Manufacturing to assemble the components of the system and connect appropriately labeled cables.
Prepare management nodes
Some preparation of the management nodes might be needed before starting an install or reinstall. The preparation includes checking and updating the firmware on the PIT node, quiescing the compute nodes and application nodes, scaling back DHCP on the management nodes, wiping the storage on the management nodes, powering off the management nodes, and possibly powering off the PIT node.
Bootstrap PIT node
The Pre-Install Toolkit (PIT) node needs to be bootstrapped from the LiveCD. There are two media available to bootstrap the PIT node – the RemoteISO or a bootable USB device. The recommended media is the RemoteISO, because it does not require any physical media to prepare. However, remotely mounting an ISO on a BMC does not work smoothly for nodes from all vendors. It is recommended to try the RemoteISO first.
Use one of these procedures to bootstrap the PIT node from the LiveCD:
Using the LiveCD USB method requires a USB 3.0 device with at least 1TB of space to create a bootable LiveCD.
Configure management network switches
Now that the PIT node has been booted with the LiveCD environment and Cray Site Init (CSI) has generated the switch IP addresses, the management network switches can be configured.
See Management Network User Guide.
Note: If a reinstall of this software release is being done on this system and the management network switches have already been configured, then skip this step and move to Collect MAC addresses for NCNs.
Collect MAC addresses for NCNs
Now that the PIT node has been booted with the LiveCD and the management network switches have been configured,
the actual MAC address for the management nodes can be collected. This process will include repetition of some
of the steps done up to this point because csi config init
will need to be run with the proper
MAC addresses.
See Collect MAC Addresses for NCNs.
Note: If a reinstall of this software release is being done on this system and the ncn_metadata.csv
file already had valid MAC addresses for both BMC and node interfaces before csi config init
was run, then
this topic could be skipped and instead move to Deploy management nodes.
Note: If a first time install of this software release is being done on this system and the ncn_metadata.csv
file already had valid MAC addresses for both BMC and node interfaces before csi config init
was run, then
this topic could be skipped and instead move to Deploy management nodes.
Deploy management nodes
Now that the PIT node has been booted with the LiveCD and the management network switches have been configured, the other management nodes can be deployed. This procedure will boot all of the management nodes, initialize Ceph storage on the storage nodes and start the Kubernetes cluster on all of the worker nodes and the master nodes, except for the PIT node. The PIT node will join Kubernetes after it is rebooted later in Deploy final NCN.
Install CSM services
Deployment of management nodes is complete with initialized Ceph storage and a running Kubernetes cluster on all worker and master nodes, except the PIT node. The Nexus repository will be populated with artifacts, containerized CSM services will be installed, and a few other configuration steps will be taken.
See Install CSM Services.
Validate CSM health
Validate the health of the management nodes and all CSM services. The reason to do it now is that if there are any problems detected with the core infrastructure or the nodes, it is easy to rewind the installation to Deploy management nodes, because the PIT node has not yet been rebooted. In addition, rebooting the PIT node and deploying the final NCN successfully requires several CSM services to be working properly, so validating this is important.
See Validate CSM Health.
Deploy final NCN
Now that all CSM services have been installed and the CSM health checks completed, with the possible exception of the User Access Service (UAS)/User Access Instance (UAI) tests, the PIT node can be rebooted to leave the LiveCD environment and assume its intended role as one the Kubernetes master nodes.
See Deploy Final NCN.
Configure administrative access
Now that all of the CSM services have been installed and the final NCN has been deployed, administrative access can be prepared. This may include:
cray
) for administrative commandsValidate CSM health
Now that all management nodes have joined the Kubernetes cluster, CSM services have been installed, and administrative access has been enabled, the health of the management nodes and all CSM services should be validated. There are no exceptions to running the tests – all tests should be run now.
This CSM health validation can also be run at other points during the system lifecycle, such as when replacing a management node, checking the health after a management node has rebooted because of a crash, as part of doing a full system power down or power up, or after other types of system maintenance.
See Validate CSM Health.
Configure Prometheus alert notifications
Now that CSM has been installed and health has been validated, if the system management health monitoring tools and specifically, Prometheus, are found to be useful, email notifications can be configured for specific alerts defined in Prometheus. Prometheus upstream documentation can be leveraged for an Alert Notification Template Reference as well as Notification Template Examples. Currently supported notification types include Slack, Pager Duty, email, or a custom integration via a generic webhook interface.
See Configure Prometheus Email Alert Notifications for an example configuration of an email alert notification for the Postgres replication alerts that are defined on the system.
Update firmware with FAS
Now that all management nodes and CSM services have been validated as healthy, the firmware on other components in the system can be checked and updated. The Firmware Action Service (FAS) communicates with many devices on the system. FAS can be used to update the firmware for all of the devices it communicates with at once, or specific devices can be targeted for a firmware update.
IMPORTANT: Before FAS can be used to update firmware, refer to the
HPE Cray EX System Software Getting Started Guide (S-8000) 22.07
for more information about how to install
the HPE Cray EX HPC Firmware Pack (HFP) product. The installation of HFP will inform FAS of the newest firmware
available. Once FAS is aware that new firmware is available, then see
Update Firmware with FAS.
Prepare compute nodes
After completion of the firmware update with FAS, compute nodes can be prepared. Some compute node types have special preparation steps, but most compute nodes are ready to be used now.
These compute node types require preparation:
Apply security hardening
After preparing compute nodes, and prior to the installation of other product streams, review the security hardening guide.
Next topic
After completion of the firmware update with FAS and the preparation of compute nodes, the CSM product stream has
been fully installed and configured.
Refer to the HPE Cray EX System Software Getting Started Guide (S-8000) 22.07
on the HPE Customer Support Center
for more information on other product streams to be installed and configured after CSM.