In the Pre-installation section of the install, information about the HPE Cray EX system and the site is used to prepare the configuration payload. An initial node called the PIT node is then set up to bootstrap the installation process. It is called the PIT node because the Pre-Install Toolkit is installed there. The management network switches are also configured in this section.
In the Installation section of the install, the other management nodes are deployed with an operating system and the software required to create a Kubernetes cluster utilizing Ceph storage. The CSM services are then deployed in the Kubernetes cluster to provide essential software infrastructure including the API gateway and many micro-services with REST APIs for managing the system. Administrative access is then configured, and the health of the system is validated before proceeding with operational tasks like checking and updating firmware on system components and preparing compute nodes.
The Post-installation section covers tasks which are performed after the main install procedure is completed.
The final section, Installation of additional HPE Cray EX software products describes how to install additional HPE Cray EX software products using the Install and Upgrade Framework (IUF).
The topics in this chapter must be completed as part of an ordered procedure, so they are shown here as numbered topics.
NOTE:
If problems are encountered during the installation, see Troubleshooting installation problems and the Cray System Management (CSM) Administration Guide for assistance.
The following must be verified before starting the Pre-installation procedure:
Ensure that all River node BMCs are reachable and set to DHCP mode. Refer to Set node BMCs to DHCP.
Note: For a bare-metal installation, these are the default settings.
Ensure that the list of management switch IP addresses configured on vlan1 is available; this list must be shared, or a serial console connection to the switches will be required.
Verify that the SHCD document is available with the component names (xnames) of the servers.
Collect the IP addresses of the administrative node, site DNS, gateway, and proxy. Ensure that all these IP addresses are reachable from the administrative node.
Verify access to the administrative node and BMC.
Verify the ability to download the SLE-15-SP4-Full-x86_64 and cm-admin-install-1.8-sles15sp4-x86_64.iso ISO files and the CSM tarball.
Verify that a minimum of 64 GB of storage is available on the administrative node.
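The checklist above can be sketched as a small pre-flight script. This is a minimal illustration, not part of the official procedure: the IP list file name, the BMC variables, and the filesystem path checked are all assumptions to be replaced with site values.

```shell
# Hypothetical pre-flight sketch for the checklist above. File names,
# hostnames, and paths are illustrative placeholders for site values.

# 1. Verify a minimum of 64 GB of storage (here checked on /; adjust the
#    path to the actual install working directory).
avail_gb=$(df -BG --output=avail / | tail -1 | tr -dc '0-9')
if [ "$avail_gb" -ge 64 ]; then
  echo "storage OK: ${avail_gb}G available"
else
  echo "storage LOW: ${avail_gb}G available, 64G required" >&2
fi

# 2. Verify that the collected addresses (BMCs, site DNS, gateway, proxy)
#    are reachable. ip-list.txt is an assumed file, one address per line.
ip_list=${IP_LIST:-ip-list.txt}
if [ -f "$ip_list" ]; then
  while read -r ip; do
    ping -c 1 -W 2 "$ip" >/dev/null 2>&1 \
      && echo "reachable: $ip" \
      || echo "UNREACHABLE: $ip" >&2
  done < "$ip_list"
else
  echo "no $ip_list found; skipping reachability checks"
fi

# 3. Confirm a BMC is set to DHCP (requires ipmitool and credentials;
#    skipped unless BMC_HOST is exported).
if [ -n "${BMC_HOST:-}" ]; then
  ipmitool -I lanplus -H "$BMC_HOST" -U "$BMC_USER" -P "$BMC_PASS" \
    lan print 1 | grep "IP Address Source"
fi
```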
This section will guide the administrator through creating and setting up the Cray Pre-Install Toolkit (PIT).
Fresh installations may start at the Boot installation environment section. Re-installations have additional steps to complete in the Preparing for a re-installation section.
This section will guide the administrator through installing HPCM to generate seed files. The seed files will be used in a later step of the CSM installation.
See Boot Pre-Install Live ISO and Seed Files Generation.
When reinstalling a system, the existing cluster must be wiped and powered down.
See Prepare Management Nodes, then proceed to the Pre-Installation guide.
These steps walk the user through properly setting up a Cray supercomputer for an installation.
See Pre-installation.
See Boot installation environment.
See Import CSM tarball.
See Create system configuration.
IMPORTANT
The River Endpoint Discovery Service (REDS) hardware discovery process, the Power Control Service (PCS)/Redfish Translation Service (RTS) management switch availability monitoring, and the Prometheus SNMP Exporter depend on SNMP. To ensure that these services function correctly, validate the SNMP settings on the system: verify that the management network switches have SNMP enabled, and that the SNMP credentials configured on the switches match the credentials stored in Vault and customizations.yaml.
If SNMP is misconfigured, then REDS hardware discovery, PCS/RTS management switch availability monitoring, and the Prometheus SNMP Exporter may fail to operate correctly. For more information, see Configure SNMP.
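As an illustration of such a validation, the SNMPv3 credentials for a single switch can be spot-checked with the net-snmp tools. This is a sketch only: the variables are placeholders for site values, and the MD5/AES protocol choices are assumptions that must match what is actually configured on the switch and in customizations.yaml.

```shell
# Hypothetical SNMPv3 credential spot check against one management switch.
# SWITCH, SNMP_USER, SNMP_AUTH_PASS, and SNMP_PRIV_PASS are placeholders;
# the auth (MD5) and privacy (AES) protocols are assumptions.
if [ -n "${SWITCH:-}" ]; then
  if snmpwalk -v 3 -l authPriv \
       -u "$SNMP_USER" -a MD5 -A "$SNMP_AUTH_PASS" \
       -x AES -X "$SNMP_PRIV_PASS" \
       "$SWITCH" 1.3.6.1.2.1.1.1 >/dev/null 2>&1; then
    snmp_status=ok
  else
    snmp_status=failed
  fi
else
  snmp_status=skipped
  echo "SWITCH not set; skipping SNMP credential check"
fi
echo "SNMP check: $snmp_status"
```

A successful walk of the system OID subtree confirms both that SNMP is enabled and that the credentials match; an authentication failure points at a credential mismatch between the switch and customizations.yaml.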
If CSM is being installed into an environment that already has a working management network (such as during a reinstall), then validate that the SNMP credentials seeded into customizations.yaml in the previous Create Baseline System Customization step of the install match the SNMP password configured on the management network switches.
If the passwords do not match, then either update customizations.yaml to match the switches, or change the switches to match customizations.yaml. For procedures for either option, see Configure SNMP.
Note: While the CSM Automatic Network Utility (CANU) will typically not overwrite SNMP settings that are manually applied to the management switches, there are certain cases where the SNMP configuration can be overwritten or lost (such as when resetting and reconfiguring a switch from factory defaults). To persist the SNMP settings, see CANU Custom Configuration. CANU custom configuration files are used to persist site management network configurations that are intended to take precedence over configurations generated by CANU.
Create a CANU custom configuration that configures SNMP on the management network switches, using the same credentials that were previously used in the Create Baseline System Customization page of the installation. Use this custom configuration with CANU in the next step of the install.
Store the custom configuration in a version control repository along with other configuration assets from the CSM install.
See Configure SNMP for more information about configuring SNMP in CSM.
At this point external connectivity has been established, and either bare-metal configurations can be installed or new/updated configurations can be applied.
Most installations will require the following three tasks, although this may vary depending on site-specific settings and procedures.
See Management Network User Guide for information on next steps for a variety of network configuration scenarios.
Note that the configuration of the management network is an advanced task that may require the help of a networking subject matter expert.
The first nodes to deploy are the management nodes. These Non-Compute Nodes (NCNs) will host CSM services that are required for deploying the rest of the supercomputer.
NOTE
The PIT node will join Kubernetes after it is rebooted later in Deploy final NCN.
Now that deployment of the management nodes is complete, with Ceph storage initialized and a Kubernetes cluster running on all master and worker nodes except the PIT node, the CSM services can be installed. The Nexus repository will be populated with artifacts, containerized CSM services will be installed, and a few other configuration steps will be performed.
See Install CSM Services.
After installing all of the CSM services, now validate the health of the management nodes and all CSM services. The reason to do it now is that if there are any problems detected with the core infrastructure or the nodes, it is easy to rewind the installation to Deploy management nodes, because the final NCN has not yet been deployed. In addition, deploying the final NCN successfully requires several CSM services to be working properly.
See Validate CSM Health.
Now that all CSM services have been installed and the CSM health checks completed, with the possible exception of the User Access Service (UAS)/User Access Instance (UAI) tests, the PIT has served its purpose and the final NCN can be deployed. The node used for the PIT will be rebooted; it will be the final NCN to be deployed in the CSM install.
See Deploy Final NCN.
Now that all of the CSM services have been installed and the final NCN has been deployed, administrative access can be prepared. This may include:
Initializing the Cray CLI (cray) for administrative commands
See Configure Administrative Access.
Upgrade Ceph and enable Smartmon metrics on storage NCNs
IMPORTANT If performing a fresh install of CSM 1.4.0 or 1.4.1, then skip this step. This step should only be done during installs of CSM 1.4 patch version 1.4.2 or later.
Now that all management nodes have joined the Kubernetes cluster, Ceph should be upgraded and Smartmon metrics should be enabled on storage NCNs.
See Upgrade Ceph and enable Smartmon metrics on storage nodes.
Now that all management nodes have joined the Kubernetes cluster, CSM services have been installed, and administrative access has been enabled, the health of the management nodes and all CSM services should be validated. There are no exceptions to running the tests; all tests should be run now.
This CSM health validation can also be run at other points during the system lifecycle, such as when replacing a management node, checking the health after a management node has rebooted because of a crash, as part of doing a full system power down or power up, or after other types of system maintenance.
See Validate CSM Health.
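A quick spot check can complement (not replace) the full Validate CSM Health procedure. The following sketch assumes kubectl and ceph are on the PATH of the node where it runs; both commands are read-only.

```shell
# Hypothetical quick health spot check; the Validate CSM Health procedure
# remains authoritative.
if command -v kubectl >/dev/null 2>&1; then
  kubectl get nodes -o wide    # all NCNs should report Ready
fi
if command -v ceph >/dev/null 2>&1; then
  ceph -s                      # expect HEALTH_OK in the health field
fi
health_sketch=done
echo "quick checks attempted: $health_sketch"
```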
Now that CSM has been installed and health has been validated, if the system management health monitoring tools (specifically Prometheus) are found to be useful, then email notifications can be configured for specific alerts defined in Prometheus. Prometheus upstream documentation can be leveraged for an Alert Notification Template Reference as well as Notification Template Examples. Currently supported notification types include Slack, Pager Duty, email, or a custom integration via a generic webhook interface.
See Configure Prometheus Email Alert Notifications for an example configuration of an email alert notification for the Postgres replication alerts that are defined on the system.
Now that all management nodes and CSM services have been validated as healthy, the firmware on other components in the system can be checked and updated. The Firmware Action Service (FAS) communicates with many devices on the system. FAS can be used to update the firmware for all of the devices it communicates with at once, or specific devices can be targeted for a firmware update.
After completion of the firmware update with FAS, compute nodes can be prepared. Some compute node types have special preparation steps, but most compute nodes are ready to be used now.
These compute node types require preparation:
The installation of the Cray System Management (CSM) product requires knowledge of the various nodes and switches for the HPE Cray EX system. The procedures in this section should be referenced during the CSM install for additional information on system hardware, troubleshooting, and administrative tasks related to CSM.
See Troubleshooting Installation Problems.
As an optional post-installation task, encryption of Kubernetes secrets may be enabled. This enables at-rest encryption of data in the etcd database used by Kubernetes.
Warning: This process can take multiple hours, during which Nexus is unavailable, and should be done during a scheduled maintenance period.
Prior to the upgrade, it is recommended that a Nexus export be taken. This is not a required step, but it is highly recommended in order to protect the data in Nexus. If there is no maintenance period available, then this step should be skipped until after the upgrade process.
See Nexus Export and Restore Procedure for details.
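For background, upstream Kubernetes configures at-rest encryption of secrets with an EncryptionConfiguration file passed to the API server. The CSM procedure manages this for the cluster, so the following generic sketch is illustrative only; the key name and key material are placeholders.

```yaml
# Generic upstream Kubernetes shape; the CSM procedure manages this itself.
apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
  - resources:
      - secrets
    providers:
      - aescbc:
          keys:
            - name: key1                             # placeholder key name
              secret: <base64-encoded 32-byte key>   # placeholder
      - identity: {}    # fallback so existing unencrypted data stays readable
```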
Once installation of CSM has been completed, additional HPE Cray EX software products can be installed via the Install and Upgrade Framework (IUF).
See the Install or upgrade additional products with IUF procedure to continue with the installation of additional HPE Cray EX software products.
For additional information on the IUF, see Install and Upgrade Framework.