Install CSM

Installation of the CSM product stream has many steps in multiple procedures which should be done in a specific order. Information about the HPE Cray EX system and the site is used to prepare the configuration payload. The initial node used to bootstrap the installation process is called the PIT node because the Pre-Install Toolkit is installed there. Once the management network switches have been configured, the other management nodes can be deployed with an operating system and the software to create a Kubernetes cluster utilizing Ceph storage. The CSM services provide essential software infrastructure including the API gateway and many micro-services with REST APIs for managing the system. Once administrative access has been configured, the installation of CSM software and nodes can be validated with health checks before doing operational tasks like the check and update of firmware on system components or the preparation of compute nodes. Once the CSM installation has completed, other product streams for the HPE Cray EX system can be installed.

Topics

  1. Validate management network cabling
  2. Prepare configuration payload
  3. Prepare management nodes
  4. Bootstrap PIT node
  5. Configure management network switches
  6. Collect MAC addresses for NCNs
  7. Deploy management nodes
  8. Install CSM services
  9. Validate CSM health
  10. Redeploy PIT node
  11. Configure administrative access
  12. Validate CSM health
  13. Configure Prometheus alert notifications
  14. Update firmware with FAS
  15. Prepare compute nodes
  16. Apply security hardening
  17. Next topic

The topics in this chapter must be done in order as parts of a single procedure, so they are numbered here.
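The ordering and the skip conditions given in the notes under Details can be sketched as a small planner. This is only an illustration of the ordering logic, not part of the CSM tooling: the `SiteState` flags and the `plan` function are hypothetical names, and the decision to skip a topic still rests with the installer.

```python
from __future__ import annotations

from dataclasses import dataclass

# Topic titles copied from the numbered list above ("Next topic" omitted).
TOPICS = [
    "Validate management network cabling",
    "Prepare configuration payload",
    "Prepare management nodes",
    "Bootstrap PIT node",
    "Configure management network switches",
    "Collect MAC addresses for NCNs",
    "Deploy management nodes",
    "Install CSM services",
    "Validate CSM health",
    "Redeploy PIT node",
    "Configure administrative access",
    "Validate CSM health",
    "Configure Prometheus alert notifications",
    "Update firmware with FAS",
    "Prepare compute nodes",
    "Apply security hardening",
]


@dataclass
class SiteState:
    """Hypothetical flags describing what is already true for this system."""
    reinstall: bool = False
    cabling_validated: bool = False
    switches_configured: bool = False
    ncn_macs_valid: bool = False


def plan(state: SiteState) -> list[tuple[int, str]]:
    """Return (topic number, title) pairs in order, minus skippable topics."""
    skip = set()
    if state.cabling_validated:
        skip.add(1)  # cabling note applies to reinstalls and fresh installs
    if state.reinstall and state.switches_configured:
        skip.add(5)  # switch note applies to reinstalls only
    if state.ncn_macs_valid:
        skip.add(6)  # MAC note applies to reinstalls and first-time installs
    return [(n, t) for n, t in enumerate(TOPICS, start=1) if n not in skip]


if __name__ == "__main__":
    for number, title in plan(SiteState(reinstall=True, switches_configured=True)):
        print(f"{number:2d}. {title}")
```

Keeping the original topic numbers in the output, rather than renumbering, makes it easier to cross-reference the Details section.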

Troubleshooting installation problems

The installation of the Cray System Management (CSM) product requires knowledge of the various nodes and switches for the HPE Cray EX system.

For additional installation-specific troubleshooting information, see Troubleshooting Installation Problems. Some topics also have supplementary troubleshooting sections listed in the CSM Operations index.

Details

  1. Validate management network cabling

    The cabling between the nodes and the management network switches should be validated. The information in the Shasta Cabling Diagram (SHCD) can be used to confirm the cables which physically connect components of the system. SHCD data that matches the physical cabling will be needed later in both Prepare configuration payload and Configure management network switches.

    See Validate Management Network Cabling.

    Note: If a reinstall or fresh install of this software release is being done on this system and the management network cabling has already been validated, then skip this step and move to Prepare configuration payload.

  2. Prepare configuration payload

    Information gathered from a site survey feeds into the CSM installation process. This includes the system name, system size, site network information for the CAN, site DNS configuration, site NTP configuration, and network information for the node used to bootstrap the installation. Much of the information about the system hardware is encapsulated in the Shasta Cabling Diagram (SHCD), which is a spreadsheet prepared by HPE Cray Manufacturing to assemble the components of the system and connect appropriately labeled cables.

    See Prepare Configuration Payload.
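As a way to track whether the site survey is complete before starting, the items listed above can be gathered in one record. The sketch below is a hypothetical checklist only: the class name, field names, and value types are assumptions made for illustration and do not correspond to the actual input format used by the installation tools.

```python
from __future__ import annotations

from dataclasses import dataclass, field, fields


@dataclass
class SiteSurvey:
    """Hypothetical checklist of the survey items named in this topic."""
    system_name: str = ""
    system_size: str = ""
    can_network: str = ""        # site network information for the CAN
    dns_servers: list[str] = field(default_factory=list)
    ntp_servers: list[str] = field(default_factory=list)
    pit_node_network: str = ""   # network information for the bootstrap node
    shcd_path: str = ""          # location of the SHCD spreadsheet


def missing_items(survey: SiteSurvey) -> list[str]:
    """Return the names of survey items that are still empty."""
    return [f.name for f in fields(survey) if not getattr(survey, f.name)]


if __name__ == "__main__":
    survey = SiteSurvey(system_name="example", dns_servers=["192.0.2.53"])
    print("Still needed:", ", ".join(missing_items(survey)))
```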

  3. Prepare management nodes

    Some preparation of the management nodes might be needed before starting an install or reinstall. The preparation includes checking and updating the firmware on the PIT node, quiescing the compute nodes and application nodes, scaling back DHCP on the management nodes, wiping the storage on the management nodes, powering off the management nodes, and possibly powering off the PIT node.

    See Prepare Management Nodes.

  4. Bootstrap PIT node

    The Pre-Install Toolkit (PIT) node needs to be bootstrapped from the LiveCD. Two media are available to bootstrap the PIT node: the RemoteISO or a bootable USB device. The recommended media is the RemoteISO, because it does not require any physical media to prepare. However, remotely mounting an ISO on a BMC does not work smoothly for nodes from all vendors. Try the RemoteISO first.

    Use the procedure for either the RemoteISO or the bootable USB device to bootstrap the PIT node from the LiveCD.

    The LiveCD USB method requires a USB 3.0 device with at least 1 TB of space to create a bootable LiveCD.
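The media choice described above amounts to a simple preference order with one hardware check. The following is a minimal sketch under stated assumptions: the function names are hypothetical, 1 TB is read as 10^12 bytes, and any USB version of 3.0 or later is treated as acceptable. How to determine whether the RemoteISO works on a given BMC is outside the scope of this sketch.

```python
from __future__ import annotations

USB_MIN_BYTES = 10**12      # assumption: 1 TB read as 10^12 bytes
USB_MIN_VERSION = (3, 0)    # assumption: USB 3.0 or later is acceptable


def usb_device_ok(capacity_bytes: int, usb_version: tuple[int, int]) -> bool:
    """Check a USB device against the size and version requirement."""
    return usb_version >= USB_MIN_VERSION and capacity_bytes >= USB_MIN_BYTES


def choose_media(remote_iso_works: bool,
                 usb_capacity_bytes: int = 0,
                 usb_version: tuple[int, int] = (0, 0)) -> str:
    """Prefer the RemoteISO; otherwise use USB if the device qualifies."""
    if remote_iso_works:
        return "RemoteISO"
    if usb_device_ok(usb_capacity_bytes, usb_version):
        return "USB"
    raise ValueError("RemoteISO unavailable and USB device does not meet the requirements")


if __name__ == "__main__":
    print(choose_media(False, usb_capacity_bytes=2 * 10**12, usb_version=(3, 0)))
```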

  5. Configure management network switches

    Now that the PIT node has been booted with the LiveCD environment and CSI has generated the switch IP addresses, the management network switches can be configured. This procedure will configure the spine switches, aggregation switches (if present), CDU switches (if present), and the leaf switches.

    See Configure Management Network Switches.

    Note: If a reinstall of this software release is being done on this system and the management network switches have already been configured, then skip this step and move to Collect MAC addresses for NCNs.

  6. Collect MAC addresses for NCNs

    Now that the PIT node has been booted with the LiveCD and the management network switches have been configured, the actual MAC addresses for the management nodes can be collected. This process repeats some of the steps done up to this point, because csi config init will need to be run with the proper MAC addresses.

    See Collect MAC Addresses for NCNs.

    Note: If a reinstall or first-time install of this software release is being done on this system and the ncn_metadata.csv file already had valid MAC addresses for both BMC and node interfaces before csi config init was run, then this topic can be skipped. Move to Deploy management nodes instead.
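Whether a metadata file "already had valid MAC addresses" can be checked mechanically before deciding to skip this topic. The sketch below assumes a CSV with a header row in which every MAC column has "MAC" in its name; that layout, the sample headers, and the function name are assumptions for illustration, so consult Collect MAC Addresses for NCNs for the actual ncn_metadata.csv format.

```python
from __future__ import annotations

import csv
import io
import re

# Six colon-separated hexadecimal octets, for example aa:bb:cc:dd:ee:ff.
MAC_RE = re.compile(r"^[0-9A-Fa-f]{2}(:[0-9A-Fa-f]{2}){5}$")


def rows_with_invalid_macs(csv_text: str) -> list[int]:
    """Return 1-based data-row numbers that have a missing or malformed MAC."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # Assumption: any column whose header contains "MAC" holds a MAC address.
    mac_columns = [c for c in (reader.fieldnames or []) if "mac" in c.lower()]
    if not mac_columns:
        raise ValueError("no MAC columns found in header")
    bad = []
    for number, row in enumerate(reader, start=1):
        if any(not MAC_RE.match((row[c] or "").strip()) for c in mac_columns):
            bad.append(number)
    return bad


if __name__ == "__main__":
    # Hypothetical sample data, not the real file layout.
    sample = (
        "Name,BMC MAC,Node MAC\n"
        "ncn-a,aa:bb:cc:dd:ee:01,aa:bb:cc:dd:ee:02\n"
        "ncn-b,,aa:bb:cc:dd:ee:04\n"
    )
    print("Rows needing attention:", rows_with_invalid_macs(sample))
```

An empty result suggests the file is a candidate for skipping this topic. The check covers address format only and cannot confirm that an address belongs to the right node.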

  7. Deploy management nodes

    Now that the PIT node has been booted with the LiveCD and the management network switches have been configured, the other management nodes can be deployed. This procedure will boot all of the management nodes, initialize Ceph storage on the storage nodes, and start the Kubernetes cluster on all of the worker nodes and the master nodes, except for the PIT node. The PIT node will join Kubernetes after it is rebooted later in Redeploy PIT node.

    See Deploy Management Nodes.

  8. Install CSM services

    At this point, deployment of the management nodes is complete, with initialized Ceph storage and a running Kubernetes cluster on all worker and master nodes except the PIT node. In this procedure, the Nexus repository will be populated with artifacts, containerized CSM services will be installed, and a few other configuration steps will be taken.

    See Install CSM Services.

  9. Validate CSM health

    Validate the health of the management nodes and all CSM services. The reason to do it now is that if there are any problems detected with the core infrastructure or the nodes, it is easy to rewind the installation to Deploy management nodes, because the PIT node has not yet been rebooted. In addition, rebooting the PIT node and deploying the final NCN successfully requires several CSM services to be working properly, so validating this is important.

    Note: At this point of the install, the cray CLI has not yet been configured. Some of the tests (Hardware State Manager Discovery Validation, Booting the CSM Barebones Image on compute nodes, UAS/UAI) require it to be configured in order to run. These tests may be skipped until after the PIT node has been redeployed, but this is not recommended.

    To enable the cray CLI in order to execute those tests, follow these two procedures before performing the CSM health validation:

    1. Configure Keycloak Account
    2. Configure the cray Command Line Interface (CLI)

    To run the CSM health checks, see Validate CSM Health.
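The choice described in the note above, between running everything now and deferring the tests that need the cray CLI, can be sketched as a simple partition. The test names are taken from the note; the function name and the example of a CLI-independent test are hypothetical, and this does not reflect how the actual health check tooling is invoked.

```python
from __future__ import annotations

# Tests named in the note above as requiring a configured cray CLI.
CLI_DEPENDENT_TESTS = (
    "Hardware State Manager Discovery Validation",
    "Booting the CSM Barebones Image on compute nodes",
    "UAS/UAI",
)


def split_tests(all_tests: list[str],
                cray_cli_configured: bool) -> tuple[list[str], list[str]]:
    """Return (tests to run now, tests deferred until the CLI is configured)."""
    run_now, deferred = [], []
    for test in all_tests:
        if test in CLI_DEPENDENT_TESTS and not cray_cli_configured:
            deferred.append(test)
        else:
            run_now.append(test)
    return run_now, deferred


if __name__ == "__main__":
    # "Example platform check" is a hypothetical CLI-independent test name.
    tests = ["Example platform check", *CLI_DEPENDENT_TESTS]
    run_now, deferred = split_tests(tests, cray_cli_configured=False)
    print("Run now:", run_now)
    print("Deferred:", deferred)
```

Configuring the CLI first leaves the deferred list empty, which matches the recommendation not to skip these tests.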

  10. Redeploy PIT node

    Now that all CSM services have been installed and the CSM health checks completed, with the possible exception of Booting the CSM Barebones Image and the UAS/UAI tests, the PIT node can be rebooted to leave the LiveCD environment and assume its intended role as one of the Kubernetes master nodes.

    See Redeploy PIT Node.

  11. Configure administrative access

    Now that all of the CSM services have been installed and the PIT node has been redeployed, administrative access can be prepared. This may include configuring Keycloak with a local Keycloak account or confirming Keycloak is properly federating LDAP or another Identity Provider (IdP), initializing the Cray command line interface for administrative commands, locking the management nodes from accidental actions such as firmware updates by FAS or power actions by CAPMC, configuring the CSM layer of configuration by CFS in NCN personalization, and configuring the node BMCs (node controllers) for nodes in liquid cooled cabinets.

    See Configure Administrative Access.

  12. Validate CSM health

    Now that all management nodes have joined the Kubernetes cluster, CSM services have been installed, and administrative access has been enabled, the health of the management nodes and all CSM services should be validated. There are no exceptions to running the tests; all can be run now.

    This CSM health validation can also be run at other points during the system lifecycle, such as when replacing a management node, checking the health after a management node has rebooted because of a crash, as part of doing a full system power down or power up, or after other types of system maintenance.

    See Validate CSM Health.

  13. Configure Prometheus alert notifications

    Now that CSM has been installed and health has been validated, email notifications can be configured for specific alerts defined in Prometheus, if the system management health monitoring tools, and Prometheus in particular, are found to be useful. Prometheus upstream documentation can be leveraged for an Alert Notification Template Reference as well as Notification Template Examples. Currently supported notification types include Slack, PagerDuty, email, and a custom integration via a generic webhook interface.

    See Configure Prometheus Email Alert Notifications for an example configuration of an email alert notification for the Postgres replication alerts that are defined on the system.

  14. Update firmware with FAS

    Now that all management nodes and CSM services have been validated as healthy, the firmware on other components in the system can be checked and updated. The Firmware Action Service (FAS) communicates with many devices on the system. FAS can be used to update the firmware for all of the devices it communicates with at once, or specific devices can be targeted for a firmware update.

    IMPORTANT: Before FAS can be used to update firmware, refer to the 1.5 HPE Cray EX System Software Getting Started Guide S-8000 on the HPE Customer Support Center for information about how to install the HPE Cray EX HPC Firmware Pack (HFP) product. The installation of HFP will inform FAS of the newest firmware available. Once FAS is aware that new firmware is available, then see Update Firmware with FAS.

  15. Prepare compute nodes

    After completion of the firmware update with FAS, compute nodes can be prepared. Some compute node types have special preparation steps, but most compute nodes are ready to be used now.

    These compute node types require preparation:

    • HPE Apollo 6500 XL645D Gen10 Plus
    • Gigabyte

    See Prepare Compute Nodes.

  16. Apply security hardening

    After preparing compute nodes, and prior to the installation of other product streams, review the security hardening guide.

    See Security Hardening.

  17. Next topic

    After completion of the firmware update with FAS and the preparation of compute nodes, the CSM product stream has been fully installed and configured. Refer to the 1.5 HPE Cray EX System Software Getting Started Guide S-8000 on the HPE Customer Support Center for more information on other product streams to be installed and configured after CSM.