Cray System Management (CSM) Administration Guide

The Cray System Management (CSM) operational activities are administrative procedures required to operate an HPE Cray EX system with CSM software installed.

The following administrative topics can be found in this guide:

CSM product management
Bare-metal
Image management
Boot orchestration
System power off procedures
System power on procedures
Power management
Artifact management
Compute rolling upgrades
Configuration management
Kubernetes
Package repository management
Security and authentication
Resiliency
ConMan
Utility storage
System management health
System Layout Service (SLS)
System configuration service
Hardware State Manager (HSM)
Hardware Management (HM) collector
HPE Power Distribution Unit (PDU)
Node management
Network
- Management network
- Customer accessible networks (CMN/CAN/CHN)
- Dynamic Host Configuration Protocol (DHCP)
- Domain Name Service (DNS)
- External DNS
- MetalLB in BGP-mode
Spire
Update firmware with FAS
User Access Service (UAS)
System Admin Toolkit (SAT)
Install and Upgrade Framework (IUF)
Backup and recovery
Multi-tenancy

CSM product management

Important procedures for configuring, managing, and validating the CSM environment.

Bare-metal

General information on what needs to be done before the initial install of CSM.

Image management

Build and customize image recipes with the Image Management Service (IMS).

Boot orchestration

Use the Boot Orchestration Service (BOS) to boot, configure, and shut down collections of nodes.

System power off procedures

Procedures required for a full power off of an HPE Cray EX system.

System Power Off Procedures

Additional links to power off sub-procedures provided for reference. Refer to the main procedure linked above before using any of these sub-procedures:

Prepare the System for Power Off
Shut Down and Power Off Compute and User Access Nodes
Save Management Network Switch Configuration Settings
Power Off Compute Cabinets
- Power Off Compute Cabinets using CAPMC
- Power Off Compute Cabinets using PCS
Shut Down and Power Off the Management Kubernetes Cluster
Power Off the External Lustre File System

System power on procedures

Procedures required for a full power on of an HPE Cray EX system.

System Power On Procedures

Additional links to power on sub-procedures provided for reference. Refer to the main procedure linked above before using any of these sub-procedures:

Power On and Start the Management Kubernetes Cluster
Power On Compute Cabinets
- Power On Compute Cabinets using CAPMC
- Power On Compute Cabinets using PCS
Power On the External Lustre File System
Power On and Boot Compute and User Access Nodes
Recover from a Liquid Cooled Cabinet EPO Event
- Recover from a Liquid Cooled Cabinet EPO Event using CAPMC
- Recover from a Liquid Cooled Cabinet EPO Event using PCS

Power management

HPE Cray System Management (CSM) software manages and controls power out-of-band through Redfish APIs.

Artifact management

Use the Ceph Object Gateway Simple Storage Service (S3) API to manage artifacts on the system.

Compute rolling upgrades

Upgrade sets of compute nodes with the Compute Rolling Upgrade Service (CRUS) without requiring an entire set of nodes to be out of service at once. CRUS enables administrators to limit the production impact of upgrading compute nodes by working through one step of the upgrade process at a time.
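The idea of upgrading a set of nodes without taking all of them out of service at once can be sketched in a few lines of Python. This is a conceptual illustration only, not the CRUS implementation or its API: the function name `rolling_batches`, the `max_out_of_service` parameter, and the node names are all hypothetical.

```python
from typing import Iterator, List


def rolling_batches(nodes: List[str], max_out_of_service: int) -> Iterator[List[str]]:
    """Yield successive groups of nodes, each no larger than max_out_of_service."""
    if max_out_of_service < 1:
        raise ValueError("max_out_of_service must be at least 1")
    for start in range(0, len(nodes), max_out_of_service):
        yield nodes[start:start + max_out_of_service]


# Hypothetical node names, for illustration only.
nodes = [f"node-{i}" for i in range(1, 8)]
batches = list(rolling_batches(nodes, max_out_of_service=3))

for step, batch in enumerate(batches, start=1):
    # Only the current batch is out of service; every other node keeps running.
    in_service = len(nodes) - len(batch)
    print(f"step {step}: upgrading {batch}; {in_service} nodes remain in service")
```

Each step touches at most three nodes, so the rest of the set stays available to production work while the upgrade proceeds.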

NOTES

CRUS was deprecated in CSM 1.2.0 and it will be removed in CSM 1.5.0. See Deprecated Features.

The CRUS subcommands are mistakenly absent from the Cray CLI in CSM 1.4.0. See CRUS Subcommands Missing From Cray CLI.

Configuration management

The Configuration Framework Service (CFS) is available on systems for remote execution and configuration management of nodes and boot images.

Kubernetes

The system management components are broken down into a series of micro-services. Each service is independently deployable, fine-grained, and uses lightweight protocols. As a result, the system’s micro-services are modular, resilient, and can be updated independently. Services within the Kubernetes architecture communicate using REST APIs.

Package repository management

Repositories are added to systems to extend the system functionality beyond what is initially delivered. The Sonatype Nexus Repository Manager is the primary method for repository management. Nexus hosts the Yum, Docker, raw, and Helm repositories for software and firmware content.

Security and authentication

Mechanisms used by the system to ensure the security and authentication of internal and external requests.

Resiliency

HPE Cray EX systems are designed so that system management services (SMS) are fully resilient and that there is no single point of failure.

ConMan

ConMan is a tool used for connecting to remote consoles and collecting console logs. These node logs can then be used for various administrative purposes, such as troubleshooting node boot issues.

Utility storage

Ceph is the utility storage platform that is used to enable pods to store persistent data. It is deployed to provide block, object, and file storage to the management services running on Kubernetes, as well as for telemetry data coming from the compute nodes.

System management health

Tools and procedures that enable system administrators to assess the health of their system. Operators need to quickly and efficiently troubleshoot system issues as they occur and be confident that a lack of issues indicates the system is operating normally.

System Layout Service (SLS)

The System Layout Service (SLS) holds information about the system design, such as the physical locations of network hardware, compute nodes, and cabinets. It also stores information about the network, such as which port on which switch should be connected to each compute node.

System configuration service

The System Configuration Service (SCSD) allows administrators to set various BMC and controller parameters. These parameters are typically set during discovery, but this tool enables parameters to be set before or after discovery. The operations to change these parameters are available in the Cray CLI under the scsd command.

Hardware State Manager (HSM)

The Hardware State Manager (HSM) monitors and interrogates hardware components in the HPE Cray EX system. It tracks hardware state and inventory information and makes this information available through REST queries and, when changes occur, through message bus events.

Hardware Management (HM) collector

The Hardware Management (HM) Collector is used to collect telemetry and Redfish events from hardware in the system.

Adjust HM Collector resource limits and requests

HPE Power Distribution Unit (PDU)

Procedures for managing and setting up HPE PDUs.

HPE PDU Admin Procedure

Node management

Monitor and manage compute nodes (CNs) and non-compute nodes (NCNs) used in the HPE Cray EX system.

Network

Overview of the different networks supported by the HPE Cray EX system.

Management network

HPE Cray EX systems can have network switches in many roles: spine switches, leaf switches, LeafBMC switches, and CDU switches. Newer systems have HPE Aruba switches, while older systems have Dell and Mellanox switches. Switch IP addresses are generated by Cray Site Init (CSI).

Customer accessible networks (CMN/CAN/CHN)

The customer accessible networks (CMN/CAN/CHN) provide access from outside the customer network to services, NCNs, and User Access Nodes (UANs) in the system.

Dynamic Host Configuration Protocol (DHCP)

The DHCP service on the HPE Cray EX system uses the Internet Systems Consortium (ISC) Kea tool. Kea provides more robust management capabilities for DHCP servers.

Domain Name Service (DNS)

The central DNS infrastructure provides the structural networking hierarchy and datastore for the system.

External DNS

External DNS, along with the Customer Management Network (CMN), Border Gateway Protocol (BGP), and MetalLB, makes it simpler to access the HPE Cray EX API and system management services. Services are accessible directly from a laptop without needing to tunnel into a non-compute node (NCN) or override /etc/hosts settings.

MetalLB in BGP-mode

MetalLB is a component in Kubernetes that manages access to LoadBalancer services from outside the Kubernetes cluster. There are LoadBalancer services on the Node Management Network (NMN), Hardware Management Network (HMN), and Customer Access Network (CAN).

MetalLB can run in either Layer2-mode or BGP-mode for each address pool it manages. BGP-mode is used for the NMN, HMN, and CAN. This enables true load balancing (Layer2-mode does failover, not load balancing) and allows for a more robust layer 3 configuration for these networks.
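The difference between failover and load balancing described above can be illustrated with a small simulation. This is a simplified sketch under stated assumptions, not MetalLB code: the node names and flow identifiers are hypothetical, and the hash-based selection only stands in for whatever multipath hashing the upstream router actually performs in BGP-mode.

```python
import hashlib
from collections import Counter
from typing import List


def layer2_next_hop(nodes: List[str], flow: str) -> str:
    """Layer2-mode: a single node answers for the service IP, so every flow lands on it."""
    # Another node takes over only if this one fails (failover, not load balancing).
    return nodes[0]


def bgp_next_hop(nodes: List[str], flow: str) -> str:
    """BGP-mode: the router learns a route per node and hashes each flow onto one of them."""
    digest = hashlib.sha256(flow.encode()).digest()
    return nodes[int.from_bytes(digest[:4], "big") % len(nodes)]


# Hypothetical node names and client flows, for illustration only.
nodes = ["worker-a", "worker-b", "worker-c"]
flows = [f"client-{i}:5000" for i in range(200)]

layer2_counts = Counter(layer2_next_hop(nodes, f) for f in flows)
bgp_counts = Counter(bgp_next_hop(nodes, f) for f in flows)

print("Layer2-mode:", dict(layer2_counts))
print("BGP-mode:   ", dict(bgp_counts))
```

In the Layer2-mode tally a single node receives every flow, while in the BGP-mode tally the flows are spread across the nodes.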

Spire

Spire provides the ability to authenticate nodes and workloads, and to securely distribute and manage their identities along with the credentials associated with them.

Update firmware with FAS

The Firmware Action Service (FAS) provides an interface for managing firmware versions of Redfish-enabled hardware in the system. FAS interacts with the Hardware State Manager (HSM), device data, and image data in order to update firmware.

See Update Firmware with FAS for a list of components that are upgradable with FAS. Refer to the HPC Firmware Pack (HFP) product stream to update firmware on other components.

User Access Service (UAS)

The User Access Service (UAS) is a containerized service managed by Kubernetes that enables application developers to create and run user applications. Users launch a User Access Instance (UAI) using the cray command. Users can also transfer data between the Cray system and external systems using the UAI.

System Admin Toolkit (SAT)

The System Admin Toolkit (SAT) is a command-line interface that can assist administrators with common tasks, such as troubleshooting and querying information about the HPE Cray EX system, booting and shutting down the system, and replacing hardware components. In CSM 1.3 and newer, the sat command is available on the Kubernetes NCNs without installing the SAT product stream.

System Admin Toolkit in CSM

Install and Upgrade Framework (IUF)

The Install and Upgrade Framework (IUF) provides a CLI and an API that automate the operations required to install, upgrade, and deploy non-CSM product content onto an HPE Cray EX system. Each product distribution includes an iuf-product-manifest.yaml file, which IUF uses to determine what operations are needed to install, upgrade, and deploy the product.

Backup and recovery

Information on how to perform backups of individual services or the entire system, and how to restore from these backups.

System Recovery
etcd
Postgres
- Restore Postgres
- Disaster Recovery for Postgres
Nexus
- Nexus Export and Restore
- Nexus Service Recovery
Keycloak
- Create a Backup of the Keycloak Postgres Database
- Keycloak Service Recovery
Vault
- Backup and Restore Vault Clusters
- Vault Service Recovery
SLS
HSM
Spire
Version Control Service (VCS)
- Backup and Restore VCS Data
Boot Orchestration Service (BOS)
- Exporting and Importing BOS Data
Boot Script Service (BSS)
- Exporting and Importing BSS Data
Configuration Management Service (CFS)
- Exporting and Importing CFS Data
Image Management Service (IMS)
- Exporting and Importing IMS Data
Workload managers
- PBS Service Recovery
- Slurm Service Recovery