Glossary of terms used in CSM documentation.
A component used by the Configuration Framework Service (CFS) to execute Ansible code from its configuration layers.
For more information, see Ansible Execution Environments.
An application node (AN) is an NCN which is not providing management functions for the HPE Cray EX system. The AN is not part of the Kubernetes cluster to which management nodes belong. One special type of AN is the User Access Node (UAN), but different systems may have need for other types of ANs, such as:
For more information, see ARA Records Ansible (ARA).
Air-Cooled cabinet COTS servers that include a Redfish-enabled baseboard management controller (BMC) and REST endpoint for API control and management. Either IPMI commands or REST API calls can be used to manage a BMC.
The Bifurcated Customer Access Network (BICAN), introduced in CSM 1.2, is a major feature of CSM. The BICAN is designed to separate administrative network traffic from user network traffic.
For more information, see:
The Slingshot blade switch embedded controller (sC) provides a hardware management REST endpoint to monitor environmental conditions and manage the blade power, switch ASIC, FPGA buffer/interfaces, and firmware.
The Boot Orchestration Service (BOS) is responsible for booting, configuring, and shutting down collections of nodes. This is accomplished using BOS components, such as boot orchestration session templates and sessions. BOS uses other services which provide boot artifact configuration (BSS), power control (PCS), node status (HSM), and configuration (CFS).
The Boot Script Service stores the configuration information that is used to boot each hardware component. Nodes consult BSS for their boot artifacts and boot parameters when nodes boot or reboot.
For more information on the BSS API, see BSS API.
A cabinet cooling group is a group of Olympus cabinets that are connected to a floor-standing Coolant Distribution Unit (CDU). Management network CDU switches in the CDU aggregate all the Node Management Network (NMN) and Hardware Management Network (HMN) connections for the cabinet group.
The Liquid-Cooled Olympus Cabinet Environmental Controller (CEC) sets the cabinet’s geolocation, monitors environmental sensors, and communicates status to the Coolant Distribution Unit (CDU). The CEC microcontroller (eC) signals the CDU to start liquid cooling and then enables the DC rectifiers so that a chassis can be powered on. The CEC does not provide a REST endpoint on SMNet; it simply provides the cabinet environmental and CDU status to the CMM for evaluation or action, and takes no action itself. The CEC firmware is flashed automatically when the CMM firmware is flashed. If there are momentary erroneous signals because of a CEC reset or cable disconnection, the system can ride through these events without issuing an emergency power off (EPO).
The CEC microcontroller (eC) sets the cabinet’s geolocation, monitors the cabinet environmental sensors, and communicates cabinet status to the Coolant Distribution Unit (CDU). The eC does not provide a REST endpoint on SMNet as do other embedded controllers, but simply monitors the cabinet sensors and provides the cabinet environmental and CDU status to the CMMs for evaluation and/or action.
The cabinet chassis management module (CMM) provides a REST endpoint via its chassis controller (cC). The CMM is an embedded controller that monitors and controls all the blades in a chassis. Each chassis supports 8 compute blades and 8 switches and associated rectifiers/PSUs in the rectifier shelf.
Power considerations: Two CMMs in adjacent chassis share power from the rectifier shelf (a shelf connects two adjacent chassis: 0 and 1, 2 and 3, 4 and 5, 6 and 7). If both CMMs sharing shelf power are enabling the rectifiers, one of the CMMs can be removed (but only one at a time) without the rectifier shelf powering off. Removing a CMM will shut down all compute blades and switches in the chassis.
Cooling considerations: Any single CMM in any cabinet can enable Coolant Distribution Unit (CDU) cooling. Note that the CDU “enable path” has vertical control, which means CMMs 0, 2, 4, and 6 and CEC0 are in one path (half of the cabinet), and CMMs 1, 3, 5, and 7 and CEC1 are in another path. Any CMM or CEC in the same half-cabinet path can be removed and CDU cooling will stay enabled as long as the other CMMs/CEC enable CDU cooling.
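The chassis pairing and cooling paths described for the CMM can be written down as two small lookups. This is only an illustrative sketch: the function names are hypothetical and are not part of any CSM tool or API, and it assumes the chassis are numbered 0 through 7 as in the text above.

```python
def _check_chassis(chassis: int) -> None:
    """Reject chassis numbers outside the 0..7 range used in the text."""
    if not 0 <= chassis <= 7:
        raise ValueError(f"chassis must be in 0..7, got {chassis}")


def shelf_partner(chassis: int) -> int:
    """Return the adjacent chassis that shares a rectifier shelf.

    Shelves pair chassis 0 and 1, 2 and 3, 4 and 5, 6 and 7, so the
    partner is found by flipping the lowest bit of the chassis number.
    """
    _check_chassis(chassis)
    return chassis ^ 1


def cooling_path(chassis: int) -> dict:
    """Return the CMMs and CEC that share a CDU enable path with this chassis.

    Even-numbered CMMs are in a path with CEC0, odd-numbered CMMs with CEC1.
    """
    _check_chassis(chassis)
    half = chassis % 2
    return {
        "cmms": [c for c in range(8) if c % 2 == half],
        "cec": f"CEC{half}",
    }
```

For example, `shelf_partner(4)` gives the chassis whose CMM must stay in place if the CMM in chassis 4 is removed, and `cooling_path(4)` lists the other controllers that can keep CDU cooling enabled for that half of the cabinet.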
The compute node (CN) is where high performance computing applications are run. Compute nodes have hostnames of the form nidXXXXXX, that is, nid followed by six digits, where XXXXXX is a six-digit number padded with leading zeros.
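The compute node hostname convention can be expressed as a pair of conversion helpers. This is a minimal sketch; the function names are hypothetical and chosen for this example only.

```python
import re

# "nid" followed by exactly six ASCII digits.
_NID_HOSTNAME = re.compile(r"nid([0-9]{6})")


def nid_to_hostname(nid: int) -> str:
    """Format a node ID as a hostname, zero-padded to six digits."""
    if not 0 <= nid <= 999999:
        raise ValueError(f"node ID does not fit in six digits: {nid}")
    return f"nid{nid:06d}"


def hostname_to_nid(hostname: str) -> int:
    """Extract the numeric node ID from a compute node hostname."""
    match = _NID_HOSTNAME.fullmatch(hostname)
    if match is None:
        raise ValueError(f"not a compute node hostname: {hostname!r}")
    return int(match.group(1))
```

With these helpers, node ID 1 maps to `nid000001`, and `nid001234` maps back to 1234.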
NOTE: CRUS was deprecated in CSM 1.2.0 and removed in CSM 1.5.0. See Deprecated Features.
See Rolling Upgrades using BOS.
The Configuration Framework Service (CFS) is available on systems for remote execution and configuration management of nodes and boot images. This includes nodes available in the Hardware State Manager (HSM) service inventory (compute, management, and application nodes), and boot images hosted by the Image Management Service (IMS).
CFS configures nodes and images via a GitOps methodology. All configuration content is stored in the Version Control Service (VCS), and is managed by authorized system administrators. CFS provides a scalable Ansible Execution Environment (AEE) for the configuration to be applied with flexible inventory and node targeting options.
The Content Projection Service (CPS) provides the root filesystem for compute nodes and application nodes in conjunction with the Data Virtualization Service (DVS). Using CPS and DVS, the HPE Cray Programming Environment (CPE) and Analytics products are provided as separately mounted filesystems to compute nodes, application nodes (such as UANs), and worker nodes.
See:
The Cray Advanced Platform Monitoring and Control (CAPMC) service enables direct hardware control of power on/off, power monitoring, or system-wide power telemetry and configuration parameters from Redfish. CAPMC implements a simple interface for powering on/off compute nodes and application nodes, querying node state information, and querying site-specific service usage rules. These controls enable external software to more intelligently manage system-wide power consumption or configuration parameters. CAPMC is replaced by Power Control Service (PCS).
CAPMC was deprecated in CSM 1.5 and may be removed in the future. Power Control Service (PCS) is the replacement for CAPMC.
The cray command line interface (CLI) is a framework created to integrate all of the system management REST APIs into easily usable commands.
HPE Cray Supercomputing Operating System Software (or COS) is a Cray product that may be installed on CSM systems. COS consists of COS Base, HPE SUSE Linux Enterprise Operating System (SLE), and User Services Software components.
COS Base software consists of the COS modified kernel and dependent packages.
The Cray Programming Environment is a Cray product that may be installed on CSM systems.
The Cray Security Token Service (STS) generates short-lived Ceph S3 credentials.
For more information on the STS API, see STS Token Generator API.
The Cray Site Init (CSI) program creates, validates, installs, and upgrades an HPE Cray EX system. CSI can prepare the LiveCD for booting the PIT node and then is used from a booted PIT node to do its other functions during an installation. During an upgrade, CSI is installed on one of the nodes to facilitate the CSM software upgrade.
Cray System Management (CSM) refers to the product stream which provides the infrastructure to manage an HPE Cray EX system. CSM uses Kubernetes to manage the containerized workload of layered microservices with well-defined REST APIs. These APIs provide the ability to discover and control the hardware platform, manage configuration of the system, configure the network, boot nodes, gather log and telemetry data, and connect API access and user-level access to Identity Providers (IdPs). They also provide a method for system administrators and end users to access the HPE Cray EX system.
The CSM Automatic Network Utility (CANU) is a tool used to generate, validate, and test the network in a CSM environment.
For more information, see CSM Automatic Network Utility.
The Customer Access Network (CAN) provides access from outside the customer network to services, Non-Compute Nodes (NCNs), and User Access Nodes (UANs) in the system. This allows for the following:
These nodes and services need an IP address that routes to the customer’s network in order to be accessed from outside the network.
For more information, see:
For more information on the CHN, see Customer Accessible Networks.
For more information on the CMN, see Customer Accessible Networks.
The Data Virtualization Service (DVS) is a distributed network service that projects file systems mounted on Non-Compute Nodes (NCNs) to other nodes within the HPE Cray EX system. Projecting is the process of making a file system available on nodes where it does not physically reside. DVS-specific configuration settings enable clients to access a file system projected by DVS servers. These clients include compute nodes and User Access Nodes (UANs). Thus DVS, while not a file system, represents a software layer that provides scalable transport for file system services. DVS is integrated with the Content Projection Service (CPS).
The Liquid-Cooled Olympus cabinet is a dense compute cabinet that supports 64 compute blades and 64 High Speed Network (HSN) switches.
A Liquid-Cooled TDS cabinet is a dense compute cabinet that supports 2 chassis, 16 compute blades, and 16 High Speed Network (HSN) switches, and includes a rack-mounted 4U Coolant Distribution Unit (CDU) (MCDU-4U).
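The cabinet capacities quoted for the Olympus and TDS cabinets follow from the per-chassis figures given in the CMM entry (8 compute blades and 8 switches per chassis). The sketch below checks that arithmetic. The function name is hypothetical, and the Olympus chassis count of 8 is inferred from the chassis numbers 0 through 7 listed in the CMM entry.

```python
# Per-chassis figures from the CMM entry of this glossary.
BLADES_PER_CHASSIS = 8
SWITCHES_PER_CHASSIS = 8


def cabinet_capacity(chassis_count: int) -> dict:
    """Return the blade and switch capacity of a liquid-cooled cabinet."""
    if chassis_count < 1:
        raise ValueError("a cabinet needs at least one chassis")
    return {
        "compute_blades": chassis_count * BLADES_PER_CHASSIS,
        "hsn_switches": chassis_count * SWITCHES_PER_CHASSIS,
    }


# Olympus cabinet: chassis 0..7 (inferred from the CMM entry).
olympus = cabinet_capacity(8)
# TDS cabinet: 2 chassis, as stated in its glossary entry.
tds = cabinet_capacity(2)
```

Here `olympus` works out to 64 compute blades and 64 HSN switches, and `tds` to 16 of each, matching the two entries above.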
The Slingshot fabric consists of the switches, cables, ports, topology policy, and configuration settings for the Slingshot High-Speed Network.
The Firmware Action Service (FAS) provides an interface for managing firmware versions of Redfish-enabled hardware in the system. FAS interacts with the Hardware State Manager (HSM), device data, and image data in order to update firmware.
A floor-standing Coolant Distribution Unit (CDU) pumps liquid coolant through a cabinet group or cabinet chilled doors.
The hardware management network (HMN) includes HMS embedded controllers. This includes chassis controllers (cC), node controllers (nC), and switch controllers (sC) for Liquid-Cooled TDS and Liquid-Cooled Olympus systems. For standard rack systems, this includes iPDUs, COTS server BMCs, or any other equipment that requires hardware management with Redfish. The hardware management network is isolated from all other node management networks. An out-of-band Ethernet management switch and hardware management VLAN are used for customer access and administration of hardware.
The Hardware Management Notification Fanout Daemon (HMNFD) service receives component state change notifications from the HSM. It fans notifications out to subscribers (typically compute nodes).
For more information on the HMNFD API, see HMNFD API.
The Hardware State Manager (HSM) service monitors and interrogates hardware components in an HPE Cray EX system, tracking hardware state and inventory information, and making it available via REST queries and message bus events when changes occur.
For historical reasons, SMD is also used to refer to the Hardware State Manager.
The Heartbeat Tracker Daemon (HBTD) service listens for heartbeats from components (mainly compute nodes). It tracks changes in heartbeats and conveys changes to the HSM.
For more information on the HBTD API, see HBTD API.
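The general idea of heartbeat tracking (record when each component was last heard from, and report only state changes) can be sketched as follows. This is not the HBTD implementation or its API. The class, its method names, the timeout parameter, and the callback standing in for "conveying changes to the HSM" are all hypothetical.

```python
from typing import Callable, Dict


class HeartbeatTracker:
    """Toy tracker that reports only transitions between alive and missing."""

    def __init__(self, timeout_s: float, on_change: Callable[[str, bool], None]):
        self.timeout_s = timeout_s          # assumed: silence longer than this means "missing"
        self.on_change = on_change          # hypothetical hook, e.g. a call that notifies HSM
        self._last_seen: Dict[str, float] = {}
        self._alive: Dict[str, bool] = {}

    def heartbeat(self, component: str, now: float) -> None:
        """Record a heartbeat; report a change if the component was not alive."""
        self._last_seen[component] = now
        if not self._alive.get(component, False):
            self._alive[component] = True
            self.on_change(component, True)

    def check(self, now: float) -> None:
        """Mark components whose heartbeats have stopped and report them once."""
        for component, seen in self._last_seen.items():
            if self._alive[component] and now - seen > self.timeout_s:
                self._alive[component] = False
                self.on_change(component, False)


# Usage: collect changes in a list instead of sending them anywhere.
changes = []
tracker = HeartbeatTracker(timeout_s=10.0, on_change=lambda c, up: changes.append((c, up)))
tracker.heartbeat("nid000001", now=0.0)   # first heartbeat: reported as alive
tracker.check(now=5.0)                    # still within the timeout: nothing reported
tracker.check(now=11.0)                   # heartbeat overdue: reported as missing
```

Reporting only transitions keeps the downstream consumer from being flooded with one message per heartbeat.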
One component of multi-tenancy support.
For more information, see Multi-Tenancy Support.
The High Speed Network (HSN) in an HPE Cray EX system is based on the Slingshot switches.
The Image Management Service (IMS) uses the open source Kiwi-NG tool to build image roots from recipes. IMS also uses CFS to apply image customization for pre-boot configuration of the image root. These images are bootable on compute nodes and application nodes.
The Install and Upgrade Framework (IUF) provides a centralized and consistent method for installing and upgrading software on CSM systems. It automates large portions of the install/upgrade processes in order to simplify, optimize, and unify them.
The IUF has an API, but it is only intended to be used by the IUF CLI, not directly by administrators.
For more information, see JSON Web Tokens (JWTs).
The management nodes are a type of Non-Compute Node (NCN). Management nodes provide containerization services as well as storage classes.
The management nodes have various roles:
See Olympus cabinet. Some software and documentation refers to the Olympus cabinet as a Mountain cabinet.
The Mountain Endpoint Discovery Service (MEDS) manages initial discovery, configuration, and geolocation of Redfish-enabled BMCs in Liquid-Cooled Olympus cabinets. It periodically makes Redfish requests to determine if hardware is present or missing.
The NCN Lifecycle Service (NLS) and Install and Upgrade Framework (IUF) services together provide automation in the areas of installing and upgrading software on a system, as well as rolling out that software onto nodes as part of rebuild workflows. The APIs interact with Argo to orchestrate these workflows, so administrators can use the Argo UI to monitor and visualize these automations.
For more information on the NLS API, see NLS API.
The NIC mezzanine card (NMC) attaches to two host port connections on a Liquid-Cooled compute blade node card and provides the High Speed Network (HSN) controllers (NICs). There are typically two or four NICs on each node card. NMCs connect to the rear panel EXAMAX connectors on the compute blade through an internal L0 cable assembly in a single-, dual-, or quad-injection bandwidth configuration depending on the design of the node card.
Each compute blade node card includes an embedded node controller (nC) and REST endpoint to manage the node environmental conditions, power, HMS nFPGA interface, and firmware.
The Node Management Network (NMN) communicates with motherboard PCH-style hosts, typically 10GbE Ethernet LAN-on-motherboard (LOM) interfaces. This network supports node boot protocols (DHCP/TFTP/HTTP), in-band telemetry and event exchange, and general access to management REST APIs.
The Node Memory Dump service is used to interact with node memory dumps.
The non-compute nodes (NCNs) are in the management plane; these nodes serve infrastructure for microservices (e.g. Kubernetes and storage classes).
For more information, see Non-Compute Nodes.
The Olympus cabinet is a Liquid-Cooled dense compute cabinet that supports 64 compute blades and 64 High Speed Network (HSN) switches. Every HPE Cray EX system with Olympus cabinets will also have at least one River cabinet to house non-compute node components such as management nodes, management network switches, storage nodes, application nodes, and possibly other air-cooled compute nodes. Some software and documentation refers to the Olympus cabinet as a Mountain cabinet.
Parallel Application Launch Service is a Cray product that may be installed on CSM systems.
The Power Control Service (PCS) enables direct hardware control of power on/off, power status, and power capping via Redfish. PCS implements a simple interface for powering on/off compute nodes and application nodes and setting power caps. These controls enable external software to more intelligently manage system-wide power consumption. PCS is the replacement for CAPMC.
The cabinet PDU receives 480VAC 3-phase facility power and provides circuit breaker, fuse protection, and EMI filtered power to the rectifier/power supplies that distribute ±190VDC (HVDC) to a chassis. PDUs are passive devices that do not connect to the SMNet.
The Pre-Install Toolkit (PIT), also known as the Cray Pre-Install Toolkit, provides a framework for installing Cray System Management (CSM). The PIT can be used on any node in the system for recovery and bare-metal discovery. It includes tooling for recovering non-compute nodes (NCNs) and can remotely recover other NCNs.
During a CSM installation, the first Kubernetes master (ncn-m001) is typically chosen for running the PIT. After CSM is installed, the node running the PIT is rebooted and deployed via CSM services before finally joining the running Kubernetes cluster.
The PIT is delivered as a LiveCD, a disk image that can be used to boot a node either remotely (e.g. as a RemoteISO) or from a USB stick.
The term LiveCD refers to the literal image file that contains the Pre-Install Toolkit.
The term RemoteISO refers to a LiveCD that is remotely mounted on a server. A remotely mounted LiveCD has no persistence; rebooting a RemoteISO loses all data and information from the running session.
The rack-mounted Coolant Distribution Unit (CDU) (MCDU-4U) pumps liquid coolant through the Liquid-Cooled TDS cabinet coolant manifolds.
Air-Cooled compute cabinets house a cluster of compute nodes, Slingshot ToR switches, and SMNet ToR switches.
The Redfish Translation Service (RTS) aids in management of any hardware components which are not managed by Redfish, such as a ServerTech PDU in a River cabinet.
At least one 19-inch IEA management cabinet is required for every HPE Cray EX system to support the management nodes, system management network, utility storage, and other support equipment. This cabinet serves as the primary customer access point for managing the system.
The Rosetta ASIC is a 64-port switch chip that forms the foundation for the Slingshot network. Each port can operate at either 100G or 200G. Each network edge port supports IEEE 802.3 Ethernet, optimized-IP based protocols, and portals (an enhanced frame format that supports higher rates of small messages).
The Scalable Boot Projection Service (SBPS) provides the root filesystem for compute nodes and application nodes using iSCSI. In addition, the HPE Cray Programming Environment (CPE) and Analytics products leverage SBPS to provide content to compute nodes and application nodes (such as UANs). CPE and Analytics are provided as separately mounted filesystems that are mounted alongside the root filesystem.
An Air-Cooled service/IO cabinet houses a cluster of NCNs, Slingshot ToR switches, and management network ToR switches to support the managed ecosystem storage, network, user access services (UAS), and other IO services such as LNet and gateways.
The Shasta Cabling Diagram (SHCD) is a multi-tab spreadsheet prepared by HPE Cray Manufacturing with information about the components in an HPE Cray EX system. Included in the SHCD are a configuration summary with revision history, floor layout plan, type and location of components in the Air-Cooled cabinets, type and location of components in the Liquid-Cooled cabinets, device diagrams for switches and nodes in the cabinets, list of source and destination of every HSN cable, list of source and destination of every cable connected to the spine switches, list of source and destination of every cable connected to the NMN, list of source and destination of every cable connected to the HMN, list of cabling for the KVM, and routing of power to the PDUs.
CSM uses S3 to store a variety of data and artifacts.
Slingshot supports L1 and L2 network connectivity between 200 Gb/s switch ports and L0 connectivity from a single 200 Gb/s port to two 100 Gb/s Mellanox ConnectX-5 NICs. Slingshot also supports edge ports and link aggregation groups (LAG) to external storage systems or networks.
200GBASE-DR4, 500 meter single-mode fiber
200GBASE-SR4, 100 meter multi-mode fiber
200GBASE-CR4, 3 meter copper cable
100GBASE-SR2, 100 meter multi-mode fiber
100GBASE-CR2, 3 meter copper cable
100GBASE-CR4, 5 meter copper cable
100GBASE-SR4, 100 meter multi-mode fiber
See also:
The Liquid-Cooled Olympus cabinet blade switch supports one switch ASIC and 48 fabric ports. Eight connectors on the rear panel connect orthogonally to each compute blade then to NIC mezzanine cards (NMCs) inside the compute blade. Each rear panel EXAMAX connector supports two switch ports (a total of 16 fabric ports per blade). Twelve QSFP-DD cages on the front panel (4 fabric ports per QSFP-DD cage), fan out 48 external fabric ports to other switches. The front-panel top ports support passive electrical cables (PEC) or active optical cables (AOC). The front-panel bottom ports support only PECs for proper cooling in the blade enclosure.
Slingshot Host Software is a Cray product that may be installed on CSM systems to support Slingshot.
A standard River cabinet can support one, two, or four rack-mounted Slingshot ToR switches. Each switch supports a total of 64 fabric ports. 32 QSFP-DD connectors on the front panel connect 64 ports to the fabric. All front-panel connectors support either passive electrical cables (PEC) or active optical cables (AOC).
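The port counts quoted for the blade switch and the ToR switch can be cross-checked against the 64-port Rosetta ASIC with a few lines of arithmetic. The sketch below uses only figures stated in this glossary; the function and variable names are hypothetical. The aggregate bandwidth figure assumes every port runs at 200G.

```python
ROSETTA_PORTS = 64          # ports on the Rosetta switch ASIC
PORT_SPEED_GBPS = 200       # assumed: every port operating at 200G


def blade_switch_ports() -> dict:
    """Port budget of the Liquid-Cooled Olympus cabinet blade switch."""
    rear = 8 * 2            # 8 rear EXAMAX connectors, 2 switch ports each
    front = 12 * 4          # 12 front QSFP-DD cages, 4 fabric ports each
    return {"rear": rear, "front": front, "total": rear + front}


def tor_switch_ports() -> dict:
    """Port budget of the rack-mounted Slingshot ToR switch."""
    connectors = 32         # front-panel QSFP-DD connectors
    total = 64              # fabric ports per switch
    return {"per_connector": total // connectors, "total": total}


blade = blade_switch_ports()
tor = tor_switch_ports()

# Both switch types use every port of one 64-port ASIC.
uses_full_asic = blade["total"] == ROSETTA_PORTS and tor["total"] == ROSETTA_PORTS

# Aggregate port bandwidth in Gb/s (one direction) under the 200G assumption.
aggregate_gbps = ROSETTA_PORTS * PORT_SPEED_GBPS
```

The blade switch splits its 64 ports into 16 rear ports toward compute blades and 48 front ports toward other switches, while the ToR switch exposes all 64 ports on the front panel at two ports per QSFP-DD connector.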
Manual coolant supply and return shutoff valves at the top of each cabinet can be closed to isolate a single cabinet from the other cabinets in the cooling group for maintenance. If the valves are closed during operation, the action automatically causes the CMMs to remove ±190VDC from each chassis in the cabinet because of the loss of coolant pressure.
The System Admin Toolkit (SAT) provides the sat command line interface, which interacts with the REST APIs of many services to perform more complex system management tasks.
For more information, see System Admin Toolkit.
The System Configuration Service (SCSD) allows administrators to set various BMC and controller parameters. These parameters are typically set during discovery, but this tool enables parameters to be set before or after discovery.
For more information on the SCSD API, see SCSD API.
The System Diagnostic Utility is a Cray product that may be installed on CSM systems to provide diagnostic tools.
The System Layout Service (SLS) serves as a “single source of truth” for the system design. It details the physical locations of network hardware, management nodes, application nodes, compute nodes, and cabinets. It also stores information about the network, such as which port on which switch should be connected to each node.
The System Management Network (SMNet) is a dedicated out-of-band (OOB) spine-leaf topology Ethernet network that interconnects all the nodes in the system to management services.
System Management Services (SMS) leverages open REST APIs, Kubernetes container orchestration, and a pool of commercial off-the-shelf (COTS) servers to manage the system. The management server pool, custom Redfish-enabled embedded controllers, iPDU controllers, and server BMCs are unified under a common software platform that provides 3 levels of management: Level 1 HaaS, Level 2 IaaS, and Level 3 PaaS.
System Management Services (SMS) nodes provide access to the entire management cluster and Kubernetes container orchestration.
The System Monitoring Application (SMA) is one of the services that collects CSM system data for administrators.
Another name for the System Monitoring Application (SMA) Framework.
One component of multi-tenancy support.
The Air-Cooled cabinet HSN ToR switch embedded controller (sC-ToR) provides a hardware management REST endpoint to monitor the ToR switch environmental conditions and manage the switch power, HSN ASIC, and FPGA interfaces.
The User Access Node (UAN) is an NCN, and more specifically one of the special types of application nodes. The UAN provides a traditional multi-user Linux environment for users on an HPE Cray EX system to develop, build, and execute their applications on the HPE Cray EX compute nodes. Some sites refer to their UANs as login nodes.
HPE Cray Supercomputing User Services Software (or USS) contains user space packages, kernel modules, microservices, configuration content, and other components. USS adds content on top of COS Base (the modified COS kernel) without modifying the kernel directly.
The Version Control Service (VCS) provides configuration content to CFS via a GitOps methodology based on a git server (gitea) that can be accessed by the git command but also includes a web interface for repository management, pull requests, and a visual view of all repositories and organizations.
For more information, see Version Control Service.
The Virtual Network Identifier Daemon is part of the Cray Slingshot product that may be installed on CSM systems.
Component names (xnames) identify the geolocation for hardware components in the HPE Cray EX system. Every component is uniquely identified by these component names. Some, like the system cabinet number or the Coolant Distribution Unit (CDU) number, can be changed to suit site needs. There is no geolocation encoded within the cabinet number, such as an X-Y coordinate system to relate to the floor layout of the cabinets. Other component names refer to the location within a cabinet and go down to the port on a card or switch, the socket holding a processor, or a memory DIMM location.
For more information, see Component Names (xnames).
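To illustrate how a component name encodes a location, the sketch below splits a node xname into its parts. The glossary itself does not spell out the xname format, so the pattern used here (cabinet, chassis, slot, BMC, node, written as `xXcCsSbBnN`) is an assumption for this example; consult the Component Names (xnames) documentation for the authoritative definitions. The function name and the field labels are hypothetical.

```python
import re

# Assumed node xname layout: x<cabinet>c<chassis>s<slot>b<bmc>n<node>
_NODE_XNAME = re.compile(r"x([0-9]+)c([0-9]+)s([0-9]+)b([0-9]+)n([0-9]+)")


def parse_node_xname(xname: str) -> dict:
    """Split a node xname into numeric location fields."""
    match = _NODE_XNAME.fullmatch(xname)
    if match is None:
        raise ValueError(f"not a node xname under the assumed pattern: {xname!r}")
    cabinet, chassis, slot, bmc, node = (int(g) for g in match.groups())
    return {
        "cabinet": cabinet,
        "chassis": chassis,
        "slot": slot,
        "bmc": bmc,
        "node": node,
    }
```

Under this assumption, `parse_node_xname("x1000c0s0b0n0")` yields cabinet 1000 with chassis, slot, BMC, and node all 0.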