
Node Management Workflows

The following workflows are high-level overviews of node management tasks. They depict how services interact with each other during node management and help provide a quicker and deeper understanding of how the system functions.

The workflows and procedures in this section include:

Add Nodes
Remove Nodes
Replace Nodes
Move Nodes

Add Nodes

Add a Standard Rack Node

Use Cases: Administrator permanently adds select compute nodes to expand the system.

Components: This workflow is based on the interaction of the System Layout Service (SLS) with other hardware management services (HMS).

Mentioned in this workflow:

System Layout Service (SLS) serves as a “single source of truth” for the system design. It details the physical locations of network hardware, compute nodes and cabinets. Further, it stores information about the network, such as which port on which switch should be connected to each compute node.
Hardware State Manager (HSM) monitors and interrogates hardware components in an HPE Cray EX system, tracking hardware state and inventory information, and making it available via REST queries and message bus events when changes occur.
HMS Notification Fanout Daemon (hmnfd) receives component state change notifications from the HSM. It fans notifications out to subscribers (typically compute nodes).
Endpoint Discovery Service (REDS/MEDS) manages initial discovery, configuration, and geolocation of Redfish-enabled BMCs. It periodically makes Redfish requests to determine if hardware is present or missing.
Heartbeat Tracker Service (hbtd) listens for heartbeats from components (mainly compute nodes). It tracks changes in heartbeats and conveys changes to HSM.

Add Node Workflow

Workflow Overview: The following sequence of steps occurs during this workflow.

1. Administrator updates SLS

The Administrator creates a new hardware entry in SLS for the selected component names (xnames) by entering the node xnames in the SLS input file.

2. Administrator adds compute nodes

The Administrator physically adds the selected compute nodes and powers them on. Because the nodes are unknown, the DHCP and TFTP servers give them the special initialization ramdisk. The compute nodes then perform local configuration.

The following steps (3-11) occur automatically as different APIs interact with each other.

3. Set BMC credentials

Each compute node requests per-node BMC credentials. The request must include the MAC address of the BMC. The discovery service generates a new set of credentials.

Once the compute node is powered on, initialized, and discovered, REDS collects details about the new node, such as its IP address, MAC address, and state, and sets the username and password for the BMC.

4. REDS/MEDS to SLS

REDS/MEDS query the SLS database for information about the new node.

For example: “What component name (xname) is connected to port XX on switch Y?”

5. SLS to REDS/MEDS

SLS returns the new compute node and its component name (xname) to the discovery service.

For example: “xname x0c0… is connected to port XX”.

6. REDS/MEDS to HSM

The discovery services notify HSM of the new Redfish endpoint for the node. Details such as the component name (xname) and IP address of the new node are updated in HSM.

For example: “x0c0… at IP address AAA.BBB.CCC.DDD”

7. HSM to SLS

HSM queries SLS for the NID and role assignments of the new node.

8. SLS to HSM

HSM updates the Nodemap based on the information received from SLS.

9. Node to Heartbeat Tracker Service

The Heartbeat Tracker Service receives heartbeats from the new compute node after the node is powered on.

10. Heartbeat Tracker Service to HSM

The Heartbeat Tracker Service reports the heartbeat status to HSM.

11. HSM to HMNFD

HSM sends the new compute node's state information, with State set to ON, to HMNFD. HMNFD fans out these notifications to the subscribing compute nodes.
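
The automated part of this workflow can be pictured with a small in-memory model. This is only a sketch for illustration: the `SLS` and `HSM` classes, the `discover` function, and every xname, port, NID, role, and IP address below are hypothetical stand-ins, not the real service APIs or data formats.

```python
from dataclasses import dataclass, field


@dataclass
class SLS:
    """Hypothetical stand-in for SLS: the planned system layout."""
    port_map: dict      # (switch xname, port) -> node xname
    assignments: dict   # node xname -> (NID, role)


@dataclass
class HSM:
    """Hypothetical stand-in for HSM plus the HMNFD fan-out."""
    sls: SLS
    components: dict = field(default_factory=dict)
    subscribers: list = field(default_factory=list)

    def add_endpoint(self, xname, ip):
        # "HSM to SLS" / "SLS to HSM": fetch NID and role, then record the node.
        nid, role = self.sls.assignments[xname]
        self.components[xname] = {"ip": ip, "nid": nid, "role": role, "state": None}

    def heartbeat_reported(self, xname):
        # "Heartbeat Tracker Service to HSM", then "HSM to HMNFD".
        self.components[xname]["state"] = "ON"
        for notify in self.subscribers:
            notify(xname, "ON")


def discover(sls, hsm, switch, port, ip):
    # "REDS/MEDS to SLS" / "SLS to REDS/MEDS": resolve the xname from the switch port.
    xname = sls.port_map[(switch, port)]
    # "REDS/MEDS to HSM": report the new Redfish endpoint.
    hsm.add_endpoint(xname, ip)
    return xname


# Illustrative values only.
sls = SLS(
    port_map={("x3000c0w14", 7): "x3000c0s19b1n0"},
    assignments={"x3000c0s19b1n0": (1001, "Compute")},
)
notifications = []
hsm = HSM(sls=sls, subscribers=[lambda xname, state: notifications.append((xname, state))])

xname = discover(sls, hsm, switch="x3000c0w14", port=7, ip="10.0.0.5")
hsm.heartbeat_reported(xname)
print(hsm.components[xname])
print(notifications)
```

The model makes one design point visible: the discovery step never invents an xname. It only looks up what SLS already records for that switch port, which is why the Administrator must update SLS before adding the hardware.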

Remove Nodes

Use Cases: Administrator permanently removes select compute nodes to contract the system.

Components: This workflow is based on the interaction of the System Layout Service (SLS) with other hardware management services (HMS).

Mentioned in this workflow:

System Layout Service (SLS) serves as a “single source of truth” for the system design. It details the physical locations of network hardware, compute nodes and cabinets. Further, it stores information about the network, such as which port on which switch should be connected to each compute node.
Hardware State Manager (HSM) monitors and interrogates hardware components in an HPE Cray EX system, tracking hardware state and inventory information, and making it available via REST queries and message bus events when changes occur.
HMS Notification Fanout Daemon (hmnfd) receives component state change notifications from the HSM. It fans notifications out to subscribers (typically compute nodes).
Endpoint Discovery Service (REDS/MEDS) manages initial discovery, configuration, and geolocation of Redfish-enabled BMCs. It periodically makes Redfish requests to determine if hardware is present or missing.
Heartbeat Tracker Service (hbtd) listens for heartbeats from components (mainly compute nodes). It tracks changes in heartbeats and conveys changes to HSM.
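
The Heartbeat Tracker Service's role, tracking heartbeats and conveying changes to HSM, can be sketched as a timeout check. This is a minimal illustration under assumptions: the `HeartbeatTracker` class, the 10-second timeout, the event strings, and the xname are hypothetical and do not reflect how hbtd is actually implemented or configured.

```python
class HeartbeatTracker:
    """Hypothetical sketch: report only *changes* in heartbeat status."""

    def __init__(self, timeout_s, report):
        self.timeout_s = timeout_s   # assumed timeout, not an hbtd setting
        self.report = report         # callback standing in for the update to HSM
        self.last_seen = {}          # xname -> time of last heartbeat
        self.alive = {}              # xname -> whether heartbeats are current

    def heartbeat(self, xname, now):
        self.last_seen[xname] = now
        if not self.alive.get(xname):
            self.alive[xname] = True
            self.report(xname, "heartbeat started")

    def check(self, now):
        for xname, seen in self.last_seen.items():
            if self.alive[xname] and now - seen > self.timeout_s:
                self.alive[xname] = False
                self.report(xname, "heartbeat lost")


# Illustrative run with explicit timestamps (seconds).
events = []
tracker = HeartbeatTracker(timeout_s=10, report=lambda x, e: events.append((x, e)))
tracker.heartbeat("x3000c0s19b1n0", now=0)
tracker.heartbeat("x3000c0s19b1n0", now=5)
tracker.check(now=12)   # 7 s since the last heartbeat: nothing to report
tracker.check(now=20)   # 15 s since the last heartbeat: loss is reported once
print(events)
```

Passing the time in explicitly keeps the sketch deterministic; a real daemon would read a clock instead.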

Remove Node Workflow

Workflow Overview: The following sequence of steps occurs during this workflow.

1. Administrator updates SLS

The Administrator deletes the node entries with the specific component names (xnames) from SLS. Deleting a parent object also deletes its children from SLS; if a child object is itself a parent, the deletion cascades down through further levels. Deleting a child object does not affect its parent.

2. Administrator physically removes the compute nodes

The Administrator powers off and physically removes the compute nodes.

The following steps (3-9) occur automatically as different APIs interact with each other.

3. No heartbeats

The Heartbeat Tracker Service stops receiving heartbeats and marks the nodes' status as standby and then, following the Redfish event, as off.

Standby status implies that the node is no longer ready and is presumed dead. It typically means that the heartbeat is lost. Off status implies that the component is powered off.

4. Heartbeat Tracker Service to HSM

The Heartbeat Tracker Service reports the heartbeat status to HSM.

5. REDS/MEDS detect no BMC

The discovery service detects that the BMC is no longer present.

6. REDS/MEDS to SLS

REDS/MEDS query the SLS database for information about the missing BMCs.

7. SLS to REDS/MEDS

SLS informs the discovery service that the BMC was removed.

8. REDS/MEDS to HSM

The discovery services notify HSM that the BMC Redfish endpoints for the nodes were removed. HSM marks the state of the BMCs and the nodes as empty.

Empty state implies that the location is not populated with a component.

9. HSM to HMNFD

HSM sends the compute node state information, with State set to empty, to HMNFD. HMNFD fans out this notification to the subscribing compute nodes.
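
The cascading deletion described in the first step of this workflow can be sketched with a simple parent map. This is an illustration only: the `delete_hardware` function, the dictionary layout, and the xnames are hypothetical and are not the SLS data model or API.

```python
def delete_hardware(parents, xname):
    """Delete xname and every descendant; return the deleted xnames, sorted.

    `parents` maps each xname to its parent xname (or None at the top).
    """
    doomed = {xname}
    changed = True
    while changed:
        # Keep sweeping until no entry has a parent that is marked for deletion,
        # so the cascade reaches children of children.
        changed = False
        for name, parent in parents.items():
            if parent in doomed and name not in doomed:
                doomed.add(name)
                changed = True
    for name in doomed:
        del parents[name]
    return sorted(doomed)


# Illustrative hierarchy: one slot with two BMCs, each with one node.
parents = {
    "x3000c0s19": None,
    "x3000c0s19b1": "x3000c0s19",
    "x3000c0s19b1n0": "x3000c0s19b1",
    "x3000c0s19b2": "x3000c0s19",
    "x3000c0s19b2n0": "x3000c0s19b2",
}

removed = delete_hardware(parents, "x3000c0s19b1")
print(removed)
print(sorted(parents))
```

Deleting the first BMC takes its node with it, while the parent slot and the sibling BMC with its node remain untouched.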
