Rack Resiliency (RR)

Introduction

HPE Cray Supercomputing EX systems are designed to maintain high availability (HA) for critical services, even if management nodes fail. However, rack-level failures can cause service disruptions if management nodes are concentrated within a single rack. This can result in the loss of HA quorum. Additionally, incorrect physical placement or software configuration of storage nodes can cause utility storage service disruptions due to rack-level failures.

To address these issues, CSM 1.7.0 introduces the Rack Resiliency feature, which provides rack-level resiliency for management racks so that critical management services remain highly available through a single rack failure. This feature prevents system-wide outages, so running user jobs can complete and new jobs can still be scheduled.

NOTE:

  • Rack Resiliency is disabled by default.
  • Rack Resiliency can be enabled only during a fresh install of CSM 1.7 or an upgrade from CSM 1.6 to CSM 1.7.
  • Rack Resiliency cannot be disabled after it has been enabled during the install or upgrade.

Terminology and components

Rack Resiliency Service (RRS)

RRS is the implementation of the Rack Resiliency feature in CSM.

For details on the Kubernetes deployment of RRS, see cray-rrs Deployment.

Racks

Physical racks

A rack is a standardized physical structure designed to house and organize servers and other hardware, such as network switches. Each HPE Cray Supercomputing EX system rack houses NCNs and non-NCNs, along with Slingshot switches. Racks are also referred to as cabinets.

Ceph racks

In the Rack Resiliency lexicon, the term “rack” also has a second meaning: a rack can refer to a logical, hierarchical Ceph bucket in the CRUSH map. A Ceph rack groups together hosts or nodes that are located in the same physical rack.
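The grouping described above can be pictured with a small Python sketch. It is only an illustration of the idea, not Ceph's data model or API: the function name, host names, and rack names are all hypothetical.

```python
from collections import defaultdict


def build_crush_racks(host_to_rack):
    """Group hosts under rack buckets, mimicking the rack level of a CRUSH hierarchy.

    host_to_rack maps a host name to the physical rack that houses it.
    Returns a dict of rack name -> sorted list of host names.
    """
    racks = defaultdict(list)
    for host, rack in sorted(host_to_rack.items()):
        racks[rack].append(host)
    return dict(racks)


# Hypothetical storage hosts and the physical racks they sit in.
host_to_rack = {
    "ncn-s001": "rack-a",
    "ncn-s002": "rack-b",
    "ncn-s003": "rack-c",
    "ncn-s004": "rack-a",
}

crush_racks = build_crush_racks(host_to_rack)
for rack, hosts in sorted(crush_racks.items()):
    print(rack, hosts)
```

Each key in the result plays the role of one Ceph rack bucket, and its list holds the hosts located in that physical rack.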

Placement

Placement refers to the physical arrangement of nodes across racks.
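To see why placement matters for HA quorum, the following minimal Python sketch checks whether a set of nodes keeps a quorum after any one rack is lost. It assumes a strict-majority quorum rule, and the node names, rack names, and function name are hypothetical; it is not part of CSM.

```python
from collections import Counter


def survives_single_rack_failure(node_to_rack):
    """Return True if a strict majority of nodes remains after losing any one rack.

    node_to_rack maps a node name to the rack that houses it.
    Assumption: quorum means more than half of all nodes.
    """
    total = len(node_to_rack)
    quorum = total // 2 + 1
    nodes_per_rack = Counter(node_to_rack.values())
    return all(total - lost >= quorum for lost in nodes_per_rack.values())


# Two of three nodes share a rack: losing that rack breaks quorum.
concentrated = {"ncn-m001": "rack-a", "ncn-m002": "rack-a", "ncn-m003": "rack-b"}

# One node per rack: any single rack can fail and two of three nodes remain.
distributed = {"ncn-m001": "rack-a", "ncn-m002": "rack-b", "ncn-m003": "rack-c"}

print(survives_single_rack_failure(concentrated))
print(survives_single_rack_failure(distributed))
```

In this model, the concentrated layout fails the check and the distributed layout passes it.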

Failure domains

In the context of CSM, a failure domain is the minimum infrastructure that provides high availability for CSM services.

Zones

A zone in the Rack Resiliency solution is a logical representation of a failure domain. There are two varieties of zone in Rack Resiliency: Kubernetes zones and Ceph zones.

Critical services

In the context of Rack Resiliency, critical services are those Kubernetes services that are necessary for the execution of user jobs.

ConfigMaps

Rack Resiliency uses Kubernetes ConfigMaps to store information about critical services and zones. They are also used to provide the configuration parameters for the Resiliency Monitoring Service.

For details, see ConfigMaps.

Kyverno policy

Rack Resiliency uses a Kyverno policy to ensure that critical services survive the failure of nodes or a single rack. This is accomplished by spreading out the replicas of the critical services across multiple zones.
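The effect of spreading replicas across zones can be illustrated with a toy Python model. It is not Kyverno policy syntax and does not describe how the actual policy is implemented; the round-robin assignment, the zone names, and all function names are assumptions made for illustration.

```python
from collections import Counter
from itertools import cycle


def spread_replicas(replica_count, zones):
    """Assign replicas to zones round-robin; returns one zone name per replica."""
    if not zones:
        raise ValueError("at least one zone is required")
    zone_cycle = cycle(zones)
    return [next(zone_cycle) for _ in range(replica_count)]


def max_skew(assignment, zones):
    """Difference between the most and least populated zones."""
    counts = Counter(assignment)
    per_zone = [counts.get(zone, 0) for zone in zones]
    return max(per_zone) - min(per_zone)


def survives_zone_loss(assignment, zones):
    """True if at least one replica remains after losing any single zone."""
    return all(any(zone != lost for zone in assignment) for lost in zones)


zones = ["zone-a", "zone-b", "zone-c"]  # hypothetical zone names
assignment = spread_replicas(4, zones)

print(assignment)
print(max_skew(assignment, zones))
print(survives_zone_loss(assignment, zones))
```

With four replicas over three zones, no zone holds more than one replica above any other, and at least one replica remains whichever single zone is lost. A service whose replicas all sit in one zone fails the same check.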

See Kyverno Policy for more information.

Resiliency Monitoring Service (RMS)

The Resiliency Monitoring Service (RMS) is a part of the Rack Resiliency Service. The RMS provides the functionality to detect rack or node failures and monitor critical services.
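The distinction between a node failure and a rack failure can be sketched in Python as follows. This is a simplified illustration and not the RMS algorithm: the rule used here (a zone with every node down counts as a rack failure) is an assumption, and the zone names, node names, and function name are hypothetical.

```python
def classify_failures(zone_nodes, ready_nodes):
    """Label each zone as 'healthy', 'node failure', or 'rack failure'.

    zone_nodes maps a zone name to the list of nodes in that zone.
    ready_nodes is the set of nodes currently reporting as ready.
    Assumed rule: if every node in a zone is down, treat it as a rack failure.
    """
    report = {}
    for zone, nodes in zone_nodes.items():
        down = [node for node in nodes if node not in ready_nodes]
        if not down:
            report[zone] = "healthy"
        elif len(down) == len(nodes):
            report[zone] = "rack failure"
        else:
            report[zone] = "node failure"
    return report


# Hypothetical zones, their nodes, and the nodes currently ready.
zone_nodes = {
    "zone-a": ["ncn-w001", "ncn-w004"],
    "zone-b": ["ncn-w002", "ncn-w005"],
    "zone-c": ["ncn-w003", "ncn-w006"],
}
ready_nodes = {"ncn-w001", "ncn-w004", "ncn-w002"}

report = classify_failures(zone_nodes, ready_nodes)
for zone, status in sorted(report.items()):
    print(zone, status)
```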

For more information, see Resiliency Monitoring Service.

RRS CLI

The Cray CLI has been updated to support the Rack Resiliency Service. The RRS CLI has subcommands for the following:

RRS API

The RRS RESTful API is used by the RRS CLI and can also be accessed using tools like curl. See Rack Resiliency Service v1 for more information.

Architecture

Rack Resiliency solution overview

The Rack Resiliency solution is implemented in multiple stages. These stages are:

  1. Enabling Rack Resiliency
  2. Setup of Rack Resiliency
  3. Resiliency Monitoring Service

Troubleshooting

For information on how to troubleshoot RRS, see Troubleshooting.