HPE Cray Supercomputing EX systems are designed to maintain high availability (HA) for critical services, even if management nodes fail. However, rack-level failures can cause service disruptions if management nodes are concentrated within a single rack. This can result in the loss of HA quorum. Additionally, incorrect physical placement or software configuration of storage nodes can cause utility storage service disruptions due to rack-level failures.
To address these issues, CSM 1.7.0 introduces the Rack Resiliency feature, which provides rack-level resiliency for management nodes so that critical management services remain highly available during a single rack failure. This prevents system-wide outages, so user jobs can continue to run and new jobs can still be scheduled.
NOTE:
The Rack Resiliency Service (RRS) is the implementation of the Rack Resiliency feature in CSM.
For details on the Kubernetes deployment of RRS, see cray-rrs Deployment.
A rack is a standardized physical structure designed to house and organize servers and other hardware, such as network switches. Each HPE Cray Supercomputing EX system rack houses NCNs and non-NCNs, along with Slingshot switches. Racks are also referred to as cabinets.
In the Rack Resiliency lexicon, the term "rack" also has a second meaning: a logical, hierarchical Ceph bucket in the CRUSH map. A Ceph rack groups together the hosts or nodes that are located in the same physical rack.
Placement refers to the physical arrangement of nodes across racks.
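The effect of placement on HA quorum can be illustrated with a minimal sketch. It assumes a simple majority quorum and a hypothetical mapping of node names to rack names; the node and rack names are illustrative only, and the function is not part of RRS.

```python
from collections import Counter


def survives_single_rack_failure(placement: dict[str, str]) -> bool:
    """Return True if a strict majority of nodes remains after losing any one rack.

    `placement` maps a node name to the rack that houses it (hypothetical input).
    """
    total = len(placement)
    quorum = total // 2 + 1  # simple majority; an assumption for this sketch
    nodes_per_rack = Counter(placement.values())
    # Check every rack: the nodes left after losing it must still reach quorum.
    return all(total - lost >= quorum for lost in nodes_per_rack.values())


# Two of three nodes share a rack: losing that rack leaves one node, below quorum.
concentrated = {"ncn-m001": "x3000", "ncn-m002": "x3000", "ncn-m003": "x3001"}
# One node per rack: losing any rack leaves two nodes, which meets quorum.
spread = {"ncn-m001": "x3000", "ncn-m002": "x3001", "ncn-m003": "x3002"}

print(survives_single_rack_failure(concentrated))
print(survives_single_rack_failure(spread))
```

In this sketch, only the placement that distributes the nodes across three racks tolerates the loss of any single rack.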
In the context of CSM, a failure domain is the minimum infrastructure that provides high availability for CSM services.
A zone in the Rack Resiliency solution is a logical representation of a failure domain. There are two varieties of zone in Rack Resiliency: Kubernetes zones and Ceph zones.
In the context of Rack Resiliency, Critical Services are those Kubernetes services that are necessary for the execution of user jobs.
Rack Resiliency uses Kubernetes ConfigMaps to store information about critical services and zones. They are also used to provide the configuration parameters for the Resiliency Monitoring Service.
For details, see ConfigMaps.
Rack Resiliency uses a Kyverno policy to ensure that critical services survive the failure of nodes or a single rack. This is accomplished by spreading out the replicas of the critical services across multiple zones.
See Kyverno Policy for more information.
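The idea of spreading replicas across zones can be sketched as follows. This is not the Kyverno policy shipped with RRS. It assumes that the spreading is expressed as a Kubernetes topology spread constraint keyed on the standard `topology.kubernetes.io/zone` node label; the helper names, the `app` label, and the zone names are hypothetical.

```python
from collections import Counter

# Well-known Kubernetes node label for zones; its use by RRS is an assumption here.
ZONE_LABEL = "topology.kubernetes.io/zone"


def zone_spread_constraint(app_label: str, max_skew: int = 1) -> dict:
    """Build a topologySpreadConstraints entry that spreads pods across zones."""
    return {
        "maxSkew": max_skew,
        "topologyKey": ZONE_LABEL,
        "whenUnsatisfiable": "ScheduleAnyway",
        "labelSelector": {"matchLabels": {"app": app_label}},
    }


def replica_skew(replica_zones: list[str], all_zones: list[str]) -> int:
    """Return the difference between the most and least populated zones."""
    counts = Counter(replica_zones)
    per_zone = [counts.get(zone, 0) for zone in all_zones]
    return max(per_zone) - min(per_zone)


zones = ["zone-a", "zone-b", "zone-c"]  # hypothetical zone names
print(zone_spread_constraint("example-service"))
print(replica_skew(["zone-a", "zone-a", "zone-b"], zones))
print(replica_skew(["zone-a", "zone-b", "zone-c"], zones))
```

A skew of zero means every zone holds the same number of replicas, so losing any one zone removes the smallest possible share of the service.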
The Resiliency Monitoring Service (RMS) is a part of the Rack Resiliency Service. The RMS provides the functionality to detect rack or node failures and monitor critical services.
For more information, see Resiliency Monitoring Service.
The Cray CLI has been updated to support the Rack Resiliency Service. The RRS CLI has subcommands for the following:
The RRS RESTful API is used by the RRS CLI and can also be accessed using tools like curl.
See Rack Resiliency Service v1 for more information.
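As a rough illustration of calling a RESTful API such as RRS from a script instead of curl, the sketch below builds an authenticated GET request using the Python standard library. The gateway URL, the endpoint path, and the token value are placeholders invented for this example; take the real paths from the Rack Resiliency Service v1 API reference.

```python
import urllib.request


def build_rrs_request(base_url: str, path: str, token: str) -> urllib.request.Request:
    """Build (but do not send) a GET request carrying a bearer token."""
    url = base_url.rstrip("/") + "/" + path.lstrip("/")
    return urllib.request.Request(
        url,
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/json",
        },
        method="GET",
    )


# Placeholder values: the host, path, and token are assumptions, not documented values.
request = build_rrs_request("https://api-gw.example", "/apis/rrs/zones", "TOKEN")
print(request.get_method(), request.full_url)

# To send the request on a real system:
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode())
```

Building the request separately from sending it makes it possible to inspect the URL and headers before anything is sent to the system.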
The Rack Resiliency solution is implemented in multiple stages. These stages are:
For information on how to troubleshoot RRS, see Troubleshooting.