A zone in the Rack Resiliency solution is a failure domain. Specifically, it is a representation of a Management Plane Failure Domain (MPFD). In general for CSM, an MPFD consists of one or more racks that contain CSM management nodes. Managed nodes are not considered in MPFDs.
In Rack Resiliency, if placement validation is successful, then an MPFD always consists of a single rack. That rack contains, at minimum, one Kubernetes master NCN, one Kubernetes worker NCN, and one Ceph storage NCN.
Rack Resiliency maps its zones into both the Kubernetes cluster and Ceph cluster.
By default, Rack Resiliency zone names are the component name (xname) of the associated physical rack (which is the same as the first 5 characters of the xnames of the associated NCNs). For example, `x3000` or `x3001`.
When first Enabling Rack Resiliency, administrators can optionally specify prefixes to be used for Kubernetes zone names, Ceph zone names, or both. These prefixes are prepended to the default zone names described above, separated by a dash (`-`) character (for example, `myprefix-x3002`).
NOTE: Prefixes can only be specified during the initial enablement process. They cannot be changed, removed, or set later.
One reason an administrator may wish to do this is that xnames are not unique across different CSM systems; using a system-specific identifier as a zone prefix differentiates zones on different systems.
For Ceph zones specifically, an administrator may need to specify a prefix if the Ceph cluster already contains a bucket whose name would collide with one of the default zone names (Ceph bucket names must be unique, and zoning creates a Ceph bucket for each zone).
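One way to check for potential collisions before enabling Rack Resiliency is to list the names of the buckets that already exist in the CRUSH map. A brief sketch (the use of `jq` here is illustrative, not part of the Rack Resiliency tooling):

```bash
# Illustrative only: print the names of all existing CRUSH buckets so that they
# can be compared against the default zone names (the rack xnames).
ceph osd crush dump | jq -r '.buckets[].name'
```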
Zone name prefixes are not required. If a non-empty zone name prefix is specified, then it must conform to some restrictions.
Zone names are required to be no longer than 63 characters. Because the base zone names are always 5 characters long, and accounting for the `-` separator, this means that zone name prefixes must be no more than 57 characters long.
In addition, zone prefixes must obey the following restrictions:

- Only lowercase alphanumeric characters (`a-z0-9`), dash (`-`), and dot (`.`) characters are permitted.

NOTE: These restrictions are not checked or enforced. If they are not followed, then any failures are most likely to be encountered during Setup of Rack Resiliency.
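Because these restrictions are not enforced automatically, administrators may wish to check a candidate prefix before enabling Rack Resiliency. A minimal shell sketch of such a check (the `PREFIX` value is illustrative):

```bash
# Illustrative only: verify that a candidate zone name prefix is no more than
# 57 characters long and uses only the permitted characters
# (lowercase alphanumerics, dash, and dot).
PREFIX="myprefix"   # example value; substitute the prefix being considered
if [[ ${#PREFIX} -le 57 && "$PREFIX" =~ ^[a-z0-9.-]+$ ]]; then
    echo "Prefix '$PREFIX' satisfies the documented restrictions"
else
    echo "Prefix '$PREFIX' violates the documented restrictions" >&2
fi
```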
For Kubernetes, Rack Resiliency uses the concept of topology spread constraints to implement zoning of master and worker NCNs.
Kubernetes zones are created by applying labels to the nodes. Each node in every zone is labeled with the key `topology.kubernetes.io/zone` and value `<zone-id>`, where `<zone-id>` is the Kubernetes zone name. These labels identify all of the management nodes that belong to the same Kubernetes zone, and the topology spread constraints use them to schedule the critical services across the zones.
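As an illustration of how this label is consumed, the following sketch lists the deployments in a given namespace that declare a topology spread constraint keyed on `topology.kubernetes.io/zone`. The namespace shown and the use of `jq` are illustrative and not part of the Rack Resiliency tooling.

```bash
# Illustrative only: show which deployments in the chosen namespace spread
# their pods across the zone label that Rack Resiliency applies to the nodes.
NAMESPACE="services"   # example namespace; adjust as needed
kubectl get deployments -n "$NAMESPACE" -o json |
  jq -r '.items[]
         | select([.spec.template.spec.topologySpreadConstraints[]?.topologyKey]
                  | index("topology.kubernetes.io/zone"))
         | .metadata.name'
```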
(`ncn-mw#`) View Kubernetes zones.

```bash
kubectl get nodes -L topology.kubernetes.io/zone
```
Example output:

```text
NAME       STATUS   ROLES           AGE   VERSION   ZONE
ncn-m001   Ready    control-plane   21d   v1.32.5   x3000
ncn-m002   Ready    control-plane   20d   v1.32.5   x3001
ncn-m003   Ready    control-plane   20d   v1.32.5   x3002
ncn-w001   Ready    <none>          20d   v1.32.5   x3000
ncn-w002   Ready    <none>          20d   v1.32.5   x3001
ncn-w003   Ready    <none>          20d   v1.32.5   x3002
ncn-w004   Ready    <none>          20d   v1.32.5   x3000
```
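The zone label can also be used as a selector to list only the nodes that belong to a particular zone; for example, using one of the zone names from the output above:

```bash
kubectl get nodes -l topology.kubernetes.io/zone=x3000
```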
Ceph is the utility storage platform that is used to enable pods to store persistent data. It is deployed to provide block, object, and file storage to the management services running on Kubernetes, as well as for telemetry data coming from the compute nodes.
For Ceph, Rack Resiliency uses the concept of buckets built with the CRUSH algorithm to implement zoning for storage nodes.
The objective of Ceph zoning is to ensure that Ceph data is replicated at the zone level across storage nodes, so that there is no data loss in case of a rack failure. Ceph provides the CRUSH map algorithm, which helps to segregate the data across zones. Using a combination of CRUSH rules and bucket types (`host`, `rack`, `row`, and so on), the data is replicated across zones.
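For illustration only, a replicated CRUSH rule that uses `rack` as its failure domain could be created and inspected with commands like the following; the rule name `replicated_rack` is a hypothetical example, and Rack Resiliency configures the actual rules itself during Ceph zoning.

```bash
# Illustrative only: create a replicated CRUSH rule whose failure domain is the
# rack bucket type, then dump it to confirm the rule selects leaves per rack.
ceph osd crush rule create-replicated replicated_rack default rack
ceph osd crush rule dump replicated_rack
```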
In the absence of Rack Resiliency, CSM has `host` as the top of the hierarchy of Ceph buckets.
To implement Ceph zones for storage nodes, the new `rack` bucket type is introduced at the top of the hierarchy.
As shown in the above diagram, storage nodes are added to a `rack` bucket based on their physical location. The name of each `rack` bucket is its Ceph zone name.
See Placement discovery for details on how physical placement of storage nodes is discovered.
More than one storage node can be added to the same bucket.
Rack Resiliency preconfigures the `rack` buckets and adds the storage nodes to them.
During Ceph zoning, the nodes discovered during placement discovery are grouped into `rack` buckets.
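Rack Resiliency performs these steps automatically. Purely as an illustration of the underlying Ceph operations, creating a `rack` bucket and moving a storage host into it would look something like the following (the bucket and host names are taken from the examples on this page):

```bash
# Illustrative only: create a rack bucket named after the zone, attach it under
# the default CRUSH root, and move a storage node's host bucket beneath it.
# Rack Resiliency does this automatically during Ceph zoning.
ceph osd crush add-bucket x3000 rack
ceph osd crush move x3000 root=default
ceph osd crush move ncn-s001 rack=x3000
```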
The current Ceph setup on CSM deploys three sets of Ceph services (Monitors, Managers, and MDS) on the nodes `ncn-s001`, `ncn-s002`, and `ncn-s003` in a hard-coded configuration.
This approach, however, does not support Rack Resiliency, because the services are statically assigned to specific nodes.
The Rack Resiliency solution distributes the Ceph services across multiple zones.
The storage nodes assigned to each service are selected using a round-robin distribution strategy across the zones (that is, the `rack` buckets), ensuring a balanced and fault-tolerant configuration.
The number of Ceph Monitor services deployed will be either three or five, depending on the total number of storage nodes and their distribution across `rack` buckets.
The above process ensures that the Ceph cluster remains operational in the event of a physical rack failure.
This distribution of the Ceph services is applied during Ceph zoning.
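To see where the Ceph Monitor, Manager, and MDS daemons are currently placed, the Ceph orchestrator listings can be consulted; for example, from a node with Ceph admin access:

```bash
# List the services managed by the Ceph orchestrator, including how many
# daemons each runs and their placement.
ceph orch ls
# Show the individual daemons; filtering on "mon" limits the output to the
# Monitor daemons and the hosts they run on.
ceph orch ps | grep mon
```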
(`ncn-msw#`) View Ceph zones.

```bash
ceph osd tree | grep rack
```
Example output:

```text
-9         13.97278     rack x3000
-11        13.97278     rack x3001
-13        13.97278     rack x3002
```
(`ncn-mw#`) List all configured zones.

```bash
cray rrs zones list --format toml
```
Example output:

```toml
[[Zones]]
Zone_Name = "x3000"
[Zones.Kubernetes_Topology_Zone]
Management_Master_Nodes = [ "ncn-m001",]
Management_Worker_Nodes = [ "ncn-w001", "ncn-w004",]
[Zones.CEPH_Zone]
Management_Storage_Nodes = [ "ncn-s001",]
[[Zones]]
Zone_Name = "x3001"
[Zones.Kubernetes_Topology_Zone]
Management_Master_Nodes = [ "ncn-m002",]
Management_Worker_Nodes = [ "ncn-w002",]
[Zones.CEPH_Zone]
Management_Storage_Nodes = [ "ncn-s003",]
[[Zones]]
Zone_Name = "x3002"
[Zones.Kubernetes_Topology_Zone]
Management_Master_Nodes = [ "ncn-m003",]
Management_Worker_Nodes = [ "ncn-w003",]
[Zones.CEPH_Zone]
Management_Storage_Nodes = [ "ncn-s002",]
```
(`ncn-mw#`) Get detailed information about a specific zone.

```bash
cray rrs zones describe <zone-id> --format toml
```
Example output:

```toml
Zone_Name = "x3000"
[Management_Master]
Count = 1
Type = "Kubernetes_Topology_Zone"
[[Management_Master.Nodes]]
name = "ncn-m001"
status = "Ready"
[Management_Worker]
Count = 2
Type = "Kubernetes_Topology_Zone"
[[Management_Worker.Nodes]]
name = "ncn-w001"
status = "Ready"
[[Management_Worker.Nodes]]
name = "ncn-w004"
status = "Ready"
[Management_Storage]
Count = 1
Type = "CEPH_Zone"
[[Management_Storage.Nodes]]
name = "ncn-s001"
status = "Ready"
[Management_Storage.Nodes.osds]
up = [ "osd.1", "osd.4", "osd.7", "osd.10", "osd.13", "osd.16", "osd.20", "osd.23",]
```
This command returns detailed information about the zone, including the Kubernetes and storage NCNs that belong to it, along with their statuses.