Follow these steps to enable (and optionally customize) Rack Resiliency.
(ncn-mw#) Retrieve the customizations.yaml file.
TMPDIR=$(mktemp -d -p ~) &&
kubectl get secrets -n loftsman site-init -o jsonpath='{.data.customizations\.yaml}' \
| base64 -d > "${TMPDIR}/customizations.yaml" \
&& echo "${TMPDIR}/customizations.yaml"
Example output:
/root/tmp.iM4FrDrJEJ/customizations.yaml
(ncn-mw#) Enable the feature in customizations.yaml.
yq write -i "${TMPDIR}/customizations.yaml" \
'spec.kubernetes.services.rack-resiliency.enabled' "true"
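Optionally, confirm the change in the local file before pushing it back to the cluster. Using the same yq (v3) syntax as elsewhere in this procedure, the following read should print true:
yq r "${TMPDIR}/customizations.yaml" 'spec.kubernetes.services.rack-resiliency.enabled'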
(ncn-mw#) Optionally, set custom zone name prefixes.
See Zone names for details on the reasons for setting prefixes and the restrictions on names. Prefixes are optional; however, they cannot be set, changed, or removed later.
Optionally, set a site-specific Kubernetes zone prefix.
In the following command, replace k8s-prefix-string with the desired Kubernetes zone prefix.
yq write -i "${TMPDIR}/customizations.yaml" \
'spec.kubernetes.services.rack-resiliency.k8s_zone_prefix' "k8s-prefix-string"
Optionally, set a site-specific Ceph zone prefix.
In the following command, replace ceph-prefix-string with the desired Ceph zone prefix.
yq write -i "${TMPDIR}/customizations.yaml" \
'spec.kubernetes.services.rack-resiliency.ceph_zone_prefix' "ceph-prefix-string"
(ncn-mw#) Update the site-init secret in the Kubernetes cluster.
kubectl delete secret -n loftsman site-init \
&& kubectl create secret -n loftsman generic site-init \
--from-file="${TMPDIR}/customizations.yaml"
Expected output:
secret/site-init created
(ncn-mw#) Confirm that the fields are set to the desired values.
kubectl get secrets -n loftsman site-init \
-o jsonpath='{.data.customizations\.yaml}' \
| base64 -d | yq r - 'spec.kubernetes.services.rack-resiliency'
Example output (in a case where only the Ceph zone prefix was set):
enabled: true
ceph_zone_prefix: my-ceph-prefix
Refer to Setup flows for information on the Ansible roles that are used to configure Rack Resiliency.
Because Rack Resiliency was disabled earlier and has just been enabled in the previous step,
CFS must now be configured to rerun the Ansible plays for Rack Resiliency. This is done using the
refresh_master_storage_rack_resiliency_config.py script.
This script applies the necessary configuration and sets up zones.
(ncn-mw#) Example usage:
/usr/share/doc/csm/scripts/operations/configuration/refresh_master_storage_rack_resiliency_config.py
Example output:
Checking if any Master NCN has the Rack Resiliency playbook layer...
✔ 3 Master NCN(s) have the Rack Resiliency playbook layer.
✔ 3 Storage NCN(s) have the Rack Resiliency playbook layer.
=== Processing Master NCNs ===
Updating 3 master CFS components...
✔ Master NCNs successfully updated.
=== Processing Storage NCNs ===
Updating 3 storage CFS components...
✔ Storage NCNs successfully updated.
All updates completed successfully. CFS batcher should soon reconfigure these NCNs.
SUCCESS
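The zoning set up by this configuration can be checked once CFS has reconfigured the NCNs. As an optional check, assuming zone placement is reflected in the standard topology.kubernetes.io/zone node label (this may vary by release), list the Kubernetes zone assignments with:
kubectl get nodes -L topology.kubernetes.io/zone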
(ncn-mw#) Verify that the cray-rrs Helm chart is present in the rack-resiliency namespace.
The cray-rrs Helm chart should already be installed in the rack-resiliency namespace.
Verify this by listing the Helm charts in the rack-resiliency namespace.
helm ls -n rack-resiliency
Example output:
NAME NAMESPACE REVISION UPDATED STATUS CHART APP VERSION
cray-rrs rack-resiliency 1 2025-09-26 21:43:12.5031915 +0000 UTC deployed cray-rrs-1.1.0 1.1.0
(ncn-mw#) List the resources in the rack-resiliency namespace.
kubectl get all -n rack-resiliency
In the command output, verify that the cray-rrs Pod, Service, Deployment, and ReplicaSet all exist.
The pod is expected to show Init:0/2 status at this point. For details on why the pod and deployment are not ready, see cray rrs pod is in init state.
Example output:
NAME READY STATUS RESTARTS AGE
pod/cray-rrs-86d4465c9d-qf6f5 0/2 Init:0/2 0 19h
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/cray-rrs ClusterIP 10.18.164.23 <none> 80/TCP,8551/TCP 19h
NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/cray-rrs 0/1 0 0 19h
NAME DESIRED CURRENT READY AGE
replicaset.apps/cray-rrs-86d4465c9d 1 0 0 19h
(ncn-mw#) Check the cluster policy.
kubectl get clusterpolicy insert-labels-topology-constraints
Ensure that the command output shows True in the READY column for insert-labels-topology-constraints.
Example output:
NAME ADMISSION BACKGROUND READY AGE MESSAGE
insert-labels-topology-constraints true true True 19h Ready
(ncn-mw#) Check the ConfigMaps.
kubectl get configmaps -n rack-resiliency
Verify that all of the ConfigMaps shown in the following example output are present on the system:
NAME DATA AGE
istio-ca-root-cert 1 11d
kube-root-ca.crt 1 11d
rrs-mon-dynamic 2 11d
rrs-mon-static 11 11d
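To inspect the monitoring configuration itself, dump the rrs-mon-static ConfigMap, which holds the static configuration including the monitored service list referenced later in this procedure; the exact keys may vary by release:
kubectl get configmap rrs-mon-static -n rack-resiliency -o yaml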
As specified in Policy details, the exclude section of the Kyverno policy must be removed when Rack Resiliency is enabled on a running system.
(ncn-mw#) Remove the exclude section:
kubectl patch clusterpolicy insert-labels-topology-constraints --type=json \
-p='[{"op": "remove", "path": "/spec/rules/0/exclude"}]'
Example output:
clusterpolicy.kyverno.io/insert-labels-topology-constraints patched
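To confirm that the exclude section is gone, query the first rule of the policy directly; empty output means the section has been removed:
kubectl get clusterpolicy insert-labels-topology-constraints \
-o jsonpath='{.spec.rules[0].exclude}'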
Perform a rollout restart of the critical services using the script rr_critical_service_restart.py.
The rr_critical_service_restart.py script performs a controlled restart of the services listed in the rrs-mon-static ConfigMap, in order to apply the Kubernetes label rrflag=rr-<service-name>.
It skips services already labeled, restarts the remaining services one-by-one, and waits for each restart to complete.
The script requires the insert-labels-topology-constraints cluster policy to be present before it proceeds.
Important: This step restarts critical services (including cilium-operator, coredns, and other essential CSM services).
While Kubernetes performs rolling restarts to maintain service availability, there may be brief
disruptions as pods are restarted. In-flight requests to these services may fail and require retry.
For information on how to identify all of the critical services, see
List services in ConfigMap.
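As an illustrative check (not part of the script), deployments that already carry the rrflag label, and will therefore be skipped by the script, can be listed by selecting on the presence of that label:
kubectl get deployments -A -l rrflag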
(ncn-mw#) Example usage of rr_critical_service_restart.py:
/usr/share/doc/csm/upgrade/scripts/k8s/rr_critical_service_restart.py
Truncated example output (the actual output will be larger):
Restarted deployment/cilium-operator in namespace kube-system
Restarted deployment/coredns in namespace kube-system
Skipping deployment/cray-activemq-artemis-operator-controller-manager: 'rrflag' label is already set in namespace dvs
Skipping deployment/cray-capmc: 'rrflag' label is already set in namespace services
Skipping deployment/cray-ceph-csi-cephfs-provisioner: 'rrflag' label is already set in namespace ceph-cephfs
Skipping deployment/cray-ceph-csi-rbd-provisioner: 'rrflag' label is already set in namespace ceph-rbd
Skipping deployment/cray-certmanager-cert-manager: 'rrflag' label is already set in namespace cert-manager
Skipping deployment/cray-certmanager-cert-manager-cainjector: 'rrflag' label is already set in namespace cert-manager
...
Skipping deployment/slurmdbd-backup: 'rrflag' label is already set in namespace user
Skipping deployment/sshot-net-operator: 'rrflag' label is already set in namespace sshot-net-operator
RR critical services rollout restart successful.
configmap/rrs-mon-dynamic patched (no change)
Set rollout_complete=true in ConfigMap 'rrs-mon-dynamic'
Done!
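Optionally, confirm the rollout_complete flag that the script reports setting by dumping the rrs-mon-dynamic ConfigMap (the exact key layout may differ by release):
kubectl get configmap rrs-mon-dynamic -n rack-resiliency -o yaml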
This step verifies that the cray-rrs pod has transitioned to the Ready state. The pod performs initialization checks
to ensure that the critical service rollout restart is completed and the required configuration is available.
These checks are performed periodically, so the pod may remain in Init state for a short time after the previous steps are completed.
(ncn-mw#) List the resources in the rack-resiliency namespace:
kubectl get all -n rack-resiliency
Wait for the pod to transition to the Ready state. This typically takes 1-2 minutes. If the pod is still in Init:0/2 state,
then wait and retry the command until all of the following are true in the command output:
- The pod shows Ready in the STATUS column.
- The pod shows 2/2 in the READY column.
- The deployment shows 1/1 in the READY column.
Example output when ready:
NAME READY STATUS RESTARTS AGE
pod/cray-rrs-86d4465c9d-qf6f5 2/2 Ready 0 19h
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/cray-rrs ClusterIP 10.18.164.23 <none> 80/TCP,8551/TCP 19h
NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/cray-rrs 1/1 1 1 19h
NAME DESIRED CURRENT READY AGE
replicaset.apps/cray-rrs-86d4465c9d 1 1 1 19h
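Rather than polling manually, kubectl wait can be used to block until every pod in the namespace reports Ready; adjust the timeout as needed:
kubectl wait --for=condition=Ready pod --all -n rack-resiliency --timeout=10m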
(ncn-mw#) If the pod remains in Init:0/2 state for longer than a few minutes, this may indicate a configuration issue.
Check the pod logs to investigate:
kubectl logs -n rack-resiliency <pod-name> -c <init-container-name>
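If the init container names are not known, list them from the pod spec first:
kubectl get pod <pod-name> -n rack-resiliency -o jsonpath='{.spec.initContainers[*].name}'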
For troubleshooting assistance, see cray rrs pod is in init state.