This page contains general Rack Resiliency troubleshooting topics.
(ncn-mw#) The Cray CLI is used to interact with multiple components of Rack Resiliency. Use the following command for usage information:
cray rrs --help
If a new critical service with a type other than ‘Deployment’ or ‘StatefulSet’ is added through the Cray CLI, an error is returned.
(ncn-mw#) For example:
cray rrs criticalservices update --from-file file.json
Example output:
Usage: cray rrs criticalservices update [OPTIONS]
Try 'cray rrs criticalservices update -h' for help.
Error: Bad Request: Invalid request body: 1 validation error for ValidateCriticalServiceCmStaticType
critical_service_cm_static_type.critical_services.kube-proxy.type
Input should be 'Deployment' or 'StatefulSet' [type=literal_error, input_value='DaemonSet', input_type=str]
For further information visit https://errors.pydantic.dev/2.11/v/literal_error
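For illustration only, a request body with the structure below (inferred from the validation error path critical_service_cm_static_type.critical_services.kube-proxy.type; the full schema may include additional fields) would trigger this error, because the type field is set to DaemonSet rather than Deployment or StatefulSet:
(ncn-mw#) Create a hypothetical file.json that declares an unsupported service type.
cat > file.json <<'EOF'
{
  "critical_services": {
    "kube-proxy": {
      "type": "DaemonSet"
    }
  }
}
EOF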
To monitor and debug RMS, check the logs of the cray-rrs Kubernetes pod running in the rack-resiliency namespace. Follow the steps below:
(ncn-mw#) Get the cray-rrs pod name.
RRS_POD=$(kubectl get pods -n rack-resiliency \
-l app.kubernetes.io/instance=cray-rrs \
-o custom-columns=:.metadata.name \
--no-headers); echo "${RRS_POD}"
(ncn-mw#) View its RMS container logs.
kubectl logs "${RRS_POD}" -c cray-rrs-rms -n rack-resiliency
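To stream new log messages as events are processed, or to limit the output to a recent window, the standard kubectl logs options can be added; for example:
(ncn-mw#) Follow the RMS logs, showing only the last hour.
kubectl logs -f --since=1h "${RRS_POD}" -c cray-rrs-rms -n rack-resiliency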
Example log entry for a state change notification from the Hardware Management Notification Fanout Daemon (HMNFD):
2025-06-26 12:49:59,725 - INFO in rms - Notification received from HMNFD
2025-06-26 12:49:59,725 - WARNING in rms - Components '['x3000c0s11b0n0']' are changed to Off state.
Example log entry reporting a node being down:
2025-06-26 12:49:59,997 - INFO in rms - Some nodes in rack x3000 are down. Failed nodes: ['x3000c0s11b0n0']
Example log entry reporting a rack health issue:
2025-06-26 12:49:59,997 - INFO in rms - All the nodes in the rack x3000 are not healthy - RACK FAILURE
Example log entries reporting Ceph status:
...
2025-06-26 12:51:03,661 - WARNING in lib_rms - 1 out of 3 ceph nodes are not healthy
2025-06-26 12:51:05,069 - WARNING in lib_rms - CEPH is not healthy with health status as HEALTH_WARN
2025-06-26 12:51:05,069 - WARNING in lib_rms - CEPH PGs are in degraded state, but recovery is not happening
2025-06-26 12:51:06,341 - WARNING in lib_rms - Service alertmanager running on ncn-s002 is in host is offline state
2025-06-26 12:51:06,341 - WARNING in lib_rms - Service crash running on ncn-s002 is in host is offline state
2025-06-26 12:51:06,341 - WARNING in lib_rms - Service mds.admin-tools running on ncn-s002 is in host is offline state
2025-06-26 12:51:06,342 - WARNING in lib_rms - Service mds.cephfs running on ncn-s002 is in host is offline state
2025-06-26 12:51:06,342 - WARNING in lib_rms - Service mgr running on ncn-s002 is in host is offline state
2025-06-26 12:51:06,342 - WARNING in lib_rms - Service mon running on ncn-s002 is in host is offline state
2025-06-26 12:51:06,342 - WARNING in lib_rms - Service node-exporter running on ncn-s002 is in host is offline state
2025-06-26 12:51:06,342 - WARNING in lib_rms - Service osd.all-available-devices running on ncn-s002 is in host is offline state
2025-06-26 12:51:06,342 - WARNING in lib_rms - Service osd.all-available-devices running on ncn-s002 is in host is offline state
2025-06-26 12:51:06,342 - WARNING in lib_rms - Service osd.all-available-devices running on ncn-s002 is in host is offline state
2025-06-26 12:51:06,342 - WARNING in lib_rms - Service osd.all-available-devices running on ncn-s002 is in host is offline state
2025-06-26 12:51:06,342 - WARNING in lib_rms - Service osd.all-available-devices running on ncn-s002 is in host is offline state
2025-06-26 12:51:06,342 - WARNING in lib_rms - Service osd.all-available-devices running on ncn-s002 is in host is offline state
2025-06-26 12:51:06,342 - WARNING in lib_rms - Service osd.all-available-devices running on ncn-s002 is in host is offline state
2025-06-26 12:51:06,342 - WARNING in lib_rms - Service osd.all-available-devices running on ncn-s002 is in host is offline state
2025-06-26 12:51:06,342 - WARNING in lib_rms - Service prometheus running on ncn-s002 is in host is offline state
2025-06-26 12:51:06,342 - WARNING in lib_rms - Service rgw.site1 running on ncn-s002 is in host is offline state
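These warnings can be corroborated with the Ceph CLI from a node that has Ceph admin credentials (typically a storage node); for example:
(ncn-s#) Check the overall Ceph health and the state of the individual Ceph daemons.
ceph -s
ceph orch ps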
Example log entry reporting an imbalanced service:
2025-06-30 07:02:36,235 - WARNING in lib_rms - list of imbalanced services are - ['istiod']
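To see how the replicas of an imbalanced service are distributed across nodes, list its pods with the node column included; for example, for istiod (which runs in the istio-system namespace, as shown in the log entries further below):
(ncn-mw#) List the istiod pods and the nodes they are scheduled on.
kubectl get pods -n istio-system -o wide | grep istiod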
Example log entries reporting on the status of a service:
2025-06-30 07:02:34,906 - WARNING in lib_rms - Deployment 'cray-capmc' in namespace 'services' is not ready. Only 1 replicas are ready out of 3 desired replicas
2025-06-30 07:02:35,036 - WARNING in lib_rms - StatefulSet 'cray-console-data-postgres' in namespace 'services' is not ready. Only 2 replicas are ready out of 3 desired replicas
2025-06-30 07:02:35,057 - WARNING in lib_rms - StatefulSet 'cray-console-node' in namespace 'services' is not ready. Only 1 replicas are ready out of 2 desired replicas
2025-06-30 07:02:35,118 - WARNING in lib_rms - StatefulSet 'cray-dhcp-kea-postgres' in namespace 'services' is not ready. Only 2 replicas are ready out of 3 desired replicas
2025-06-30 07:02:35,249 - WARNING in lib_rms - StatefulSet 'cray-hbtd-bitnami-etcd' in namespace 'services' is not ready. Only 2 replicas are ready out of 3 desired replicas
2025-06-30 07:02:35,291 - WARNING in lib_rms - StatefulSet 'cray-hmnfd-bitnami-etcd' in namespace 'services' is not ready. Only 2 replicas are ready out of 3 desired replicas
2025-06-30 07:02:35,314 - WARNING in lib_rms - StatefulSet 'cray-keycloak' in namespace 'services' is not ready. Only 2 replicas are ready out of 3 desired replicas
2025-06-30 07:02:35,418 - WARNING in lib_rms - StatefulSet 'cray-power-control-bitnami-etcd' in namespace 'services' is not ready. Only 2 replicas are ready out of 3 desired replicas
2025-06-30 07:02:35,541 - WARNING in lib_rms - StatefulSet 'cray-spire-postgres' in namespace 'spire' is not ready. Only 2 replicas are ready out of 3 desired replicas
2025-06-30 07:02:35,562 - WARNING in lib_rms - StatefulSet 'cray-spire-server' in namespace 'spire' is not ready. Only 2 replicas are ready out of 3 desired replicas
2025-06-30 07:02:35,601 - WARNING in lib_rms - StatefulSet 'cray-vault' in namespace 'vault' is not ready. Only 2 replicas are ready out of 3 desired replicas
2025-06-30 07:02:35,700 - WARNING in lib_rms - StatefulSet 'hpe-slingshot-vnid' in namespace 'services' is not ready. Only 2 replicas are ready out of 3 desired replicas
2025-06-30 07:02:35,830 - WARNING in lib_rms - Deployment 'istiod' in namespace 'istio-system' is not ready. Only 3 replicas are ready out of 8 desired replicas
2025-06-30 07:02:35,851 - WARNING in lib_rms - StatefulSet 'keycloak-postgres' in namespace 'services' is not ready. Only 2 replicas are ready out of 3 desired replicas
2025-06-30 07:02:36,141 - WARNING in lib_rms - StatefulSet 'slurmdb-pxc' in namespace 'user' is not ready. Only 2 replicas are ready out of 3 desired replicas
2025-06-30 07:02:36,234 - WARNING in lib_rms - list of partially configured services are - ['cray-capmc', 'cray-console-data-postgres', 'cray-console-node', 'cray-dhcp-kea-postgres', 'cray-hbtd-bitnami-etcd', 'cray-hmnfd-bitnami-etcd', 'cray-keycloak', 'cray-power-control-bitnami-etcd', 'cray-spire-postgres', 'cray-spire-server', 'cray-vault', 'hpe-slingshot-vnid', 'istiod', 'keycloak-postgres', 'slurmdb-pxc']
2025-06-30 07:02:36,235 - WARNING in lib_rms - list of unconfigured services are - ['cilium-operator', 'cray-dvs-mqtt-ss', 'kyverno-cleanup-controller', 'kyverno-reports-controller', 'k8s-zone-api', 'kube-multus-ds']
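The readiness of any service reported above can be verified directly with kubectl, using the workload type and namespace from the corresponding log entry; for example:
(ncn-mw#) Check replica readiness for one of the reported Deployments and StatefulSets.
kubectl get deployment cray-capmc -n services
kubectl get statefulset cray-vault -n vault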
Example log entry reporting that a critical service was not found:
2025-06-30 07:02:36,233 - ERROR in lib_rms - Error fetching StatefulSet kube-multus-ds: Not Found
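This usually means the service does not exist as the workload type RMS is looking for (here, a StatefulSet). The workload type that actually backs the service can be checked with kubectl:
(ncn-mw#) Determine the actual workload type of the service.
kubectl get deployment,statefulset,daemonset -A | grep kube-multus-ds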
Example log entry reporting a failure to register with HMNFD:
[2025-05-26 11:49:25,744] ERROR in rms: Attempt 1 : Failed to fetch subscription list from hmnfd. Error: 503 Server Error: Service Unavailable for url: https://api-gw-service-nmn.local/apis/hmnfd/hmi/v2/subscriptions
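A 503 response usually indicates that HMNFD itself is unavailable. The HMNFD pods (which run in the services namespace) can be checked directly:
(ncn-mw#) Check the status of the HMNFD pods.
kubectl get pods -n services | grep hmnfd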
To view the RMS startup time, the timestamp of the last monitoring cycle, the polling intervals, and the configured critical services, inspect the RRS ConfigMaps in the rack-resiliency namespace. These configuration parameters control RMS behavior.
Note: It is recommended not to modify these configuration parameters without consulting HPE support.
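Based on the ConfigMap names that appear in the RRS logs (rrs-mon-dynamic and rrs-mon-static), the parameters can be inspected with kubectl; exactly which values are kept in which ConfigMap may differ:
(ncn-mw#) List the RRS ConfigMaps and view their contents.
kubectl get configmaps -n rack-resiliency
kubectl get configmap rrs-mon-dynamic -n rack-resiliency -o yaml
kubectl get configmap rrs-mon-static -n rack-resiliency -o yaml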
The health of the critical services can be checked by listing and describing them using the RRS API or CLI. See Manage Critical Services.
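For example, assuming list and describe subcommands matching the operations mentioned above (run cray rrs criticalservices --help for the exact syntax):
(ncn-mw#) List the critical services, then describe a specific one.
cray rrs criticalservices list
cray rrs criticalservices describe <service-name>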
cray-rrs pod is in Init state
After the Rack Resiliency chart is deployed, the cray-rrs pod can remain in the Init state.
This is expected behavior: the cray-rrs deployment waits until all three of the following conditions are met: Rack Resiliency is enabled in customizations.yaml, Kubernetes topology zones are created, and CEPH zones are created.
(ncn-mw#) Check the status of the pod for the cray-rrs deployment:
kubectl get pod -n rack-resiliency
NAME READY STATUS RESTARTS AGE
cray-rrs-6c5585cfdf-lmctt 0/2 Init:0/2 0 6d5h
(ncn-mw#) Confirm whether Rack Resiliency is enabled by checking the logs of the cray-rrs-check container.
kubectl logs -n rack-resiliency cray-rrs-6c5585cfdf-lmctt cray-rrs-check
2025-09-18 22:06:22,589 - INFO in wait: Checking Rack Resiliency enablement and Kubernetes/CEPH zone creation...
2025-09-18 22:06:22,815 - INFO in wait: 'spec.kubernetes.services.rack-resiliency.enabled' value in customizations.yaml is: False
2025-09-18 22:06:22,817 - INFO in wait: Rack Resiliency is disabled.
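The current setting can be checked in customizations.yaml, which on CSM systems is typically stored in the site-init secret in the loftsman namespace (storage location assumed here):
(ncn-mw#) Extract customizations.yaml and check the rack-resiliency setting.
kubectl get secret -n loftsman site-init -o jsonpath='{.data.customizations\.yaml}' | base64 -d | grep -A 2 'rack-resiliency:'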
(ncn-mw#) Confirm whether the Kubernetes and CEPH zones have been created by checking the logs of the cray-rrs-check container.
kubectl logs -n rack-resiliency cray-rrs-6c5585cfdf-lmctt cray-rrs-check
2025-09-29 15:20:19,675 - INFO in wait: Checking Rack Resiliency enablement and Kubernetes/CEPH zone creation...
2025-09-29 15:20:19,884 - INFO in wait: 'spec.kubernetes.services.rack-resiliency.enabled' value in customizations.yaml is: True
2025-09-29 15:20:19,885 - INFO in wait: Rack resiliency is enabled.
2025-09-29 15:20:19,885 - INFO in wait: Checking zoning for Kubernetes and CEPH nodes...
2025-09-29 15:20:19,964 - ERROR in lib_rms: No K8s topology zone present
2025-09-29 15:20:19,966 - INFO in wait: Kubernetes zones are not created.
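Zone creation can also be checked directly. A minimal check, assuming Kubernetes zones are reflected by the standard topology.kubernetes.io/zone node label and CEPH zones by the CRUSH hierarchy:
(ncn-mw#) Show the zone label assigned to each Kubernetes node.
kubectl get nodes -L topology.kubernetes.io/zone
(ncn-s#) Show the CEPH CRUSH tree, which reflects CEPH zoning.
ceph osd tree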
(ncn-mw#) The status of ConfigMap initialization can be confirmed by checking the logs of the cray-rrs-init container.
kubectl logs -n rack-resiliency cray-rrs-6c5585cfdf-lmctt cray-rrs-init | grep ConfigMap
2025-09-30 07:26:28,705 - WARNING in lib_configmap: Lock ConfigMap rrs-mon-dynamic-lock does not exist in namespace rack-resiliency; nothing to release
2025-09-30 07:26:28,717 - WARNING in lib_configmap: Lock ConfigMap rrs-mon-static-lock does not exist in namespace rack-resiliency; nothing to release
2025-09-30 07:26:28,718 - INFO in lib_configmap: [ad365f4c] Fetching ConfigMap rrs-mon-dynamic from namespace rack-resiliency
2025-09-30 07:26:28,770 - INFO in lib_configmap: Updating ConfigMap rrs-mon-dynamic in namespace rack-resiliency
2025-09-30 07:26:28,810 - WARNING in lib_configmap: Lock ConfigMap rrs-mon-dynamic-lock does not exist in namespace rack-resiliency; nothing to release
2025-09-30 07:26:28,811 - INFO in lib_configmap: ConfigMap rrs-mon-dynamic in namespace rack-resiliency updated successfully
2025-09-30 07:26:34,293 - INFO in lib_configmap: [85f28042] Fetching ConfigMap rrs-mon-static from namespace rack-resiliency
2025-09-30 07:26:34,304 - ERROR in lib_configmap: [85f28042] API error fetching ConfigMap
If Rack Resiliency is enabled and nodes are physically moved from one rack to another using the documented procedure, always perform a rollout restart of the cray-rrs deployment afterward.
(ncn-mw#) Restart the cray-rrs deployment.
kubectl rollout restart deployment -n rack-resiliency cray-rrs
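The restart can be verified with the standard rollout status command:
(ncn-mw#) Verify that the restarted deployment rolls out successfully.
kubectl rollout status deployment -n rack-resiliency cray-rrs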