This page contains general Rack Resiliency troubleshooting topics.
(`ncn-mw#`) The Cray CLI is used to interact with multiple components of Rack Resiliency. Use the following command for usage information:

```bash
cray rrs --help
```
If a critical service with a type other than `Deployment` or `StatefulSet` is added through the Cray CLI, then an error is returned.
(`ncn-mw#`) For example:

```bash
cray rrs criticalservices update --from-file file.json
```
Example output:

```text
Usage: cray rrs criticalservices update [OPTIONS]
Try 'cray rrs criticalservices update -h' for help.

Error: Bad Request: Invalid request body: 1 validation error for ValidateCriticalServiceCmStaticType
critical_service_cm_static_type.critical_services.kube-proxy.type
  Input should be 'Deployment' or 'StatefulSet' [type=literal_error, input_value='DaemonSet', input_type=str]
    For further information visit https://errors.pydantic.dev/2.11/v/literal_error
```
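For reference, the following is a minimal sketch of an input file that would trigger this error. The field layout is inferred from the error message above; the real schema may require additional fields, and the `namespace` value shown here is illustrative only.

```bash
# Hypothetical input file (for illustration only); 'DaemonSet' is not an accepted type.
cat > file.json <<'EOF'
{
  "critical_services": {
    "kube-proxy": {
      "namespace": "kube-system",
      "type": "DaemonSet"
    }
  }
}
EOF

# Rejected: the type must be 'Deployment' or 'StatefulSet'.
cray rrs criticalservices update --from-file file.json
```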
To monitor and debug RMS, check the logs of the `cray-rrs` Kubernetes pod running in the `rack-resiliency` namespace. Follow the steps below:
(`ncn-mw#`) Get the `cray-rrs` pod name.

```bash
RRS_POD=$(kubectl get pods -n rack-resiliency \
    -l app.kubernetes.io/instance=cray-rrs \
    -o custom-columns=:.metadata.name \
    --no-headers); echo "${RRS_POD}"
```
(`ncn-mw#`) View its RMS container logs.

```bash
kubectl logs "${RRS_POD}" -c cray-rrs-rms -n rack-resiliency
```
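When scanning a long log, it can help to filter for warnings and errors first. A simple sketch using the same command with `grep`:

```bash
# Show only WARNING and ERROR entries from the RMS container logs.
kubectl logs "${RRS_POD}" -c cray-rrs-rms -n rack-resiliency | grep -E "WARNING|ERROR"
```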
Example log entry for a state change notification from the Hardware Management Notification Fanout Daemon (HMNFD):

```text
2025-06-26 12:49:59,725 - INFO in rms - Notification received from HMNFD
2025-06-26 12:49:59,725 - WARNING in rms - Components '['x3000c0s11b0n0']' are changed to Off state.
```
Example log entry reporting a node being down:

```text
2025-06-26 12:49:59,997 - INFO in rms - Some nodes in rack x3000 are down. Failed nodes: ['x3000c0s11b0n0']
```
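(`ncn-mw#`) To cross-check the state of a reported component in the Hardware State Manager (HSM), a query such as the following can be used (a sketch; substitute the xname from the log entry):

```bash
# Query HSM for the current state of the component reported as failed.
cray hsm state components describe x3000c0s11b0n0
```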
Example log entry reporting a rack health issue:

```text
2025-06-26 12:49:59,997 - INFO in rms - All the nodes in the rack x3000 are not healthy - RACK FAILURE
```
Example log entries reporting Ceph status:

```text
...
2025-06-26 12:51:03,661 - WARNING in lib_rms - 1 out of 3 ceph nodes are not healthy
2025-06-26 12:51:05,069 - WARNING in lib_rms - CEPH is not healthy with health status as HEALTH_WARN
2025-06-26 12:51:05,069 - WARNING in lib_rms - CEPH PGs are in degraded state, but recovery is not happening
2025-06-26 12:51:06,341 - WARNING in lib_rms - Service alertmanager running on ncn-s002 is in host is offline state
2025-06-26 12:51:06,341 - WARNING in lib_rms - Service crash running on ncn-s002 is in host is offline state
2025-06-26 12:51:06,341 - WARNING in lib_rms - Service mds.admin-tools running on ncn-s002 is in host is offline state
2025-06-26 12:51:06,342 - WARNING in lib_rms - Service mds.cephfs running on ncn-s002 is in host is offline state
2025-06-26 12:51:06,342 - WARNING in lib_rms - Service mgr running on ncn-s002 is in host is offline state
2025-06-26 12:51:06,342 - WARNING in lib_rms - Service mon running on ncn-s002 is in host is offline state
2025-06-26 12:51:06,342 - WARNING in lib_rms - Service node-exporter running on ncn-s002 is in host is offline state
2025-06-26 12:51:06,342 - WARNING in lib_rms - Service osd.all-available-devices running on ncn-s002 is in host is offline state
2025-06-26 12:51:06,342 - WARNING in lib_rms - Service osd.all-available-devices running on ncn-s002 is in host is offline state
2025-06-26 12:51:06,342 - WARNING in lib_rms - Service osd.all-available-devices running on ncn-s002 is in host is offline state
2025-06-26 12:51:06,342 - WARNING in lib_rms - Service osd.all-available-devices running on ncn-s002 is in host is offline state
2025-06-26 12:51:06,342 - WARNING in lib_rms - Service osd.all-available-devices running on ncn-s002 is in host is offline state
2025-06-26 12:51:06,342 - WARNING in lib_rms - Service osd.all-available-devices running on ncn-s002 is in host is offline state
2025-06-26 12:51:06,342 - WARNING in lib_rms - Service osd.all-available-devices running on ncn-s002 is in host is offline state
2025-06-26 12:51:06,342 - WARNING in lib_rms - Service osd.all-available-devices running on ncn-s002 is in host is offline state
2025-06-26 12:51:06,342 - WARNING in lib_rms - Service prometheus running on ncn-s002 is in host is offline state
2025-06-26 12:51:06,342 - WARNING in lib_rms - Service rgw.site1 running on ncn-s002 is in host is offline state
```
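When RMS reports Ceph warnings like these, the Ceph status can be confirmed directly. A minimal cross-check, assuming it is run from a node with the Ceph admin keyring (for example, a master or storage NCN):

```bash
# Summarize overall cluster health, including degraded placement groups.
ceph -s

# Show the status of Ceph daemons per host (mon, mgr, osd, mds, rgw, and so on).
ceph orch ps
```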
Example log entry reporting an imbalanced service:

```text
2025-06-30 07:02:36,235 - WARNING in lib_rms - list of imbalanced services are - ['istiod']
```
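(`ncn-mw#`) To see how the pods of a reported service are spread across nodes (and therefore across racks), list their placement with `kubectl`. A sketch for the `istiod` example above, assuming the standard `app=istiod` pod label:

```bash
# Show which node each istiod pod is scheduled on.
kubectl get pods -n istio-system -l app=istiod -o wide
```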
Example log entries reporting on the status of a service:

```text
2025-06-30 07:02:34,906 - WARNING in lib_rms - Deployment 'cray-capmc' in namespace 'services' is not ready. Only 1 replicas are ready out of 3 desired replicas
2025-06-30 07:02:35,036 - WARNING in lib_rms - StatefulSet 'cray-console-data-postgres' in namespace 'services' is not ready. Only 2 replicas are ready out of 3 desired replicas
2025-06-30 07:02:35,057 - WARNING in lib_rms - StatefulSet 'cray-console-node' in namespace 'services' is not ready. Only 1 replicas are ready out of 2 desired replicas
2025-06-30 07:02:35,118 - WARNING in lib_rms - StatefulSet 'cray-dhcp-kea-postgres' in namespace 'services' is not ready. Only 2 replicas are ready out of 3 desired replicas
2025-06-30 07:02:35,249 - WARNING in lib_rms - StatefulSet 'cray-hbtd-bitnami-etcd' in namespace 'services' is not ready. Only 2 replicas are ready out of 3 desired replicas
2025-06-30 07:02:35,291 - WARNING in lib_rms - StatefulSet 'cray-hmnfd-bitnami-etcd' in namespace 'services' is not ready. Only 2 replicas are ready out of 3 desired replicas
2025-06-30 07:02:35,314 - WARNING in lib_rms - StatefulSet 'cray-keycloak' in namespace 'services' is not ready. Only 2 replicas are ready out of 3 desired replicas
2025-06-30 07:02:35,418 - WARNING in lib_rms - StatefulSet 'cray-power-control-bitnami-etcd' in namespace 'services' is not ready. Only 2 replicas are ready out of 3 desired replicas
2025-06-30 07:02:35,541 - WARNING in lib_rms - StatefulSet 'cray-spire-postgres' in namespace 'spire' is not ready. Only 2 replicas are ready out of 3 desired replicas
2025-06-30 07:02:35,562 - WARNING in lib_rms - StatefulSet 'cray-spire-server' in namespace 'spire' is not ready. Only 2 replicas are ready out of 3 desired replicas
2025-06-30 07:02:35,601 - WARNING in lib_rms - StatefulSet 'cray-vault' in namespace 'vault' is not ready. Only 2 replicas are ready out of 3 desired replicas
2025-06-30 07:02:35,700 - WARNING in lib_rms - StatefulSet 'hpe-slingshot-vnid' in namespace 'services' is not ready. Only 2 replicas are ready out of 3 desired replicas
2025-06-30 07:02:35,830 - WARNING in lib_rms - Deployment 'istiod' in namespace 'istio-system' is not ready. Only 3 replicas are ready out of 8 desired replicas
2025-06-30 07:02:35,851 - WARNING in lib_rms - StatefulSet 'keycloak-postgres' in namespace 'services' is not ready. Only 2 replicas are ready out of 3 desired replicas
2025-06-30 07:02:36,141 - WARNING in lib_rms - StatefulSet 'slurmdb-pxc' in namespace 'user' is not ready. Only 2 replicas are ready out of 3 desired replicas
2025-06-30 07:02:36,234 - WARNING in lib_rms - list of partially configured services are - ['cray-capmc', 'cray-console-data-postgres', 'cray-console-node', 'cray-dhcp-kea-postgres', 'cray-hbtd-bitnami-etcd', 'cray-hmnfd-bitnami-etcd', 'cray-keycloak', 'cray-power-control-bitnami-etcd', 'cray-spire-postgres', 'cray-spire-server', 'cray-vault', 'hpe-slingshot-vnid', 'istiod', 'keycloak-postgres', 'slurmdb-pxc']
2025-06-30 07:02:36,235 - WARNING in lib_rms - list of unconfigured services are - ['cilium-operator', 'cray-dvs-mqtt-ss', 'kyverno-cleanup-controller', 'kyverno-reports-controller', 'k8s-zone-api', 'kube-multus-ds']
```
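(`ncn-mw#`) These entries can be cross-checked against the replica counts that Kubernetes currently reports. For example, for the `cray-capmc` Deployment and the `cray-console-node` StatefulSet mentioned above:

```bash
# Compare ready versus desired replicas for a reported Deployment and StatefulSet.
kubectl get deployment cray-capmc -n services
kubectl get statefulset cray-console-node -n services
```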
Example log entry reporting that a critical service was not found:

```text
2025-06-30 07:02:36,233 - ERROR in lib_rms - Error fetching StatefulSet kube-multus-ds: Not Found
```
Example log entry reporting a failure to register with HMNFD:

```text
[2025-05-26 11:49:25,744] ERROR in rms: Attempt 1 : Failed to fetch subscription list from hmnfd. Error: 503 Server Error: Service Unavailable for url: https://api-gw-service-nmn.local/apis/hmnfd/hmi/v2/subscriptions
```
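(`ncn-mw#`) A `503 Service Unavailable` response from the API gateway usually indicates that HMNFD itself is not ready. As a quick sketch (assuming HMNFD runs in the `services` namespace), check its pods:

```bash
# Check whether the HMNFD pods are running and ready.
kubectl get pods -n services | grep hmnfd
```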
To find the RMS startup time, the timestamp of the last monitoring cycle, the polling intervals, and the configured critical services, view the ConfigMap. This helps in understanding the configuration parameters that control RMS behavior.

**Note:** It is recommended not to modify these configuration parameters without consulting HPE support.
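(`ncn-mw#`) A minimal sketch for locating and viewing the ConfigMap, assuming it is stored in the `rack-resiliency` namespace (the exact ConfigMap name may differ between releases):

```bash
# List the ConfigMaps in the rack-resiliency namespace.
kubectl get configmaps -n rack-resiliency

# View a specific ConfigMap; replace <configmap-name> with a name from the list above.
kubectl get configmap <configmap-name> -n rack-resiliency -o yaml
```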
The health of the critical services can be checked by listing and describing them using the RRS API or CLI. See Manage Critical Services.
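(`ncn-mw#`) For example, a sketch using the Cray CLI; the `list` and `describe` subcommand names are assumed from the `cray rrs criticalservices` command group shown earlier:

```bash
# List the configured critical services (assumed subcommand).
cray rrs criticalservices list

# Describe a single critical service, for example cray-capmc (assumed subcommand).
cray rrs criticalservices describe cray-capmc
```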