The Rack Resiliency Service (RRS) uses two Kubernetes ConfigMaps in the rack-resiliency namespace for monitoring and tracking the status of critical services.
The two ConfigMaps are named rrs-mon-static (also referred to as the static ConfigMap) and rrs-mon-dynamic (also referred to as the dynamic ConfigMap).
The dynamic ConfigMap contains information such as the status of critical services, zones, and nodes. It is populated and used by RRS internally.
During every monitoring interval, the following fields are updated for each critical service:

- status
  - Configured: All pods of the service are running.
  - PartiallyConfigured: At least one pod of the service is running, but not all.
  - Unconfigured: No pods of the service are running.
- balanced
  - true: The service is spread across zones.
  - false: The service is not spread across zones (that is, multiple replicas are in the same zone).

(ncn-mw#) View the data of the static ConfigMap used by the Rack Resiliency Service.
Because the critical-service-config.json data field contains a large JSON string, the data is most easily displayed in two parts.
Show every data field except critical-service-config.json.
kubectl get cm -n rack-resiliency rrs-mon-static -o jsonpath='{.data}' | jq 'del(."critical-service-config.json")'
Example output:
{
  "ceph_monitoring_polling_interval": "60",
  "ceph_monitoring_total_time": "600",
  "ceph_pre_monitoring_delay": "60",
  "default_message_level": "debug",
  "k8s_monitoring_polling_interval": "60",
  "k8s_monitoring_total_time": "600",
  "k8s_pre_monitoring_delay": "40",
  "last_updated_timestamp": "",
  "log_dir": "/var/log/rr",
  "unit_of_time": "seconds"
}
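When only one setting is of interest, the same ConfigMap can be queried for a single field. The following is a minimal sketch; the function name `rrs_static_field` is hypothetical and not part of RRS. It only works for field names without dots, because dots must be escaped in a kubectl JSONPath expression (as is done for critical-service-config.json elsewhere on this page).

```shell
# Hypothetical helper (not shipped with RRS): print one data field of the
# static ConfigMap. $1 is a field name such as log_dir or unit_of_time.
# Field names containing dots are not handled by this sketch.
rrs_static_field() {
    kubectl get cm -n rack-resiliency rrs-mon-static -o jsonpath="{.data.$1}"
    echo    # kubectl jsonpath output does not end with a newline
}

# Usage, on a node with kubectl access:
#   rrs_static_field k8s_monitoring_polling_interval
```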
Show the critical-service-config.json field.
kubectl get cm -n rack-resiliency rrs-mon-static -o jsonpath='{.data.critical-service-config\.json}' | jq
Truncated example output (the actual output will be larger):
{
  "critical_services": {
    "cilium-operator": {
      "namespace": "kube-system",
      "type": "Deployment"
    },
    "coredns": {
      "namespace": "kube-system",
      "type": "Deployment"
    },
    "...<output truncated>...",
    "cray-capmc": {
      "namespace": "services",
      "type": "Deployment"
    },
    "kube-proxy": {
      "namespace": "kube-system",
      "type": "StatefulSet"
    }
  }
}
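Because the full list is long, it can help to narrow it down by resource type. The sketch below assumes the JSON layout shown in the example output; the function name `rrs_services_of_type` is hypothetical. It reads the JSON on standard input, so it can be placed at the end of the command above in place of the bare `jq`.

```shell
# Hypothetical helper: read critical-service-config.json content on stdin
# and print the names of the services whose "type" equals $1.
rrs_services_of_type() {
    jq -r --arg t "$1" '
        .critical_services
        | to_entries[]
        | select(.value.type == $t)
        | .key
    '
}

# Usage, on a node with kubectl access:
#   kubectl get cm -n rack-resiliency rrs-mon-static \
#       -o jsonpath='{.data.critical-service-config\.json}' | rrs_services_of_type Deployment
```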
(ncn-mw#) View the data of the dynamic ConfigMap used by the Rack Resiliency Service.
The dynamic ConfigMap contains two fields, both of which contain large strings: critical-service-config.json contains a large JSON string and dynamic-data.yaml contains a large YAML string.
Show the critical-service-config.json field.
kubectl get cm -n rack-resiliency rrs-mon-dynamic -o jsonpath='{.data.critical-service-config\.json}' | jq
Truncated example output (the actual output will be larger):
{
  "critical_services": {
    "cilium-operator": {
      "namespace": "kube-system",
      "type": "Deployment",
      "status": "Configured",
      "balanced": "true"
    },
    "coredns": {
      "namespace": "kube-system",
      "type": "Deployment",
      "status": "Configured",
      "balanced": "true"
    },
    "...<output truncated>...",
    "cray-activemq-artemis-operator-controller-manager": {
      "namespace": "dvs",
      "type": "Deployment",
      "status": "Configured",
      "balanced": "true"
    },
    "kube-proxy": {
      "namespace": "kube-system",
      "type": "StatefulSet",
      "status": "Unconfigured",
      "balanced": "NA"
    }
  }
}
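In practice, the services of interest are usually the ones that are not fully configured or not balanced. The following sketch filters the JSON for those entries. It assumes the layout shown in the example output, including that balanced is stored as a string ("true", "false", or "NA" in the example) rather than a JSON boolean. The function name `rrs_unhealthy_services` is hypothetical.

```shell
# Hypothetical helper: read critical-service-config.json content from the
# dynamic ConfigMap on stdin and print one tab-separated line
# (name, namespace, status, balanced) for every service whose status is not
# "Configured" or whose balanced value is not the string "true".
rrs_unhealthy_services() {
    jq -r '
        .critical_services
        | to_entries[]
        | select(.value.status != "Configured" or .value.balanced != "true")
        | [.key, .value.namespace, .value.status, .value.balanced]
        | @tsv
    '
}

# Usage, on a node with kubectl access:
#   kubectl get cm -n rack-resiliency rrs-mon-dynamic \
#       -o jsonpath='{.data.critical-service-config\.json}' | rrs_unhealthy_services
```

With the example output above, only the kube-proxy entry would be printed, because every other service shown is Configured and balanced.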
Show the dynamic-data.yaml field.
kubectl get cm -n rack-resiliency rrs-mon-dynamic -o jsonpath='{.data.dynamic-data\.yaml}'
Truncated example output (the actual output will be larger):
cray_rrs_pod:
  node: ncn-w004
  rack: x3000c0s31b0n0
  zone: x3000
state:
  ceph_monitoring: ''
  k8s_monitoring: ''
  rms_state: Waiting
timestamps:
  end_timestamp_ceph_monitoring: ''
  end_timestamp_k8s_monitoring: ''
  init_timestamp: '2025-08-04T04:17:17Z'
  last_update_timestamp: '2025-08-07T02:56:28Z'
  start_timestamp_api: ''
  start_timestamp_ceph_monitoring: ''
  start_timestamp_k8s_monitoring: ''
  start_timestamp_rms: '2025-08-04T04:17:25Z'
zone:
  ceph_zones:
    x3000:
    - name: ncn-s001
      osds:
      - name: osd.1
        status: up
      - name: osd.4
        status: up
      - name: osd.7
        status: up
      - name: osd.10
        status: up
    #...output truncated...
  k8s_zones:
    x3000:
    - name: ncn-m001
      status: Ready
    - name: ncn-w001
      status: Ready
    - name: ncn-w004
      status: Ready
    #...output truncated...
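To spot Kubernetes nodes that RRS does not report as Ready, the YAML can be scanned for entries under k8s_zones. The sketch below uses awk and depends on the indented layout shown in the example output; a YAML-aware tool would be more robust if one is available. The function name `rrs_not_ready_nodes` is hypothetical.

```shell
# Hypothetical helper: read dynamic-data.yaml content on stdin and print
# "<node> <status>" for every node under k8s_zones whose status is not Ready.
# Assumes the layout shown in the example output (top-level keys unindented).
rrs_not_ready_nodes() {
    awk '
        /^[[:space:]]*k8s_zones:/ { in_k8s = 1; next }   # entering the k8s_zones section
        /^[^[:space:]#]/          { in_k8s = 0 }          # a new top-level key ends the section
        in_k8s && /- name:/       { name = $NF }          # remember the current node name
        in_k8s && /status:/ && $NF != "Ready" { print name, $NF }
    '
}

# Usage, on a node with kubectl access:
#   kubectl get cm -n rack-resiliency rrs-mon-dynamic \
#       -o jsonpath='{.data.dynamic-data\.yaml}' | rrs_not_ready_nodes
```

For the example output above, the function prints nothing, because all listed nodes are Ready.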
HPE provides a standard set of critical services that are needed for the successful execution of user jobs. However, it is possible to add critical services to the list. For more information on managing the critical services, see Manage Critical Service.