The Resiliency Monitoring Service (RMS) is part of the Rack Resiliency Service (RRS). RMS runs alongside the RR API service inside the `cray-rrs` pod.
The RMS continuously monitors the health and availability of critical services, management nodes, and Ceph utility storage. The RMS uses the following components to provide its functionality:
RMS operates using two primary flows: the Control loop and the Monitoring events.
The Control loop ensures that the monitoring infrastructure is properly initialized and maintained. It performs the following tasks:
Triggered upon receiving a notification from HMNFD, a Monitoring event performs targeted analysis and response actions. These actions include:
When the RMS receives a notification from HMNFD, it immediately triggers an RMS Monitoring event. Because the monitoring event is not instantaneous, there may be a short delay before the results of the event are reflected in the Rack Resiliency API and CLI responses.
Some incidents on a system, such as a change in the health of a critical service pod, do not result in HMNFD notifications. Because these incidents do not trigger a monitoring event, there is a longer delay before the RMS notices them; it may take up to 10 minutes for such changes to be reflected in the Rack Resiliency API and CLI responses.
The same delay is a factor when an administrator makes changes to the critical services list (see Manage Critical Services). It may take up to 10 minutes before the RMS reads in the updated critical services data.
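The difference between the two detection paths can be illustrated with a small sketch. This is not the RMS implementation; the class name `MonitorTrigger`, its methods, and the 600-second interval (chosen to match the "up to 10 minutes" bound above) are assumptions made for illustration. It shows why a change that produces a notification is picked up almost immediately, while a change that produces none waits for the next periodic pass.

```python
import threading

# Assumption: a 600 s period, matching the "up to 10 minutes" bound in the text.
POLL_INTERVAL_SECONDS = 600


class MonitorTrigger:
    """Hypothetical helper: wake on a notification or when the period elapses."""

    def __init__(self, interval=POLL_INTERVAL_SECONDS):
        self._event = threading.Event()
        self._interval = interval

    def notify(self):
        """Called when a notification (for example, from HMNFD) arrives."""
        self._event.set()

    def wait_for_next_run(self):
        """Block until notified or until the interval elapses.

        Returns "notification" or "periodic" to say what caused the wake-up.
        """
        notified = self._event.wait(timeout=self._interval)
        # Reset so the next call waits again; a notification arriving right
        # here is covered by the run that is about to start.
        self._event.clear()
        return "notification" if notified else "periodic"
```

In this model, the worst-case delay for a change that sends no notification is one full interval plus the time the monitoring pass itself takes.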
RMS reads the static ConfigMap (`rrs-mon-static`) to get the list of critical services to monitor.
RMS updates the dynamic ConfigMap (`rrs-mon-dynamic`) at regular intervals to reflect the latest status and balance of critical services, as well as the latest zone information.
For more details on the Rack Resiliency ConfigMaps, see ConfigMaps.
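To inspect either ConfigMap from a node with `kubectl` access, something like the following sketch may help. The function names are hypothetical, and the namespace is left as a parameter because this page does not state which namespace holds the ConfigMaps. The keys inside the returned `data` mapping are whatever the ConfigMap defines; none are assumed here.

```python
import json
import subprocess


def extract_data(configmap_json):
    """Return the `data` mapping from a ConfigMap serialized as JSON text."""
    return json.loads(configmap_json).get("data") or {}


def read_configmap_data(name, namespace):
    """Fetch a ConfigMap with kubectl and return its `data` mapping.

    `namespace` must be supplied by the caller; it is not specified here.
    """
    result = subprocess.run(
        ["kubectl", "get", "configmap", name, "-n", namespace, "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return extract_data(result.stdout)
```

For example, `read_configmap_data("rrs-mon-dynamic", namespace)` returns the current contents of the dynamic ConfigMap, which can lag behind the system state by the delays described above.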
RMS emits log messages at different severities during the monitoring cycle.
For example:
Log level | Source file | Example message content |
---|---|---|
INFO | lib_rms | Ceph is healthy |
WARNING | rms | List of component xnames changed to Standby state |
ERROR | rms | Failed to retrieve data from the Hardware State Manager (HSM) |
WARNING | lib_rms | List of imbalanced services |
WARNING | lib_rms | List of unconfigured services |
WARNING | lib_rms | Ceph host (e.g. `ncn-s003`) is in Offline state |
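As a rough illustration of how severities separate routine messages from ones that need attention, the sketch below uses Python's standard `logging` module. The logger names `rms` and `lib_rms` mirror the source file names in the table and are an assumption, as are the helper names `ListHandler` and `at_or_above`; the actual RMS log format is not shown here.

```python
import logging


class ListHandler(logging.Handler):
    """Collect log records in memory so they can be inspected afterwards."""

    def __init__(self):
        super().__init__()
        self.records = []

    def emit(self, record):
        self.records.append(record)


def at_or_above(records, level):
    """Return the records whose severity is at least `level`."""
    return [r for r in records if r.levelno >= level]


handler = ListHandler()
for name in ("rms", "lib_rms"):  # assumed logger names
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)
    logger.propagate = False  # keep the demo output limited to the print below

# Messages taken from the table above.
logging.getLogger("lib_rms").info("Ceph is healthy")
logging.getLogger("lib_rms").warning("List of imbalanced services")
logging.getLogger("rms").error(
    "Failed to retrieve data from the Hardware State Manager (HSM)"
)

# Show only the messages that indicate a problem.
for record in at_or_above(handler.records, logging.WARNING):
    print(f"{record.levelname} {record.name} {record.getMessage()}")
```

Filtering at `WARNING` and above drops the routine "Ceph is healthy" message and keeps the imbalance warning and the HSM error.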
For details on troubleshooting RMS, including details on how to view and interpret the RMS logs, see the Resiliency Monitoring Service section of the Troubleshooting document.