The primary goal of the System Management Health service is to enable system administrators to assess the health of their system. Operators need to quickly and efficiently troubleshoot system issues as they occur and be confident that a lack of issues indicates the system is operating normally. This service currently runs as a Helm chart on the system’s management Kubernetes cluster and monitors the health status of core system components, triggering alerts as potential issues are observed. It uses VictoriaMetrics to aggregate metrics from etcd, Kubernetes, Istio, and Ceph, all of which include support for the Prometheus API. The System Management Health service relies on the following tools:
- victoria-metrics operator: provides custom resource definitions (CRDs) that make it easy to operate VictoriaMetrics and Alertmanager instances, scrape metrics from service endpoints, and trigger alerts.
- victoria-metrics-k8s-stack: Helm chart that integrates the victoria-metrics operator, VictoriaMetrics, Alertmanager, Grafana, node exporters (DaemonSet), and kube-state-metrics to provide a monitoring solution for Kubernetes clusters.

The System Management Health service is intended to complement the System Monitoring Application (SMA) Framework, but the two are currently not integrated. System Management Health metrics are not available through the Telemetry API.
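
As a rough illustration of what scraping involves: components that support the Prometheus API typically expose their metrics as plain text over HTTP, one sample per line. The sketch below parses a small, made-up payload in that text format. The sample values and the `parse_metrics` helper are assumptions for illustration only and are not part of the System Management Health service; the parser also skips details such as unescaping label values.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical payload for illustration; real exporter output is much larger.
SAMPLE = """\
# HELP node_load1 1m load average.
# TYPE node_load1 gauge
node_load1 0.42
# HELP node_filesystem_avail_bytes Filesystem space available in bytes.
# TYPE node_filesystem_avail_bytes gauge
node_filesystem_avail_bytes{device="/dev/sda1",mountpoint="/"} 1.5e+10
"""

# metric_name, optional {labels}, value, optional integer timestamp
LINE_RE = re.compile(
    r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)(?:\s+-?\d+)?$'
)
# label_name="label value" pairs inside the braces
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')

Sample = Tuple[str, Dict[str, str], float]


def parse_metrics(text: str) -> List[Sample]:
    """Return (name, labels, value) for every sample line in the payload."""
    samples: List[Sample] = []
    for line in text.splitlines():
        line = line.strip()
        # Blank lines and '#' lines (HELP/TYPE metadata, comments) carry no samples.
        if not line or line.startswith("#"):
            continue
        match = LINE_RE.match(line)
        if match is None:
            continue  # ignore lines this simplified parser does not understand
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


if __name__ == "__main__":
    for name, labels, value in parse_metrics(SAMPLE):
        print(name, labels, value)
```

Each parsed sample is a metric name, a set of labels, and a numeric value, which is the shape of data a scraper collects before storing and aggregating it.
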
This service scrapes metrics from system components like Ceph, Kubernetes, and the hosts using node exporter, kube-state-metrics, and cadvisor. The design is flexible and supports: