CephMgrIsAbsent and CephMgrIsMissingReplicas
CephNetworkPacketsDropped
CPUThrottlingHigh
KubePodNotReady
PostgresqlFollowerReplicationLagSMA
PostgresqlHighRollbackRate
PostgresqlInactiveReplicationSlot
PostgresqlNotEnoughConnections
TargetDown
CephMgrIsAbsent and CephMgrIsMissingReplicas
If the CephMgrIsAbsent and/or CephMgrIsMissingReplicas alerts fire, use the following steps to ensure that the prometheus module has been enabled for Ceph. Run these steps on ncn-s001:
ceph mgr module ls | jq '.enabled_modules'
Example output:
[
"cephadm",
"iostat",
"restful"
]
If prometheus is missing from the output, enable it with the following command:
ceph mgr module enable prometheus
Confirm the module is now enabled:
ceph mgr module ls | jq '.enabled_modules'
Example output:
[
"cephadm",
"iostat",
"prometheus",
"restful"
]
The CephMgrIsAbsent and CephMgrIsMissingReplicas alerts should now clear in Prometheus.
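The check-then-enable steps above can be combined into one idempotent helper. The following is a minimal sketch, assuming it is run on ncn-s001 with both ceph and jq available (as in the commands above). The function names mgr_module_enabled and ensure_prometheus_module are hypothetical and are not part of Ceph.

```shell
#!/usr/bin/env bash
# Sketch: enable the Ceph prometheus manager module only when it is not
# already listed under .enabled_modules. Function names are illustrative.

# Return 0 if the given module name appears in .enabled_modules.
mgr_module_enabled() {
    local module="$1"
    ceph mgr module ls \
        | jq -e --arg m "$module" '.enabled_modules | index($m) != null' > /dev/null
}

# Enable the prometheus module if needed and report which path was taken.
ensure_prometheus_module() {
    if mgr_module_enabled prometheus; then
        echo "prometheus module already enabled"
    else
        ceph mgr module enable prometheus
        echo "prometheus module enabled"
    fi
}
```

After sourcing the functions, run ensure_prometheus_module. Running it a second time should only report that the module is already enabled.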
CephNetworkPacketsDropped
The CephNetworkPacketsDropped alert does not necessarily indicate that packets are being dropped on an interface on a storage node. This alert will be renamed to be more generic in a future release. If this alert fires, inspect the IP address in the alert details to determine which node is affected (it can be a storage, master, or worker node). If the interface in question is determined to be healthy, then this alert can be ignored.
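One input when judging whether an interface is healthy is the kernel's cumulative drop counters, which Linux exposes under /sys/class/net/<interface>/statistics. The following is a minimal sketch of reading them on the node identified by the alert. The function name show_drops and the SYSFS_NET override are hypothetical, and these counters are only one signal, not a complete health check.

```shell
#!/usr/bin/env bash
# Sketch: print the cumulative dropped-packet counters for one interface.
# SYSFS_NET exists only so the path can be overridden for testing; on a
# real node it defaults to /sys/class/net.
SYSFS_NET="${SYSFS_NET:-/sys/class/net}"

show_drops() {
    local iface="$1"
    local stats="${SYSFS_NET}/${iface}/statistics"
    if [ ! -d "$stats" ]; then
        echo "no such interface: ${iface}" >&2
        return 1
    fi
    echo "${iface} rx_dropped=$(cat "${stats}/rx_dropped") tx_dropped=$(cat "${stats}/tx_dropped")"
}
```

Call show_drops with the interface name in question. Because the counters are cumulative since boot, running it twice some time apart shows whether drops are still increasing.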
CPUThrottlingHigh
Alerts for CPUThrottlingHigh on gatekeeper-audit can be ignored. This pod is not utilized in this release.
Alerts for CPUThrottlingHigh on gatekeeper-controller-manager can be ignored. These pods have low CPU requests, and it is normal for their resource usage to spike when they are in use.
Alerts for CPUThrottlingHigh on smartmon pods can be ignored. It is normal for the resource usage of smartmon pods to spike when they are polling. This will be fixed in a future release.
Alerts for CPUThrottlingHigh on CFS services such as cfs-batcher and cfs-trust can be ignored. Because CFS is idle most of the time, these services have low CPU requests, and it is normal for CFS service resource usage to spike when it is in use.
KubePodNotReady
Alerts for KubePodNotReady on cray-crus may be ignored if the Slurm software has not been installed. The cray-crus pod interacts with Slurm to manage compute node rolling upgrades.
PostgresqlFollowerReplicationLagSMA
Alerts for PostgresqlFollowerReplicationLagSMA on sma-postgres-cluster pods with slot_name="permanent_physical_1" can be ignored. This slot_name is disabled and will be removed in a future release.
PostgresqlHighRollbackRate
Alerts for PostgresqlHighRollbackRate on spire-postgres and smd-postgres pods can be ignored. This is caused by an idle session that requires a timeout. This will be fixed in a future release.
PostgresqlInactiveReplicationSlot
Alerts for PostgresqlInactiveReplicationSlot on sma-postgres-cluster pods with slot_name="permanent_physical_1" can be ignored. This slot_name is disabled and will be removed in a future release.
PostgresqlNotEnoughConnections
Alerts for PostgresqlNotEnoughConnections for datname="foo" and datname="bar" can be ignored. These databases are not used and will be removed in a future release.
TargetDown
Many of the alerts for TargetDown for sysmgmt-health/cray-sysmgmt-health-kubernetes-pods/0 are due to job pods that have Completed and no longer have an active endpoint that can be scraped. If the target that is down is from a job pod that has completed, the TargetDown alert for that pod can be ignored. This will be fixed in a future release.
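To confirm that a down target belongs to a completed job pod, the pod's phase can be checked with kubectl. A pod shown as Completed has the phase Succeeded. The following is a minimal sketch. The function name pod_is_completed is hypothetical, and the namespace and pod name are assumed to come from the alert details.

```shell
#!/usr/bin/env bash
# Sketch: report whether a pod has run to completion.
# Returns 0 if the pod phase is Succeeded, 1 if it is in any other phase,
# and 2 if the pod could not be queried.
pod_is_completed() {
    local namespace="$1" pod="$2" phase
    phase="$(kubectl get pod -n "$namespace" "$pod" -o jsonpath='{.status.phase}')" || return 2
    [ "$phase" = "Succeeded" ]
}
```

For example, pod_is_completed NAMESPACE POD_NAME && echo "TargetDown for this pod can be ignored". To list every completed pod at once, kubectl get pods -A --field-selector=status.phase=Succeeded serves the same purpose.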