HEALTH_ERR Module 'devicehealth' has failed: table Device already exists

In the event that ceph health detail or ceph -s shows the error below, follow this procedure to fix the issue.
Error Message:
health: HEALTH_ERR
Module 'devicehealth' has failed
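For additional context, ceph health detail can be run from any storage node; output similar to the following (an illustrative example, exact wording may differ between Ceph releases) identifies the failed mgr module:
ncn-s001:~ # ceph health detail
HEALTH_ERR Module 'devicehealth' has failed: table Device already exists
[ERR] MGR_MODULE_ERROR: Module 'devicehealth' has failed: table Device already exists
    Module 'devicehealth' has failed: table Device already exists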
Stop the Ceph mgr services via systemd on ncn-s001, ncn-s002, and ncn-s003.
Find the systemd unit name.
On each node listed above run the following:
ncn-s001:~ # cephadm ls|jq -r '.[]|select(.systemd_unit|contains ("mgr"))|.systemd_unit'
ceph-660ccbec-a6c1-11ed-af32-b8599ff91d22@mgr.ncn-s001.xufexf
Stop the service.
On each node listed above run the following:
ncn-s001:~ # systemctl stop ceph-660ccbec-a6c1-11ed-af32-b8599ff91d22@mgr.ncn-s001.xufexf
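Alternatively, if passwordless SSH between the storage nodes is available (an assumption, not a requirement of this procedure), the unit can be looked up and stopped on all three nodes from a single host with a loop like the sketch below:
# Illustrative sketch only: assumes passwordless SSH to ncn-s001, ncn-s002, and ncn-s003.
for node in ncn-s001 ncn-s002 ncn-s003; do
    unit=$(ssh "$node" "cephadm ls | jq -r '.[]|select(.systemd_unit|contains(\"mgr\"))|.systemd_unit'")
    echo "Stopping ${unit} on ${node}"
    ssh "$node" "systemctl stop ${unit}"
done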
Remove the Ceph pool containing the corrupt table.
The following commands will be executed once from ncn-s001, ncn-s002, or ncn-s003.
Set flag to allow pool deletion.
ncn-s001:~ # ceph config set mon mon_allow_pool_delete true
Delete the pool.
ncn-s001:~ # ceph osd pool rm .mgr .mgr --yes-i-really-really-mean-it
The output should contain pool '.mgr' removed.
Unset flag to prohibit pool deletion.
ncn-s001:~ # ceph config set mon mon_allow_pool_delete false
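As a quick sanity check before restarting the mgr daemons, confirm the pool is gone; in this minimal sketch the grep prints nothing and the echo fires once .mgr has been removed:
ncn-s001:~ # ceph osd lspools | grep '\.mgr' || echo ".mgr pool removed"
.mgr pool removed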
Start the Ceph mgr services via systemd on ncn-s001, ncn-s002, and ncn-s003.
Find the systemd unit name.
On each node listed above run the following:
ncn-s001:~ # cephadm ls|jq -r '.[]|select(.systemd_unit|contains ("mgr"))|.systemd_unit'
ceph-660ccbec-a6c1-11ed-af32-b8599ff91d22@mgr.ncn-s001.xufexf
Start the service.
On each node listed above run the following:
ncn-s001:~ # systemctl start ceph-660ccbec-a6c1-11ed-af32-b8599ff91d22@mgr.ncn-s001.xufexf
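Under the same passwordless SSH assumption as the earlier sketch, the unit can be started and checked on all three nodes in one loop:
# Illustrative sketch only: assumes passwordless SSH to ncn-s001, ncn-s002, and ncn-s003.
for node in ncn-s001 ncn-s002 ncn-s003; do
    unit=$(ssh "$node" "cephadm ls | jq -r '.[]|select(.systemd_unit|contains(\"mgr\"))|.systemd_unit'")
    ssh "$node" "systemctl start ${unit} && systemctl is-active ${unit}"
done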
Verify Ceph mgr is operational.
Verify the .mgr pool was automatically created.
ncn-s001:~ # ceph osd lspools
This will list the pools. Verify that the .mgr pool is present. It can take a minute or so for the pool to be created if the cluster is busy. If the pool is not created, verify that the mgr processes are running using the following step.
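Because the pool can take a minute or two to appear on a busy cluster, a small polling loop such as the sketch below (the roughly two-minute timeout is arbitrary) avoids re-running the command by hand:
# Poll for up to ~2 minutes for the .mgr pool to reappear (timeout value is arbitrary).
for i in $(seq 1 24); do
    if ceph osd lspools | grep -q '\.mgr'; then
        echo ".mgr pool present"
        break
    fi
    sleep 5
done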
Verify all 3 mgr instances are running.
ncn-s001:~ # ceph -s
There should be 3 mgr processes in the output, as in the example below:
  cluster:
    id:     660ccbec-a6c1-11ed-af32-b8599ff91d22
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum ncn-s001,ncn-s003,ncn-s002 (age 12m)
    mgr: ncn-s001.xufexf(active, since 44s), standbys: ncn-s003.uieiom, ncn-s002.zlhlvs
    mds: 2/2 daemons up, 3 standby, 1 hot standby
    osd: 24 osds: 24 up (since 11m), 24 in (since 11m)
    rgw: 3 daemons active (3 hosts, 1 zones)
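To confirm the mgr state without scanning the full status output, the mgr line can be filtered out of ceph -s, or ceph mgr stat can be queried; the latter reports the active mgr and its availability (the exact JSON fields vary by Ceph release):
ncn-s001:~ # ceph -s | grep 'mgr:'
ncn-s001:~ # ceph mgr stat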
Additional verification steps.
The following commands can be executed from ncn-s001, ncn-s002, or ncn-s003.
Fetch the Ceph Prometheus endpoint.
ncn-s001:~ # ceph mgr services
Expected output:
IMPORTANT: The output below is an example and IP addresses may vary, so make sure that the correct endpoint is obtained from the Ceph cluster.
{
"dashboard": "https://10.252.1.11:8443/",
"prometheus": "http://10.252.1.11:9283/" <--- This is the url you need.
}
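Because ceph mgr services emits JSON, the Prometheus URL can also be captured into a shell variable with jq instead of being copied by hand (a convenience sketch; the variable name is arbitrary):
ncn-s001:~ # PROM_URL=$(ceph mgr services | jq -r '.prometheus')
ncn-s001:~ # echo "${PROM_URL}"
http://10.252.1.11:9283/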
Curl against the endpoint to dump metrics.
ncn-s001:~ # curl -s http://10.252.1.11:9283/metrics
Expected output:
# HELP ceph_health_status Cluster health status
# TYPE ceph_health_status untyped
ceph_health_status 0.0
# HELP ceph_mon_quorum_status Monitors in quorum
# TYPE ceph_mon_quorum_status gauge
ceph_mon_quorum_status{ceph_daemon="mon.ncn-s001"} 1.0
ceph_mon_quorum_status{ceph_daemon="mon.ncn-s003"} 1.0
ceph_mon_quorum_status{ceph_daemon="mon.ncn-s002"} 1.0
# HELP ceph_fs_metadata FS Metadata
# TYPE ceph_fs_metadata untyped
ceph_fs_metadata{data_pools="3",fs_id="1",metadata_pool="2",name="cephfs"} 1.0
ceph_fs_metadata{data_pools="9",fs_id="2",metadata_pool="8",name="admin-tools"} 1.0
...
This is a small sample of the output. If the curl succeeds, then the active mgr instance is working and will ensure that the standby mgr daemons are functional and ready.
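As a final spot check, the cluster health gauge exported by the mgr can be pulled directly from the metrics endpoint; a value of 0.0 corresponds to HEALTH_OK (the IP address below is the example endpoint from above and will differ per cluster):
ncn-s001:~ # curl -s http://10.252.1.11:9283/metrics | grep '^ceph_health_status '
ceph_health_status 0.0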