Identify down OSDs and manually bring them back up.
Troubleshoot down OSDs reported by ceph health detail. Ensuring that OSDs are operational and that data is balanced across them reduces the likelihood of hotspots being created.
This procedure requires admin privileges.
Identify the down OSDs.
ncn-m/s(001/2/3)# ceph osd tree down
Example output:
ID  CLASS  WEIGHT    TYPE NAME          STATUS  REWEIGHT  PRI-AFF
-1         62.87558  root default
-7         20.95853      host ncn-s002
 1    ssd   3.49309          osd.1       down    1.00000   1.00000
 3    ssd   3.49309          osd.3       down    1.00000   1.00000
 7    ssd   3.49309          osd.7       down    1.00000   1.00000
10    ssd   3.49309          osd.10      down    1.00000   1.00000
13    ssd   3.49309          osd.13      down    1.00000   1.00000
16    ssd   3.49309          osd.16      down    1.00000   1.00000
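Optionally, collect just the numeric IDs of the down OSDs in a scriptable form for the steps below. This is a minimal sketch that assumes jq is installed on the node; confirm the list matches the output above before acting on it.

ncn-m/s(001/2/3)# ceph osd tree down -f json | jq -r '.nodes[] | select(.type=="osd") | .id'

The command prints one OSD ID per line (1, 3, 7, 10, 13, and 16 in the example above).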
Restart the down OSDs.
Option 1:
Restart the OSD using ceph orch.
ncn-m/s(001/2/3)# ceph orch daemon restart osd.<number>
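If several OSDs are down, they can be restarted in one pass. This is a minimal sketch, assuming jq is available and that restarting every down OSD at once is acceptable for the cluster:

ncn-m/s(001/2/3)# for id in $(ceph osd tree down -f json | jq -r '.nodes[] | select(.type=="osd") | .id'); do ceph orch daemon restart "osd.${id}"; done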
Option 2:
Check the logs for the down OSD.
Use the OSD ID from the output above to locate the node hosting it.
ncn-m/s(001/2/3)# ceph osd find OSD_ID
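For example, osd.7 from the output above can be located, and its recent journal entries can then be viewed on the node that hosts it. The osd.7 value and the -n 100 argument are illustrative, and the cephadm logs command assumes the daemon is managed by cephadm:

ncn-m/s(001/2/3)# ceph osd find 7

The output includes the host that runs the OSD (ncn-s002 in this example). On that host:

ncn-s002# cephadm logs --name osd.7 -- -n 100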
Manually restart the OSD.
This step must be done on the node with the reported down OSD.
ceph orch daemon restart osd.<number>
Troubleshooting: If the service does not restart with ceph orch, restart it by following Manage Ceph Services.
Verify the OSDs are running again.
ncn-m/s(001/2/3)# ceph osd tree
Example output (excerpt):
ID  CLASS  WEIGHT    TYPE NAME          STATUS  REWEIGHT  PRI-AFF
-1         62.87558  root default
-7         20.95853      host ncn-s002
 1    ssd   3.49309          osd.1       up      1.00000   1.00000
 3    ssd   3.49309          osd.3       up      1.00000   1.00000
 7    ssd   3.49309          osd.7       up      1.00000   1.00000
10    ssd   3.49309          osd.10      up      1.00000   1.00000
13    ssd   3.49309          osd.13      up      1.00000   1.00000
16    ssd   3.49309          osd.16      up      1.00000   1.00000
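As a quick supplemental check, ceph osd tree down should no longer list any OSDs, and the cluster summary should show all OSDs up once the health warning clears:

ncn-m/s(001/2/3)# ceph osd stat
ncn-m/s(001/2/3)# ceph health detail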
If the OSD goes down again, check dmesg on the node hosting it for drive failures.
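A minimal sketch of that check, run on the node hosting the OSD (ncn-s002 in this example); the device pattern is an assumption and should be narrowed to the devices backing the OSD, which ceph osd metadata <number> reports:

ncn-s002# dmesg -T | grep -iE 'nvme|sd[a-z]|i/o error|medium error' | tail -n 50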