ncn-s001 console is stuck 'Sleeping for five seconds waiting Ceph to be healthy…'
NOTES: It can be appropriate for ncn-s001 to wait with this message for a while. To check whether the Ceph OSDs are still coming up, run ceph -s and check the number of OSDs. After a couple of minutes, run ceph -s again and see if there are more OSDs. If the OSD count is still increasing, continue to wait.
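For example, the OSD count can be watched over time with a small loop such as the one below (a minimal sketch; the 60-second interval is arbitrary). Stop it with Ctrl-C once the count levels off.
# Print a one-line OSD summary ("N osds: N up, N in") once a minute
while true; do
    ceph osd stat
    sleep 60
done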
(ncn-s001) Check Ceph health.
ceph health detail
ceph -s
If Ceph health shows the following health warning:
HEALTH_WARN 1 pool(s) do not have an application enabled
[WRN] POOL_APP_NOT_ENABLED: 1 pool(s) do not have an application enabled
application not enabled on pool '.mgr'
use 'ceph osd pool application enable <pool-name> <app-name>', where <app-name> is 'cephfs', 'rbd', 'rgw', or freeform for custom applications.
(ncn-s001) Then enable the .mgr pool with the following command.
ceph osd pool application enable .mgr mgr
Expected output:
enabled application 'mgr' on pool '.mgr'
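Afterward, re-check Ceph health to confirm that the POOL_APP_NOT_ENABLED warning has cleared; other warnings may still appear while OSDs are coming up.
ceph health detail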
If Ceph health does not show the warning above, then the storage node install will most likely finish after waiting longer. Other Ceph troubleshooting procedures are in the troubleshooting section of the utility storage documentation.
If the creation of Ceph storage on the utility storage nodes fails in the following scenario, then the Ceph storage might need to be reinitialized.
IMPORTANT (FOR NODE INSTALLS/REINSTALLS ONLY): If the Ceph install failed, check the following:
ceph osd tree
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 31.43875 root default
-3 10.47958 host ncn-s001
2 ssd 1.74660 osd.2 up 1.00000 1.00000
3 ssd 1.74660 osd.3 up 1.00000 1.00000
6 ssd 1.74660 osd.6 up 1.00000 1.00000
9 ssd 1.74660 osd.9 up 1.00000 1.00000
12 ssd 1.74660 osd.12 up 1.00000 1.00000
15 ssd 1.74660 osd.15 up 1.00000 1.00000
-5 10.47958 host ncn-s002
0 ssd 1.74660 osd.0 down 1.00000 1.00000 <-- the bad OSD
4 ssd 1.74660 osd.4 up 1.00000 1.00000
7 ssd 1.74660 osd.7 up 1.00000 1.00000
10 ssd 1.74660 osd.10 up 1.00000 1.00000
13 ssd 1.74660 osd.13 up 1.00000 1.00000
16 ssd 1.74660 osd.16 up 1.00000 1.00000
-7 10.47958 host ncn-s003
1 ssd 1.74660 osd.1 up 1.00000 1.00000
5 ssd 1.74660 osd.5 up 1.00000 1.00000
8 ssd 1.74660 osd.8 up 1.00000 1.00000
11 ssd 1.74660 osd.11 up 1.00000 1.00000
14 ssd 1.74660 osd.14 up 1.00000 1.00000
17 ssd 1.74660 osd.17 up 1.00000 1.00000
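To list only the OSDs that are down, rather than reading through the full tree, the tree can be filtered by state and a specific OSD can be located as shown below (a minimal sketch).
# Show only OSDs currently reported as down
ceph osd tree down
# Report the host and CRUSH location of a specific OSD, for example osd.0
ceph osd find 0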
Get more information about the OSDs on the affected host.
ceph orch ps --daemon-type osd ncn-s002
NAME HOST STATUS REFRESHED AGE VERSION IMAGE NAME IMAGE ID CONTAINER ID
osd.0 ncn-s002 running (23h) 7m ago 2d 15.2.8 registry.local/ceph/ceph:v15.2.8 5553b0cb212c 98859a09a946
osd.10 ncn-s002 running (23h) 7m ago 2d 15.2.8 registry.local/ceph/ceph:v15.2.8 5553b0cb212c 808162b421b8
osd.13 ncn-s002 running (23h) 7m ago 2d 15.2.8 registry.local/ceph/ceph:v15.2.8 5553b0cb212c 594d6fd03361
osd.16 ncn-s002 running (23h) 7m ago 2d 15.2.8 registry.local/ceph/ceph:v15.2.8 5553b0cb212c 726295e3625f
osd.4 ncn-s002 running (23h) 7m ago 2d 15.2.8 registry.local/ceph/ceph:v15.2.8 5553b0cb212c ee1987d99e5a
osd.7 ncn-s002 running (23h) 7m ago 2d 15.2.8 registry.local/ceph/ceph:v15.2.8 5553b0cb212c 78a89eaef92a
Optionally, leave off the host name to return all of the OSD processes in the cluster.
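For example, to list the OSD daemons running on every storage node in the cluster:
ceph orch ps --daemon-type osd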
In order to zap a single OSD, it is necessary to gather some information.
(ncn-s#) List the devices on that host with ceph orch device ls <hostname>.
ceph orch device ls ncn-s002 --wide
Hostname Path Type Transport RPM Vendor Model Serial Size Health Ident Fault Available Reject Reasons
ncn-s002 /dev/sdc ssd Unknown Unknown ATA SAMSUNG MZ7LH1T9 S455NY0M811867 1920G Unknown N/A N/A No locked, LVM detected, Insufficient space (<10 extents) on vgs
ncn-s002 /dev/sdd ssd Unknown Unknown ATA SAMSUNG MZ7LH1T9 S455NY0M812407 1920G Unknown N/A N/A No locked, LVM detected, Insufficient space (<10 extents) on vgs
ncn-s002 /dev/sde ssd Unknown Unknown ATA SAMSUNG MZ7LH1T9 S455NY0M812406 1920G Unknown N/A N/A No locked, LVM detected, Insufficient space (<10 extents) on vgs
ncn-s002 /dev/sdf ssd Unknown Unknown ATA SAMSUNG MZ7LH1T9 S455NY0M812405 1920G Unknown N/A N/A No locked, LVM detected, Insufficient space (<10 extents) on vgs
ncn-s002 /dev/sdg ssd Unknown Unknown ATA SAMSUNG MZ7LH1T9 S455NY0M811921 1920G Unknown N/A N/A No locked, LVM detected, Insufficient space (<10 extents) on vgs
ncn-s002 /dev/sdh ssd Unknown Unknown ATA SAMSUNG MZ7LH1T9 S455NY0M811873 1920G Unknown N/A N/A No locked, LVM detected, Insufficient space (<10 extents) on vgs
The locked status in the Reject Reasons column is likely the result of a wipe failure.
(ncn-s#) Find the drive path.
cephadm ceph-volume lvm list
Inferring fsid 8f4dd38b-ee84-4d29-8305-1ef24e61a5d8
Using recent Ceph image docker.io/ceph/ceph@sha256:16d37584df43bd6545d16e5aeba527de7d6ac3da3ca7b882384839d2d86acc7d
/usr/bin/podman: stdout
/usr/bin/podman: stdout
/usr/bin/podman: stdout ====== osd.0 =======
/usr/bin/podman: stdout
/usr/bin/podman: stdout [block] /dev/ceph-380453cf-4581-4616-b95e-30a8743bece0/osd-data-59bcf0c9-5867-41c3-8e40-2e99232cf8e9
/usr/bin/podman: stdout
/usr/bin/podman: stdout block device /dev/ceph-380453cf-4581-4616-b95e-30a8743bece0/osd-data-59bcf0c9-5867-41c3-8e40-2e99232cf8e9
/usr/bin/podman: stdout block uuid 54CjSj-kxEs-df0N-13Vs-miIF-g2KH-sX2UMQ
/usr/bin/podman: stdout cephx lockbox secret
/usr/bin/podman: stdout cluster fsid 8f4dd38b-ee84-4d29-8305-1ef24e61a5d8
/usr/bin/podman: stdout cluster name ceph
/usr/bin/podman: stdout crush device class None
/usr/bin/podman: stdout encrypted 0
/usr/bin/podman: stdout osd fsid b2eb119c-4f45-430b-96b0-bad9e8b9aca6
/usr/bin/podman: stdout osd id 0 <-- the OSD number
/usr/bin/podman: stdout osdspec affinity
/usr/bin/podman: stdout type block
/usr/bin/podman: stdout vdo 0
/usr/bin/podman: stdout devices /dev/sdf <--the path
/usr/bin/podman: stdout
The above output is truncated for the purposes of this example.
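When only the OSD ID and backing device path are needed, the relevant lines can be pulled out of the full listing with a filter like the following (a minimal sketch; the pattern assumes the field labels shown above).
# Show just the OSD section headers, OSD IDs, and backing device paths
cephadm ceph-volume lvm list 2>/dev/null | grep -E '====== osd|osd id|devices'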
(ncn-s#) Zap a single device with ceph orch device zap <hostname> <device path>.
ceph orch device zap ncn-s002 /dev/sdf
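After the zap completes, confirm that the device is no longer rejected. The inventory can take a minute or two to update (a minimal sketch; the --refresh flag forces a new scan).
# The zapped device should eventually show Available: Yes with no Reject Reasons
ceph orch device ls ncn-s002 --wide --refresh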