The Ceph service status script checks the overall status of Ceph and then verifies that status against the individual Ceph storage nodes.
/opt/cray/tests/install/ncn/scripts/ceph-service-status.sh
usage: ceph-service-status.sh # runs a simple Ceph health check
ceph-service-status.sh -n <node> -s <service> # checks a single service on a single node
ceph-service-status.sh -n <node> -a true # checks all Ceph services on a node
ceph-service-status.sh -A true # checks all Ceph services on all nodes in a rolling fashion
ceph-service-status.sh -s <service name> # finds where the service is running and reports its status
Important: By default, the output of this command is not verbose. This is to accommodate goss testing. For manual runs, use the -v true flag.
Troubleshooting: If the message parse error: Invalid numeric literal at line 1, column 5 is displayed, the cached SSH keys in known_hosts are no longer valid. To fix this, clear the file by running > ~/.ssh/known_hosts and then re-run the script, which will update the keys.
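The fix described above can be wrapped in a small shell helper. This is a minimal sketch and not part of ceph-service-status.sh: the function name reset_known_hosts and the .bak backup copy are assumptions added for illustration. Only the truncation of ~/.ssh/known_hosts comes from the troubleshooting note.

```shell
# Hypothetical helper: the name and the backup step are assumptions,
# not something ceph-service-status.sh provides.
reset_known_hosts() {
    # Default to the file named in the troubleshooting note;
    # a different path may be passed as the first argument.
    known_hosts="${1:-$HOME/.ssh/known_hosts}"

    # Keep a copy of the old entries before clearing them (assumed precaution).
    if [ -f "$known_hosts" ]; then
        cp -p "$known_hosts" "$known_hosts.bak"
    fi

    # Same effect as running: > ~/.ssh/known_hosts
    : > "$known_hosts"
}
```

After calling reset_known_hosts, re-run the script so that it repopulates the keys.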
/opt/cray/tests/install/ncn/scripts/ceph-service-status.sh -v true
Example output:
FSID: c84ecf41-c535-4588-96c3-f6892bbd81ce FSID_STR: ceph-c84ecf41-c535-4588-96c3-f6892bbd81ce
Ceph is reporting a status of HEALTH_OK
Updating SSH keys..
Tests run: 1 Tests Passed: 1
/opt/cray/tests/install/ncn/scripts/ceph-service-status.sh -n ncn-s001 -v true -s mon.ncn-s001
Example output:
FSID: c84ecf41-c535-4588-96c3-f6892bbd81ce FSID_STR: ceph-c84ecf41-c535-4588-96c3-f6892bbd81ce
Ceph is reporting a status of HEALTH_OK
Updating SSH keys..
HOST: ncn-s001#######################
Service mon.ncn-s001 on ncn-s001 has been restarted and up for 9280 seconds
mon.ncn-s001's status is: running
Service unit name: ceph-c84ecf41-c535-4588-96c3-f6892bbd81ce-mon.ncn-s001
Status: running
Tests run: 2 Tests Passed: 2
/opt/cray/tests/install/ncn/scripts/ceph-service-status.sh -n ncn-s001 -a true -v true
Example output:
FSID: c84ecf41-c535-4588-96c3-f6892bbd81ce FSID_STR: ceph-c84ecf41-c535-4588-96c3-f6892bbd81ce
Ceph is reporting a status of HEALTH_OK
Updating SSH keys..
HOST: ncn-s001#######################
Service mds.cephfs.ncn-s001.rmisfx on ncn-s001 has been restarted and up for 9206 seconds
mds.cephfs.ncn-s001.rmisfx's status is: running
Service unit name: ceph-c84ecf41-c535-4588-96c3-f6892bbd81ce-mds.cephfs.ncn-s001.rmisfx
Status: running
Service mgr.ncn-s001 on ncn-s001 has been restarted and up for 9201 seconds
mgr.ncn-s001's status is: running
Service unit name: ceph-c84ecf41-c535-4588-96c3-f6892bbd81ce-mgr.ncn-s001
Status: running
Service mon.ncn-s001 on ncn-s001 has been restarted and up for 9228 seconds
mon.ncn-s001's status is: running
Service unit name: ceph-c84ecf41-c535-4588-96c3-f6892bbd81ce-mon.ncn-s001
Status: running
Service node-exporter.ncn-s001 on ncn-s001 has been restarted and up for 1231 seconds
node-exporter.ncn-s001's status is: running
Service unit name: ceph-c84ecf41-c535-4588-96c3-f6892bbd81ce-node-exporter.ncn-s001
Status: running
Service on ncn-s001 is reporting up for 9209 seconds
osd.0's status is reporting up: 1 in: 1
Service unit name: ceph-c84ecf41-c535-4588-96c3-f6892bbd81ce-osd.0
Status: running
Service on ncn-s001 is reporting up for 9200 seconds
osd.11's status is reporting up: 1 in: 1
Service unit name: ceph-c84ecf41-c535-4588-96c3-f6892bbd81ce-osd.11
Status: running
Service on ncn-s001 is reporting up for 9208 seconds
osd.14's status is reporting up: 1 in: 1
Service unit name: ceph-c84ecf41-c535-4588-96c3-f6892bbd81ce-osd.14
Status: running
Service on ncn-s001 is reporting up for 9206 seconds
osd.17's status is reporting up: 1 in: 1
Service unit name: ceph-c84ecf41-c535-4588-96c3-f6892bbd81ce-osd.17
Status: running
Service on ncn-s001 is reporting up for 9213 seconds
osd.5's status is reporting up: 1 in: 1
Service unit name: ceph-c84ecf41-c535-4588-96c3-f6892bbd81ce-osd.5
Status: running
Service on ncn-s001 is reporting up for 9207 seconds
osd.8's status is reporting up: 1 in: 1
Service unit name: ceph-c84ecf41-c535-4588-96c3-f6892bbd81ce-osd.8
Status: running
Service rgw.site1.ncn-s001.kvxhwi on ncn-s001 has been restarted and up for 9210 seconds
rgw.site1.ncn-s001.kvxhwi's status is: running
Service unit name: ceph-c84ecf41-c535-4588-96c3-f6892bbd81ce-rgw.site1.ncn-s001.kvxhwi
Status: running
Tests run: 12 Tests Passed: 12
/opt/cray/tests/install/ncn/scripts/ceph-service-status.sh -v true -s mon
Example output:
FSID: c84ecf41-c535-4588-96c3-f6892bbd81ce FSID_STR: ceph-c84ecf41-c535-4588-96c3-f6892bbd81ce
Ceph is reporting a status of HEALTH_OK
Updating SSH keys..
HOST: ncn-s001#######################
Service mon on ncn-s001 has been restarted and up for 9547 seconds
mon's status is: running
Service unit name: ceph-c84ecf41-c535-4588-96c3-f6892bbd81ce-mon.ncn-s001
Status: running
HOST: ncn-s002#######################
Service mon on ncn-s002 has been restarted and up for 5643 seconds
mon's status is: running
Service unit name: ceph-c84ecf41-c535-4588-96c3-f6892bbd81ce-mon.ncn-s002
Status: running
HOST: ncn-s003#######################
Service mon on ncn-s003 has been restarted and up for 2588 seconds
mon's status is: running
Service unit name: ceph-c84ecf41-c535-4588-96c3-f6892bbd81ce-mon.ncn-s003
Status: running
Tests run: 4 Tests Passed: 4
The output of the following command is similar to the output above, except that it shows all services on all nodes. It is omitted here for brevity.
/opt/cray/tests/install/ncn/scripts/ceph-service-status.sh -v true -A true
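Each verbose example above ends with a line of the form Tests run: N Tests Passed: M. When output from a long run such as -A true has been saved, that line can be checked mechanically. The following is a rough sketch that assumes the summary format shown in the examples stays the same. The function name summary_ok is hypothetical and not part of the script.

```shell
# Hypothetical helper: reads verbose output on stdin and succeeds only if the
# last "Tests run: N Tests Passed: M" line reports N equal to M.
# Assumes the summary format shown in the example output.
summary_ok() {
    awk '
        /Tests run: [0-9]+ Tests Passed: [0-9]+/ {
            run = $3; passed = $6; seen = 1   # remember the most recent summary line
        }
        END {
            # Exit 0 only when a summary line was seen and the counts match.
            exit !(seen && run == passed)
        }
    '
}
```

For example, the verbose output can be piped into summary_ok, and the exit status of the function then tells whether every test passed.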
IMPORTANT: This script can be run without the verbose flag, followed by echo $? to display the return code. A return code of 0 means the check was clean. A return code of 1 or greater means that there was an issue. In the latter case, re-run the command with the -v true flag.
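The return-code workflow above can be sketched as a small wrapper. This is only an illustration: the function name run_ceph_check and the CEPH_STATUS_SCRIPT variable are assumptions, and the wrapper expects to be called without the -v flag so that the first run stays quiet.

```shell
# Path from this document; may be overridden through the environment (assumed convenience).
CEPH_STATUS_SCRIPT="${CEPH_STATUS_SCRIPT:-/opt/cray/tests/install/ncn/scripts/ceph-service-status.sh}"

# Hypothetical wrapper: run the check quietly, then re-run it verbosely on failure.
run_ceph_check() {
    rc=0
    # Quiet run first; capture the return code instead of aborting.
    "$CEPH_STATUS_SCRIPT" "$@" || rc=$?

    if [ "$rc" -eq 0 ]; then
        echo "Ceph check was clean (return code 0)"
    else
        echo "Ceph check returned $rc; re-running with -v true" >&2
        # The verbose re-run is for diagnosis only; its own status is ignored.
        "$CEPH_STATUS_SCRIPT" "$@" -v true || true
    fi

    # Report the return code of the original quiet run to the caller.
    return "$rc"
}
```

For example, run_ceph_check -n ncn-s001 -a true checks all Ceph services on ncn-s001 and prints the verbose details only if the quiet run fails.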