Use this procedure to troubleshoot a Ceph MDS that is reporting slow requests. Complete the Identify Ceph Latency Issues procedure first.
IMPORTANT: Some commands in this procedure must be run on the host(s) running the MDS daemon(s). The other commands can be run from any of the ceph-mon nodes.
NOTICE: These steps are based on the upstream Ceph documentation.
Identify the active MDS.
ncn-s00(1/2/3)# ceph fs status -f json-pretty|jq -r '.mdsmap[]|select(.state=="active")|.name'
cephfs.ncn-s003.ihwkop
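The jq filter above can also be expressed as a short Python sketch, which may be handy when the status output has already been saved to a file. It assumes only the fields the jq filter itself uses (mdsmap, state, name). The sample document and the standby daemon name in it are hypothetical, not captured from a real cluster.

```python
import json


def active_mds_names(fs_status_json):
    """Return the names of all MDS daemons whose state is "active".

    Mirrors: jq -r '.mdsmap[]|select(.state=="active")|.name'
    """
    status = json.loads(fs_status_json)
    return [
        mds["name"]
        for mds in status.get("mdsmap", [])
        if mds.get("state") == "active"
    ]


# Hypothetical, trimmed-down stand-in for `ceph fs status -f json-pretty`.
# Real output carries many more fields; the standby name is invented.
SAMPLE_FS_STATUS = """
{
  "mdsmap": [
    {"name": "cephfs.ncn-s003.ihwkop", "state": "active"},
    {"name": "cephfs.ncn-s001.abcdef", "state": "standby"}
  ]
}
"""

if __name__ == "__main__":
    for name in active_mds_names(SAMPLE_FS_STATUS):
        print(name)
```

If more than one MDS is active, the function returns every active name, and the remaining steps would need to be repeated for each one.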
ssh to the host running the active MDS.
Enter a cephadm shell.
ncn-s003# cephadm shell
Example output:
Inferring fsid 7350865a-0b21-11ec-b9fa-fa163e06c459
Inferring config /var/lib/ceph/7350865a-0b21-11ec-b9fa-fa163e06c459/mon.ncn-s003/config
Using recent ceph image arti.dev.cray.com/third-party-docker-stable-local/ceph/ceph@sha256:70536e31b29a4241999ec4fd13d93e5860a5ffdc5467911e57e6bf04dfe68337
[ceph: root@ncn-s003 /]#
NOTE: Messages such as “WARNING: The same type, major and minor should not be used for multiple devices” can be ignored. There is an upstream bug to address this issue.
Dump in-flight ops.
[ceph: root@ncn-s003 /]# ceph daemon mds.cephfs.ncn-s003.ihwkop dump_ops_in_flight
Example output:
{
"ops": [],
"num_ops": 0
}
NOTE: The example above only shows how to run the command. Because no requests were stuck at the time, the ops list is empty. Recreating a stuck request to provide a full example is not easily done. This will be updated when the information is available.
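When the dump does contain operations, a small script can make the oldest ones and their last recorded event easier to spot. The following is a minimal sketch, not a definitive parser. It assumes each entry in ops carries description, age, and type_data.events fields, which may differ between Ceph releases. The two sample operations are hypothetical, written only to exercise the code.

```python
import json


def summarize_ops(dump_json):
    """Return one summary per in-flight op, oldest first.

    Field names other than "ops" are assumptions about the dump layout,
    so every lookup falls back to None when a field is missing.
    """
    dump = json.loads(dump_json)
    summaries = []
    for op in dump.get("ops", []):
        events = op.get("type_data", {}).get("events", [])
        last_event = events[-1].get("event") if events else None
        summaries.append({
            "description": op.get("description"),
            "age": op.get("age"),
            "last_event": last_event,
        })
    # Ops with the largest age have been waiting the longest.
    summaries.sort(key=lambda s: s["age"] or 0, reverse=True)
    return summaries


# Hypothetical sample; not captured from a real MDS.
SAMPLE_DUMP = """
{
  "ops": [
    {
      "description": "client_request(client.4305:12 getattr #0x10000000001)",
      "age": 4.2,
      "type_data": {"events": [
        {"event": "initiated"},
        {"event": "failed to rdlock, waiting"}
      ]}
    },
    {
      "description": "client_request(client.4311:7 create #0x10000000002/file)",
      "age": 35.9,
      "type_data": {"events": [
        {"event": "initiated"},
        {"event": "submit entry: journal_and_reply"}
      ]}
    }
  ],
  "num_ops": 2
}
"""

if __name__ == "__main__":
    for summary in summarize_ops(SAMPLE_DUMP):
        print(summary["age"], summary["last_event"], summary["description"])
```

To use it against a real MDS, the output of the dump_ops_in_flight command would first be saved to a file and its contents passed to summarize_ops.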
Identify the stuck commands and examine why they are stuck.
Usually the last “event” will have been an attempt to gather locks, or sending the operation off to the MDS log.
If the operation is waiting on the OSDs, resolve the OSD problem first.
If operations are stuck on a specific inode, a client is probably holding caps that prevent others from using it. This happens either because the client is trying to flush out dirty data or because of a bug in CephFS’ distributed file lock code (the file “capabilities” [“caps”] system).
If there are no slow requests reported on the MDS, and it is not reporting that clients are misbehaving, either the client has a problem or its requests are not reaching the MDS.
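The guidance above can be turned into a rough first-pass triage of an operation's last event. This is only a sketch: the keyword matching is an illustrative assumption, not a list of strings defined by Ceph, and the returned hints simply restate the guidance in this procedure. The function name triage_last_event is hypothetical.

```python
def triage_last_event(last_event):
    """Map the text of an op's last event to a coarse hint.

    The keywords ("lock", "journal", "log") are assumptions chosen for
    illustration; adjust them to the event strings seen on the cluster.
    """
    text = (last_event or "").lower()
    if "lock" in text:
        # Stuck gathering locks: look for a client holding caps on the inode.
        return "locks: check for a client holding caps on the inode"
    if "journal" in text or "log" in text:
        # Sent off to the MDS log: the log may be waiting on the OSDs.
        return "mds-log: check whether the OSDs are healthy"
    return "other: inspect the full event list manually"


if __name__ == "__main__":
    # Hypothetical event strings used only to exercise the function.
    for event in ("failed to rdlock, waiting",
                  "submit entry: journal_and_reply",
                  "initiated"):
        print(event, "->", triage_last_event(event))
```

A keyword match is only a starting point. The complete event list for the operation should still be read before acting on a hint.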