Troubleshooting Ceph MDS Reporting Slow Requests and Failure on Client

Use this procedure to troubleshoot Ceph MDS reporting slow requests after following the Identify Ceph Latency Issues procedure.

IMPORTANT: This procedure includes a mix of commands that must be run on the host(s) running the MDS daemon(s) and other commands that can be run from any of the ceph-mon nodes.

NOTICE: These steps are based on the upstream Ceph documentation.

Prerequisites

  • The Identify Ceph Latency Issues procedure has been completed.
  • This issue has been encountered and this page is being used as a reference for commands.
  • The correct version of the documentation for the running cluster is being used.

Procedure

  1. Identify the active MDS.

    ceph fs status -f json-pretty|jq -r '.mdsmap[]|select(.state=="active")|.name'
    
    cephfs.ncn-s003.ihwkop
    
  2. Use ssh to log in to the host running the active MDS, as shown in the example below.
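
    The name of the active MDS daemon reported in step 1 normally contains the hostname it runs on. For example, assuming the daemon name from step 1 was cephfs.ncn-s003.ihwkop, the daemon is running on ncn-s003:

    ssh ncn-s003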

  3. Enter into a cephadm shell.

    cephadm shell
    

    Example output:

    Inferring fsid 7350865a-0b21-11ec-b9fa-fa163e06c459
    Inferring config /var/lib/ceph/7350865a-0b21-11ec-b9fa-fa163e06c459/mon.ncn-s003/config
    Using recent ceph image arti.dev.cray.com/third-party-docker-stable-local/ceph/ceph@sha256:70536e31b29a4241999ec4fd13d93e5860a5ffdc5467911e57e6bf04dfe68337
    [ceph: root@ncn-s003 /]#
    

    NOTE Messages such as “WARNING: The same type, major and minor should not be used for multiple devices” can be ignored. An upstream bug has been filed to address this issue.

  4. (ceph#) Dump in-flight ops from the active MDS.

    1. Find the active MDS.

      export active_mds=$(ceph fs status -f json-pretty|jq -r '.mdsmap[]|select(.state=="active")|.name')
      echo $active_mds
      

      Example output:

      cephfs.ncn-s003.earesy
      
    2. Dump ops_in_flight.

      ceph daemon mds.$active_mds dump_ops_in_flight
      

      Example output:

      {
          "ops": [],
          "num_ops": 0
      }
      

      NOTE The example above shows how to run the command; in this case, there were no operations in flight. Recreating the exact scenario needed for a fully populated example is not easily done, so this output will be updated when that information is available.
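
      When ops is not empty, a jq filter such as the one below can be used to summarize each operation. This is only a sketch; the exact JSON layout of the dump (for example, the type_data.events list) may vary between Ceph releases.

      # Summarize each in-flight op: description, age, flag point, and last recorded event.
      ceph daemon mds.$active_mds dump_ops_in_flight | \
          jq '.ops[] | {description, age, flag_point: .type_data.flag_point, last_event: .type_data.events[-1].event}'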

General Steps from Upstream

  1. Identify the stuck commands and examine why they are stuck.

    1. Usually the last “event” will have been an attempt to gather locks or to send the operation off to the MDS log. The jq filter shown in step 4 of the procedure above is one way to extract this last event for each operation.

    2. If it is waiting on the OSDs, fix them.

    3. If operations are stuck on a specific inode, then there is likely a client holding caps that prevent others from using it (see the client session example after this list). This is caused by one of the following:

      1. The client is trying to flush out dirty data.

      2. There is a bug in CephFS’ distributed file lock code (the file “capabilities” [“caps”] system).

      IMPORTANT: If it is the result of a bug in the capabilities code, restarting the MDS is likely to resolve the problem (see the restart example after this list).

    4. If there are no slow requests reported on the MDS, and it is not reporting that clients are misbehaving, either the client has a problem or its requests are not reaching the MDS (see the client-side check after this list).
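
To see which clients hold sessions (and therefore caps) on the active MDS, list the sessions from inside the cephadm shell on the host running the active MDS, using the $active_mds variable exported earlier in this procedure. Field names in the output, such as num_caps and client_metadata, may vary between Ceph releases.

    # List all client sessions on the active MDS; each entry typically includes
    # the client ID, the number of caps it holds, and client metadata such as hostname.
    ceph daemon mds.$active_mds session ls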
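
If the MDS must be restarted because of a suspected bug in the capabilities code, one option on a cephadm-managed cluster is to restart the daemon through the orchestrator. The daemon name below is only an example taken from the output earlier in this procedure; substitute the name of the active MDS on the cluster.

    # Restart the active MDS daemon; if a standby MDS is configured, it will take over during the restart.
    ceph orch daemon restart mds.cephfs.ncn-s003.ihwkop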
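
If the problem appears to be on the client side and the client uses the kernel CephFS driver, the requests that have not yet been answered by the MDS can be inspected on the client node through debugfs. This assumes debugfs is mounted at /sys/kernel/debug and that at least one kernel CephFS mount exists on that node.

    # On the client node: show CephFS requests still waiting on an MDS reply.
    cat /sys/kernel/debug/ceph/*/mdsc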