Message-ID: <5af9d165d05d4dddbb6c5d6d9d312367@kuaishou.com>
Date: Sun, 4 Jan 2026 09:19:17 +0000
From: 李磊 <lilei24@...ishou.com>
To: "Slava.Dubeyko@....com" <Slava.Dubeyko@....com>
CC: 孙朝 <sunzhao03@...ishou.com>,
	"ceph-devel@...r.kernel.org" <ceph-devel@...r.kernel.org>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	"amarkuze@...hat.com" <amarkuze@...hat.com>, "idryomov@...il.com"
	<idryomov@...il.com>
Subject: Re: [External Email!] Re: [PATCH v2] ceph: fix deadlock in ceph_readdir_prepopulate

Hi Slava,

I guess this is the deadlock scenario described by Zhao:
1. DIR A has a snapshot, and FILE X already exists in DIR A.
2. The client submits a create-FILE-X request and a readdir-DIR-A request almost simultaneously.
3. The two requests are handled by two different kworkers.

kworker1:
1. Since FILE X already exists, ceph_get_inode() returns an inode with the I_NEW flag set.
2. rinfo->snapblob_len != 0, so kworker1 blocks on down_write(&mdsc->snap_rwsem).

kworker2:
1. Enters handle_reply().
2. Takes mdsc->snap_rwsem for read.
3. Calls ceph_readdir_prepopulate().
4. In ceph_readdir_prepopulate(), it iterates over all the ceph_mds_reply_dir_entry entries.
5. The ino of FILE X is found, and ceph_get_inode() is called.
6. However, the inode has I_NEW set, so kworker2 has to wait for I_NEW to be cleared.

Please correct me if I missed something. Here is the timeline of the two kworkers; a small userspace sketch of the same interleaving follows the diagram:


kworker1 [handle create]                kworker2 [handle readdir]
    |                                       |
    v                                       v
handle_reply                            handle_reply
    |                                       |
    |                                       v
    |                                   down_read(&mdsc->snap_rwsem);
    |                                       |
    v                                       |
/* opening an existing inode */             |
ceph_get_inode (I_NEW)                      |
    |                                       |
    v                                       |
rinfo->snapblob_len != 0;                   |
down_write(&mdsc->snap_rwsem);              |
                                            |
                                            v
                                        ceph_fill_trace
                                            |
                                            v
                                        ceph_readdir_prepopulate
                                            |
                                            v
                                        ceph_get_inode
                                            |
                                            v
                                        iget5_locked
                                            |
                                            v
                                        ilookup5
                                            |
                                            v
                                        /* waits for I_NEW to be cleared */
                                        wait_on_inode
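
To make the interleaving easier to reproduce mentally, here is a minimal userspace sketch (my own analogy, not kernel code): a pthread_rwlock_t stands in for mdsc->snap_rwsem, and a mutex/condvar flag stands in for the I_NEW bit. The names and the sleeps are only there to force the ordering from the timeline above; built with `cc deadlock.c -pthread`, it hangs in the same ABBA pattern.

/*
 * Hypothetical userspace analogy, not kernel code:
 * pthread_rwlock_t plays the role of mdsc->snap_rwsem and the
 * inode_new flag plays the role of the I_NEW bit.
 */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

static pthread_rwlock_t snap_rwsem = PTHREAD_RWLOCK_INITIALIZER;
static pthread_mutex_t inode_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t inode_cv = PTHREAD_COND_INITIALIZER;
static bool inode_new;			/* analogue of I_NEW */

static void *creator(void *arg)		/* analogue of kworker1 */
{
	(void)arg;

	/* ceph_get_inode(): the inode is allocated and marked I_NEW */
	pthread_mutex_lock(&inode_lock);
	inode_new = true;
	pthread_mutex_unlock(&inode_lock);

	sleep(1);	/* give the readdir thread time to take the read lock */

	/* rinfo->snapblob_len != 0: take "snap_rwsem" for write;
	 * blocks forever because the readdir thread holds it for read */
	printf("creator: inode marked I_NEW, waiting for write lock\n");
	pthread_rwlock_wrlock(&snap_rwsem);

	/* only after that would I_NEW be cleared (unlock_new_inode()) */
	pthread_mutex_lock(&inode_lock);
	inode_new = false;
	pthread_cond_broadcast(&inode_cv);
	pthread_mutex_unlock(&inode_lock);

	pthread_rwlock_unlock(&snap_rwsem);
	return NULL;
}

static void *readdir_worker(void *arg)	/* analogue of kworker2 */
{
	(void)arg;

	/* handle_reply(): take "snap_rwsem" for read */
	pthread_rwlock_rdlock(&snap_rwsem);

	/* ceph_readdir_prepopulate() -> ceph_get_inode() -> wait_on_inode():
	 * wait for I_NEW to clear while still holding the read lock */
	printf("readdir: read lock held, waiting for I_NEW to clear\n");
	pthread_mutex_lock(&inode_lock);
	while (inode_new)
		pthread_cond_wait(&inode_cv, &inode_lock);	/* never signalled */
	pthread_mutex_unlock(&inode_lock);

	pthread_rwlock_unlock(&snap_rwsem);
	return NULL;
}

int main(void)
{
	pthread_t t1, t2;

	pthread_create(&t1, NULL, creator, NULL);
	usleep(100 * 1000);	/* let creator mark the "inode" I_NEW first */
	pthread_create(&t2, NULL, readdir_worker, NULL);

	pthread_join(t1, NULL);	/* never returns: the two threads deadlock */
	pthread_join(t2, NULL);
	return 0;
}

The sketch only mirrors the lock ordering; the real question is where ceph_get_inode() runs relative to snap_rwsem, which is what Zhao's revert below changes.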


Here are the call stacks:
task:kworker/21:2    state:D stack:    0 pid:23053 ppid:     2 flags:0x00004000
Workqueue: events delayed_work [ceph]
Call Trace:
 __schedule+0x3a9/0x8d0
 schedule+0x49/0xb0
 schedule_preempt_disabled+0xa/0x10
 __mutex_lock.isra.11+0x354/0x430
 delayed_work+0x13b/0x210 [ceph]
 process_one_work+0x1cb/0x370
 worker_thread+0x30/0x390
 ? process_one_work+0x370/0x370
 kthread+0x13e/0x160
 ? set_kthread_struct+0x50/0x50
 ret_from_fork+0x1f/0x30
task:kworker/u113:1  state:D stack:    0 pid:34454 ppid:     2 flags:0x00004000
Workqueue: ceph-msgr ceph_con_workfn [libceph]
Call Trace:
 __schedule+0x3a9/0x8d0
 schedule+0x49/0xb0
 rwsem_down_write_slowpath+0x30a/0x5e0
 handle_reply+0x4d7/0x7f0 [ceph]
 ? ceph_tcp_recvmsg+0x6f/0xa0 [libceph]
 mds_dispatch+0x10a/0x690 [ceph]
 ? calc_signature+0xdf/0x110 [libceph]
 ? ceph_x_check_message_signature+0x58/0xc0 [libceph]
 ceph_con_process_message+0x73/0x140 [libceph]
 ceph_con_v1_try_read+0x2f2/0x860 [libceph]
 ceph_con_workfn+0x31e/0x660 [libceph]
 process_one_work+0x1cb/0x370
 worker_thread+0x30/0x390
 ? process_one_work+0x370/0x370
 kthread+0x13e/0x160
 ? set_kthread_struct+0x50/0x50
 ret_from_fork+0x1f/0x30
task:kworker/u113:2  state:D stack:    0 pid:54267 ppid:     2 flags:0x00004000
Workqueue: ceph-msgr ceph_con_workfn [libceph]
Call Trace:
 __schedule+0x3a9/0x8d0
 ? bit_wait_io+0x60/0x60
 ? bit_wait_io+0x60/0x60
 schedule+0x49/0xb0
 bit_wait+0xd/0x60
 __wait_on_bit+0x2a/0x90
 ? ceph_force_reconnect+0x90/0x90 [ceph]
 out_of_line_wait_on_bit+0x91/0xb0
 ? bitmap_empty+0x20/0x20
 ilookup5.part.29+0x69/0x90
 ? ceph_force_reconnect+0x90/0x90 [ceph]
 ? ceph_ino_compare+0x30/0x30 [ceph]
 iget5_locked+0x26/0x90
 ceph_get_inode+0x45/0x130 [ceph]
 ceph_readdir_prepopulate+0x59f/0xca0 [ceph]
 handle_reply+0x78d/0x7f0 [ceph]
 ? ceph_tcp_recvmsg+0x6f/0xa0 [libceph]
 mds_dispatch+0x10a/0x690 [ceph]
 ? calc_signature+0xdf/0x110 [libceph]
 ? ceph_x_check_message_signature+0x58/0xc0 [libceph]
 ceph_con_process_message+0x73/0x140 [libceph]
 ceph_con_v1_try_read+0x2f2/0x860 [libceph]
 ceph_con_workfn+0x31e/0x660 [libceph]
 process_one_work+0x1cb/0x370
 worker_thread+0x30/0x390
 ? process_one_work+0x370/0x370
 kthread+0x13e/0x160
 ? set_kthread_struct+0x50/0x50
 ret_from_fork+0x1f/0x30


On Fri, 2025-08-08 at 15:08 +0800, Zhao Sun wrote:
> A deadlock can occur when ceph_get_inode is called outside of locks:
>
> 1) handle_reply calls ceph_get_inode, gets a new inode with I_NEW,
>    and blocks on mdsc->snap_rwsem for write.
>

Frankly speaking, it's hard to follow your logic. Which particular mdsc->snap_rwsem
lock do you mean in handle_reply()?

> 2) At the same time, ceph_readdir_prepopulate calls ceph_get_inode
>    for the same inode while holding mdsc->snap_rwsem for read,
>    and blocks on I_NEW.
>

The same here. Which particular mdsc->snap_rwsem lock do you mean in
ceph_readdir_prepopulate()?

> This causes an ABBA deadlock between mdsc->snap_rwsem and the I_NEW bit.
>
> The issue was introduced by commit bca9fc14c70f
> ("ceph: when filling trace, call ceph_get_inode outside of mutexes")
> which attempted to avoid a deadlock involving ceph_check_caps.
>
> That concern is now obsolete since commit 6a92b08fdad2
> ("ceph: don't take s_mutex or snap_rwsem in ceph_check_caps")
> which made ceph_check_caps fully lock-free.
>
> This patch primarily reverts bca9fc14c70f to resolve the new deadlock,
> with a few minor adjustments to fit the current codebase.
>

I assume that you hit the issue. I believe it would be good to have an
explanation of which use-case/workload triggers the issue and which symptoms you
see (the system log's content, for example).

Thanks,
Slava.

> Link: https://tracker.ceph.com/issues/72307
> Signed-off-by: Zhao Sun <sunzhao03@...ishou.com>
> ---
>  fs/ceph/inode.c      | 26 ++++++++++++++++++++++----
>  fs/ceph/mds_client.c | 29 -----------------------------
>  2 files changed, 22 insertions(+), 33 deletions(-)
>
> diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
> index 06cd2963e41e..d0f0035ee117 100644
> --- a/fs/ceph/inode.c
> +++ b/fs/ceph/inode.c
> @@ -1623,10 +1623,28 @@ int ceph_fill_trace(struct super_block *sb, struct ceph_mds_request *req)
>       }
>
>       if (rinfo->head->is_target) {
> -             /* Should be filled in by handle_reply */
> -             BUG_ON(!req->r_target_inode);
> +             in = xchg(&req->r_new_inode, NULL);
> +             tvino.ino = le64_to_cpu(rinfo->targeti.in->ino);
> +             tvino.snap = le64_to_cpu(rinfo->targeti.in->snapid);
> +
> +             /*
> +              * If we ended up opening an existing inode, discard
> +              * r_new_inode
> +              */
> +             if (req->r_op == CEPH_MDS_OP_CREATE &&
> +                 !req->r_reply_info.has_create_ino) {
> +                     /* This should never happen on an async create */
> +                     WARN_ON_ONCE(req->r_deleg_ino);
> +                     iput(in);
> +                     in = NULL;
> +             }
> +
> +             in = ceph_get_inode(mdsc->fsc->sb, tvino, in);
> +             if (IS_ERR(in)) {
> +                     err = PTR_ERR(in);
> +                     goto done;
> +             }
>
> -             in = req->r_target_inode;
>               err = ceph_fill_inode(in, req->r_locked_page, &rinfo->targeti,
>                               NULL, session,
>                               (!test_bit(CEPH_MDS_R_ABORTED, &req->r_req_flags) &&
> @@ -1636,13 +1654,13 @@ int ceph_fill_trace(struct super_block *sb, struct ceph_mds_request *req)
>               if (err < 0) {
>                       pr_err_client(cl, "badness %p %llx.%llx\n", in,
>                                     ceph_vinop(in));
> -                     req->r_target_inode = NULL;
>                       if (in->i_state & I_NEW)
>                               discard_new_inode(in);
>                       else
>                               iput(in);
>                       goto done;
>               }
> +             req->r_target_inode = in;
>               if (in->i_state & I_NEW)
>                       unlock_new_inode(in);
>       }
> diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
> index 230e0c3f341f..8b70f2b96f46 100644
> --- a/fs/ceph/mds_client.c
> +++ b/fs/ceph/mds_client.c
> @@ -3874,36 +3874,7 @@ static void handle_reply(struct ceph_mds_session *session, struct ceph_msg *msg)
>                                      session->s_con.peer_features);
>       mutex_unlock(&mdsc->mutex);
>
> -     /* Must find target inode outside of mutexes to avoid deadlocks */
>       rinfo = &req->r_reply_info;
> -     if ((err >= 0) && rinfo->head->is_target) {
> -             struct inode *in = xchg(&req->r_new_inode, NULL);
> -             struct ceph_vino tvino = {
> -                     .ino  = le64_to_cpu(rinfo->targeti.in->ino),
> -                     .snap = le64_to_cpu(rinfo->targeti.in->snapid)
> -             };
> -
> -             /*
> -              * If we ended up opening an existing inode, discard
> -              * r_new_inode
> -              */
> -             if (req->r_op == CEPH_MDS_OP_CREATE &&
> -                 !req->r_reply_info.has_create_ino) {
> -                     /* This should never happen on an async create */
> -                     WARN_ON_ONCE(req->r_deleg_ino);
> -                     iput(in);
> -                     in = NULL;
> -             }
> -
> -             in = ceph_get_inode(mdsc->fsc->sb, tvino, in);
> -             if (IS_ERR(in)) {
> -                     err = PTR_ERR(in);
> -                     mutex_lock(&session->s_mutex);
> -                     goto out_err;
> -             }
> -             req->r_target_inode = in;
> -     }
> -
>       mutex_lock(&session->s_mutex);
>       if (err < 0) {
>               pr_err_client(cl, "got corrupt reply mds%d(tid:%lld)\n",
