Message-ID: <896a69bcc26e1d808e6dfe2d35e25c8b9f8b7f02.camel@ibm.com>
Date: Mon, 5 Jan 2026 19:55:10 +0000
From: Viacheslav Dubeyko <Slava.Dubeyko@....com>
To: "lilei24@...ishou.com" <lilei24@...ishou.com>
CC: "idryomov@...il.com" <idryomov@...il.com>,
"sunzhao03@...ishou.com"
<sunzhao03@...ishou.com>,
"ceph-devel@...r.kernel.org"
<ceph-devel@...r.kernel.org>,
"linux-kernel@...r.kernel.org"
<linux-kernel@...r.kernel.org>,
Alex Markuze <amarkuze@...hat.com>
Subject: Re: Reply: [External Mail!] Re: [PATCH v2] ceph: fix deadlock in ceph_readdir_prepopulate
On Sun, 2026-01-04 at 09:19 +0000, 李磊 wrote:
> Hi Slava,
>
> I guess here is the deadlock scenario described by Zhao.
> 1. DIR A has a snapshot, and FILE X already exists in DIR A.
> 2. The client submits create FILE X and readdir DIR A requests almost simultaneously.
> 3. The two requests are handled by two different kworkers.
>
> kworker1:
> 1. Since FILE X already exists, ceph_get_inode() allocates an inode with the I_NEW flag set.
> 2. rinfo->snapblob_len != 0, so kworker1 blocks on down_write(&mdsc->snap_rwsem).
>
> kworker2:
> 1. Enter handle_reply().
> 2. Take mdsc->snap_rwsem for read.
> 3. Call ceph_readdir_prepopulate().
> 4. In ceph_readdir_prepopulate(), it iterates over all the ceph_mds_reply_dir_entry entries.
> 5. The ino of FILE X is found, and ceph_get_inode() is called.
> 6. However, the inode has I_NEW set, so kworker2 has to wait for I_NEW to be cleared.
>
> Please correct me if I missed something. Here is the timeline of the two kworkers:
This explanation sounds reasonable to me. Since you can share the call trace of
the issue, I assume that you have a reproducing script or application. Could you
please share this reproducer? I would like to try to reproduce the issue.
It would be great if you could resend the patch with all this nice explanation
in the commit message. Could you please resend the patch?
Thanks,
Slava.
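To make the lock ordering concrete, the interleaving described above can be
modelled in user space. The sketch below is not fs/ceph code: a pthread rwlock
stands in for mdsc->snap_rwsem, a mutex/condvar pair stands in for the inode's
I_NEW bit and wait_on_inode(), and the sleep() calls only force the same
interleaving as the timeline in the quoted message. Built with "cc -pthread",
both threads block forever, mirroring the two hung kworkers in the traces below.

/*
 * User-space model of the described ABBA ordering (not fs/ceph code):
 * a pthread rwlock stands in for mdsc->snap_rwsem, and a mutex/condvar
 * pair stands in for the inode's I_NEW bit and wait_on_inode().
 */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

static pthread_rwlock_t snap_rwsem = PTHREAD_RWLOCK_INITIALIZER;

static pthread_mutex_t i_new_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  i_new_cond = PTHREAD_COND_INITIALIZER;
static bool i_new;

/* "kworker1": handles the create reply. */
static void *kworker1(void *arg)
{
	(void)arg;

	pthread_mutex_lock(&i_new_lock);
	i_new = true;               /* ceph_get_inode(): inode allocated, I_NEW set */
	pthread_mutex_unlock(&i_new_lock);

	printf("kworker1: I_NEW set, waiting for snap_rwsem (write)\n");
	pthread_rwlock_wrlock(&snap_rwsem);  /* blocks: kworker2 holds it for read */

	/* Would fill the inode, clear I_NEW and wake waiters here. */
	pthread_rwlock_unlock(&snap_rwsem);
	return NULL;
}

/* "kworker2": handles the readdir reply. */
static void *kworker2(void *arg)
{
	(void)arg;

	pthread_rwlock_rdlock(&snap_rwsem);  /* down_read(&mdsc->snap_rwsem) */
	printf("kworker2: holding snap_rwsem (read)\n");

	sleep(2);   /* ceph_fill_trace() etc.; lets kworker1 set I_NEW first */

	printf("kworker2: readdir prepopulate finds the I_NEW inode, waiting\n");
	pthread_mutex_lock(&i_new_lock);
	while (i_new)                        /* wait_on_inode() via ilookup5() */
		pthread_cond_wait(&i_new_cond, &i_new_lock);
	pthread_mutex_unlock(&i_new_lock);

	pthread_rwlock_unlock(&snap_rwsem);
	return NULL;
}

int main(void)
{
	pthread_t t1, t2;

	pthread_create(&t2, NULL, kworker2, NULL);  /* grabs snap_rwsem first */
	sleep(1);
	pthread_create(&t1, NULL, kworker1, NULL);

	pthread_join(t1, NULL);                     /* never returns: ABBA deadlock */
	pthread_join(t2, NULL);
	return 0;
}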
>
>
> kworker1 [handle creating]              kworker2 [handle readdir]
>            |                                       |
>            v                                       v
>       handle_reply                            handle_reply
>            |                                       |
>            |                                       v
>            |                        down_read(&mdsc->snap_rwsem);
>            |                                       |
>            |                                       |
>            v                                       |
> /* opening an existing inode */                    |
>    ceph_get_inode (I_NEW)                          |
>            |                                       |
>            v                                       |
> rinfo->snapblob_len != 0;                          |
> down_write(&mdsc->snap_rwsem);                     |
>                                                    |
>                                                    |
>                                                    v
>                                              ceph_fill_trace
>                                                    |
>                                                    v
>                                        ceph_readdir_prepopulate
>                                                    |
>                                                    v
>                                              ceph_get_inode
>                                                    |
>                                                    v
>                                               iget5_locked
>                                                    |
>                                                    v
>                                                ilookup5
>                                                    |
>                                                    v
>                                  /* waiting for I_NEW to be cleared */
>                                              wait_on_inode
>
>
> Here are the call stacks:
> task:kworker/21:2 state:D stack: 0 pid:23053 ppid: 2 flags:0x00004000
> Workqueue: events delayed_work [ceph]
> Call Trace:
> __schedule+0x3a9/0x8d0
> schedule+0x49/0xb0
> schedule_preempt_disabled+0xa/0x10
> __mutex_lock.isra.11+0x354/0x430
> delayed_work+0x13b/0x210 [ceph]
> process_one_work+0x1cb/0x370
> worker_thread+0x30/0x390
> ? process_one_work+0x370/0x370
> kthread+0x13e/0x160
> ? set_kthread_struct+0x50/0x50
> ret_from_fork+0x1f/0x30
> task:kworker/u113:1 state:D stack: 0 pid:34454 ppid: 2 flags:0x00004000
> Workqueue: ceph-msgr ceph_con_workfn [libceph]
> Call Trace:
> __schedule+0x3a9/0x8d0
> schedule+0x49/0xb0
> rwsem_down_write_slowpath+0x30a/0x5e0
> handle_reply+0x4d7/0x7f0 [ceph]
> ? ceph_tcp_recvmsg+0x6f/0xa0 [libceph]
> mds_dispatch+0x10a/0x690 [ceph]
> ? calc_signature+0xdf/0x110 [libceph]
> ? ceph_x_check_message_signature+0x58/0xc0 [libceph]
> ceph_con_process_message+0x73/0x140 [libceph]
> ceph_con_v1_try_read+0x2f2/0x860 [libceph]
> ceph_con_workfn+0x31e/0x660 [libceph]
> process_one_work+0x1cb/0x370
> worker_thread+0x30/0x390
> ? process_one_work+0x370/0x370
> kthread+0x13e/0x160
> ? set_kthread_struct+0x50/0x50
> ret_from_fork+0x1f/0x30
> task:kworker/u113:2 state:D stack: 0 pid:54267 ppid: 2 flags:0x00004000
> Workqueue: ceph-msgr ceph_con_workfn [libceph]
> Call Trace:
> __schedule+0x3a9/0x8d0
> ? bit_wait_io+0x60/0x60
> ? bit_wait_io+0x60/0x60
> schedule+0x49/0xb0
> bit_wait+0xd/0x60
> __wait_on_bit+0x2a/0x90
> ? ceph_force_reconnect+0x90/0x90 [ceph]
> out_of_line_wait_on_bit+0x91/0xb0
> ? bitmap_empty+0x20/0x20
> ilookup5.part.29+0x69/0x90
> ? ceph_force_reconnect+0x90/0x90 [ceph]
> ? ceph_ino_compare+0x30/0x30 [ceph]
> iget5_locked+0x26/0x90
> ceph_get_inode+0x45/0x130 [ceph]
> ceph_readdir_prepopulate+0x59f/0xca0 [ceph]
> handle_reply+0x78d/0x7f0 [ceph]
> ? ceph_tcp_recvmsg+0x6f/0xa0 [libceph]
> mds_dispatch+0x10a/0x690 [ceph]
> ? calc_signature+0xdf/0x110 [libceph]
> ? ceph_x_check_message_signature+0x58/0xc0 [libceph]
> ceph_con_process_message+0x73/0x140 [libceph]
> ceph_con_v1_try_read+0x2f2/0x860 [libceph]
> ceph_con_workfn+0x31e/0x660 [libceph]
> process_one_work+0x1cb/0x370
> worker_thread+0x30/0x390
> ? process_one_work+0x370/0x370
> kthread+0x13e/0x160
> ? set_kthread_struct+0x50/0x50
> ret_from_fork+0x1f/0x30
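For what it's worth, a reproducer for this kind of race would presumably look
something like the sketch below: one process repeatedly re-creates FILE X in a
snapshotted DIR A while another repeatedly lists DIR A, so that create and
readdir replies race on the same client. The mount point, names and iteration
counts are made up for illustration, and whether the hang actually triggers
depends on the client's inode cache state and MDS reply timing; this is not
the reporter's reproducer.

/* Hypothetical reproducer sketch; paths and counts are illustrative only.
 * Prepare something like:
 *     mkdir -p /mnt/cephfs/dir_a && mkdir /mnt/cephfs/dir_a/.snap/snap1
 */
#include <dirent.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

#define DIR_A  "/mnt/cephfs/dir_a"      /* snapshotted directory (assumed path) */
#define FILE_X DIR_A "/file_x"

int main(void)
{
	pid_t pid = fork();

	if (pid < 0) {
		perror("fork");
		return 1;
	}

	if (pid == 0) {
		/* Child: keep issuing readdir requests against DIR A. */
		for (int i = 0; i < 100000; i++) {
			DIR *d = opendir(DIR_A);

			if (!d)
				continue;
			while (readdir(d))
				;
			closedir(d);
		}
		_exit(0);
	}

	/* Parent: keep (re-)creating FILE X so create replies race the readdirs. */
	for (int i = 0; i < 100000; i++) {
		int fd = open(FILE_X, O_CREAT | O_WRONLY, 0644);

		if (fd >= 0)
			close(fd);
		unlink(FILE_X);
	}

	waitpid(pid, NULL, 0);
	return 0;
}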
>
>
> On Fri, 2025-08-08 at 15:08 +0800, Zhao Sun wrote:
> > A deadlock can occur when ceph_get_inode is called outside of locks:
> >
> > 1) handle_reply calls ceph_get_inode, gets a new inode with I_NEW,
> > and blocks on mdsc->snap_rwsem for write.
> >
>
> Frankly speaking, it's hard to follow your logic. Which particular
> mdsc->snap_rwsem lock do you mean in handle_reply()?
>
> > 2) At the same time, ceph_readdir_prepopulate calls ceph_get_inode
> > for the same inode while holding mdsc->snap_rwsem for read,
> > and blocks on I_NEW.
> >
>
> The same here. Which particular mdsc->snap_rwsem lock do you mean in
> ceph_readdir_prepopulate()?
>
> > This causes an ABBA deadlock between mdsc->snap_rwsem and the I_NEW bit.
> >
> > The issue was introduced by commit bca9fc14c70f
> > ("ceph: when filling trace, call ceph_get_inode outside of mutexes")
> > which attempted to avoid a deadlock involving ceph_check_caps.
> >
> > That concern is now obsolete since commit 6a92b08fdad2
> > ("ceph: don't take s_mutex or snap_rwsem in ceph_check_caps")
> > which made ceph_check_caps fully lock-free.
> >
> > This patch primarily reverts bca9fc14c70f to resolve the new deadlock,
> > with a few minor adjustments to fit the current codebase.
> >
>
> I assume that you hit the issue. I believe it would be good to have an
> explanation of which use case/workload triggers the issue and which symptoms
> you see (system log content, for example).
>
> Thanks,
> Slava.
>
> > Link: https://tracker.ceph.com/issues/72307
> > Signed-off-by: Zhao Sun <sunzhao03@...ishou.com>
> > ---
> > fs/ceph/inode.c | 26 ++++++++++++++++++++++----
> > fs/ceph/mds_client.c | 29 -----------------------------
> > 2 files changed, 22 insertions(+), 33 deletions(-)
> >
> > diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
> > index 06cd2963e41e..d0f0035ee117 100644
> > --- a/fs/ceph/inode.c
> > +++ b/fs/ceph/inode.c
> > @@ -1623,10 +1623,28 @@ int ceph_fill_trace(struct super_block *sb, struct ceph_mds_request *req)
> > }
> >
> > if (rinfo->head->is_target) {
> > - /* Should be filled in by handle_reply */
> > - BUG_ON(!req->r_target_inode);
> > + in = xchg(&req->r_new_inode, NULL);
> > + tvino.ino = le64_to_cpu(rinfo->targeti.in->ino);
> > + tvino.snap = le64_to_cpu(rinfo->targeti.in->snapid);
> > +
> > + /*
> > + * If we ended up opening an existing inode, discard
> > + * r_new_inode
> > + */
> > + if (req->r_op == CEPH_MDS_OP_CREATE &&
> > + !req->r_reply_info.has_create_ino) {
> > + /* This should never happen on an async create */
> > + WARN_ON_ONCE(req->r_deleg_ino);
> > + iput(in);
> > + in = NULL;
> > + }
> > +
> > + in = ceph_get_inode(mdsc->fsc->sb, tvino, in);
> > + if (IS_ERR(in)) {
> > + err = PTR_ERR(in);
> > + goto done;
> > + }
> >
> > - in = req->r_target_inode;
> > err = ceph_fill_inode(in, req->r_locked_page, &rinfo->targeti,
> > NULL, session,
> > (!test_bit(CEPH_MDS_R_ABORTED, &req->r_req_flags) &&
> > @@ -1636,13 +1654,13 @@ int ceph_fill_trace(struct super_block *sb, struct ceph_mds_request *req)
> > if (err < 0) {
> > pr_err_client(cl, "badness %p %llx.%llx\n", in,
> > ceph_vinop(in));
> > - req->r_target_inode = NULL;
> > if (in->i_state & I_NEW)
> > discard_new_inode(in);
> > else
> > iput(in);
> > goto done;
> > }
> > + req->r_target_inode = in;
> > if (in->i_state & I_NEW)
> > unlock_new_inode(in);
> > }
> > diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
> > index 230e0c3f341f..8b70f2b96f46 100644
> > --- a/fs/ceph/mds_client.c
> > +++ b/fs/ceph/mds_client.c
> > @@ -3874,36 +3874,7 @@ static void handle_reply(struct ceph_mds_session *session, struct ceph_msg *msg)
> > session->s_con.peer_features);
> > mutex_unlock(&mdsc->mutex);
> >
> > - /* Must find target inode outside of mutexes to avoid deadlocks */
> > rinfo = &req->r_reply_info;
> > - if ((err >= 0) && rinfo->head->is_target) {
> > - struct inode *in = xchg(&req->r_new_inode, NULL);
> > - struct ceph_vino tvino = {
> > - .ino = le64_to_cpu(rinfo->targeti.in->ino),
> > - .snap = le64_to_cpu(rinfo->targeti.in->snapid)
> > - };
> > -
> > - /*
> > - * If we ended up opening an existing inode, discard
> > - * r_new_inode
> > - */
> > - if (req->r_op == CEPH_MDS_OP_CREATE &&
> > - !req->r_reply_info.has_create_ino) {
> > - /* This should never happen on an async create */
> > - WARN_ON_ONCE(req->r_deleg_ino);
> > - iput(in);
> > - in = NULL;
> > - }
> > -
> > - in = ceph_get_inode(mdsc->fsc->sb, tvino, in);
> > - if (IS_ERR(in)) {
> > - err = PTR_ERR(in);
> > - mutex_lock(&session->s_mutex);
> > - goto out_err;
> > - }
> > - req->r_target_inode = in;
> > - }
> > -
> > mutex_lock(&session->s_mutex);
> > if (err < 0) {
> > pr_err_client(cl, "got corrupt reply mds%d(tid:%lld)\n",
>