Message-ID: <832b26f3004b404b0c4a4474a26a02a260c71528.camel@ibm.com>
Date: Wed, 17 Sep 2025 17:55:47 +0000
From: Viacheslav Dubeyko <Slava.Dubeyko@....com>
To: "ceph-devel@...r.kernel.org" <ceph-devel@...r.kernel.org>,
"max.kellermann@...os.com" <max.kellermann@...os.com>,
Xiubo Li
<xiubli@...hat.com>,
"idryomov@...il.com" <idryomov@...il.com>,
"netfs@...ts.linux.dev" <netfs@...ts.linux.dev>,
Alex Markuze
<amarkuze@...hat.com>,
"linux-kernel@...r.kernel.org"
<linux-kernel@...r.kernel.org>
CC: "stable@...r.kernel.org" <stable@...r.kernel.org>,
"linux-fsdevel@...r.kernel.org" <linux-fsdevel@...r.kernel.org>,
"mjguzik@...il.com" <mjguzik@...il.com>
Subject: Re: [PATCH v2] ceph: fix deadlock bugs by making iput() calls
asynchronous
On Wed, 2025-09-17 at 15:59 +0200, Max Kellermann wrote:
> The iput() function is a dangerous one - if the reference counter goes
> to zero, the function may block for a long time due to:
>
> - inode_wait_for_writeback() waits until writeback on this inode
> completes
>
> - the filesystem-specific "evict_inode" callback can do similar
> things; e.g. all netfs-based filesystems will call
> netfs_wait_for_outstanding_io() which is similar to
> inode_wait_for_writeback()
>
> Therefore, callers must carefully evaluate the context they're in and
> check whether invoking iput() is a good idea at all.
>
> Most of the time, this is not a problem because the dcache holds
> references to all inodes, and the dcache is usually the one to release
> the last reference. But this assumption is fragile. For example,
> under (memcg) memory pressure, the dcache shrinker is more likely to
> release inode references, moving the inode eviction to contexts where
> that was extremely unlikely to occur.
>
> Our production servers "found" at least two deadlock bugs in the Ceph
> filesystem that were caused by this iput() behavior:
>
It sounds like you can reproduce the issue easily. What is the reproduction
path of the issue? How can we reproduce it ourselves to test the patch?
> 1. Writeback may lead to iput() calls in Ceph (e.g. from
> ceph_put_wrbuffer_cap_refs()) which deadlocks in
> inode_wait_for_writeback(). Waiting for writeback completion from
> within writeback will obviously never be able to make any progress.
> This leads to blocked kworkers like this:
>
> INFO: task kworker/u777:6:1270802 blocked for more than 122 seconds.
> Not tainted 6.16.7-i1-es #773
> task:kworker/u777:6 state:D stack:0 pid:1270802 tgid:1270802 ppid:2
> task_flags:0x4208060 flags:0x00004000
> Workqueue: writeback wb_workfn (flush-ceph-3)
> Call Trace:
> <TASK>
> __schedule+0x4ea/0x17d0
> schedule+0x1c/0xc0
> inode_wait_for_writeback+0x71/0xb0
> evict+0xcf/0x200
> ceph_put_wrbuffer_cap_refs+0xdd/0x220
> ceph_invalidate_folio+0x97/0xc0
> ceph_writepages_start+0x127b/0x14d0
> do_writepages+0xba/0x150
> __writeback_single_inode+0x34/0x290
> writeback_sb_inodes+0x203/0x470
> __writeback_inodes_wb+0x4c/0xe0
> wb_writeback+0x189/0x2b0
> wb_workfn+0x30b/0x3d0
> process_one_work+0x143/0x2b0
> worker_thread+0x30a/0x450
>
> 2. In the Ceph messenger thread (net/ceph/messenger*.c), any iput()
> call may invoke ceph_evict_inode() which will deadlock in
> netfs_wait_for_outstanding_io(); since this blocks the messenger
> thread, completions from the Ceph servers will not ever be received
> and handled.
>
> It looks like these deadlock bugs have been in the Ceph filesystem
> code since forever (therefore no "Fixes" tag in this patch). There
> may be various ways to solve this:
>
> - make iput() asynchronous and defer the actual eviction like fput()
> (may add overhead)
>
> - make iput() only asynchronous if I_SYNC is set (doesn't solve random
> things happening inside the "evict_inode" callback)
>
> - add iput_deferred() to make this asynchronous behavior/overhead
> optional and explicit
>
> - refactor Ceph to avoid iput() calls from within writeback and
> messenger (if that is even possible)
>
> - add a Ceph-specific workaround
>
> After advice from Mateusz Guzik, I decided to do the latter. The
> implementation is simple because it piggybacks on the existing
> work_struct for ceph_queue_inode_work() - ceph_inode_work() calls
> iput() at the end which means we can donate the last reference to it.
>
> This patch adds ceph_iput_async() and converts lots of iput() calls to
> it - at least those that may come through writeback and the messenger.
>
> Signed-off-by: Max Kellermann <max.kellermann@...os.com>
> Cc: Mateusz Guzik <mjguzik@...il.com>
> Cc: stable@...r.kernel.org
> ---
>
> v1->v2: using atomic_add_unless() instead of atomic_add_unless() to
> avoid letting i_count drop to zero which may cause races (thanks
> Mateusz Guzik)
>
> Signed-off-by: Max Kellermann <max.kellermann@...os.com>
> ---
> fs/ceph/addr.c | 2 +-
> fs/ceph/caps.c | 16 ++++++++--------
> fs/ceph/dir.c | 2 +-
> fs/ceph/inode.c | 34 ++++++++++++++++++++++++++++++++++
> fs/ceph/mds_client.c | 30 +++++++++++++++---------------
> fs/ceph/quota.c | 4 ++--
> fs/ceph/snap.c | 10 +++++-----
> fs/ceph/super.h | 2 ++
> 8 files changed, 68 insertions(+), 32 deletions(-)
>
> diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
> index 322ed268f14a..fc497c91530e 100644
> --- a/fs/ceph/addr.c
> +++ b/fs/ceph/addr.c
> @@ -265,7 +265,7 @@ static void finish_netfs_read(struct ceph_osd_request *req)
> subreq->error = err;
> trace_netfs_sreq(subreq, netfs_sreq_trace_io_progress);
> netfs_read_subreq_terminated(subreq);
> - iput(req->r_inode);
> + ceph_iput_async(req->r_inode);
> ceph_dec_osd_stopping_blocker(fsc->mdsc);
> }
>
> diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
> index b1a8ff612c41..af9e3ae9ab7e 100644
> --- a/fs/ceph/caps.c
> +++ b/fs/ceph/caps.c
> @@ -1771,7 +1771,7 @@ void ceph_flush_snaps(struct ceph_inode_info *ci,
> spin_unlock(&mdsc->snap_flush_lock);
>
> if (need_put)
> - iput(inode);
> + ceph_iput_async(inode);
> }
>
> /*
> @@ -3319,7 +3319,7 @@ static void __ceph_put_cap_refs(struct ceph_inode_info *ci, int had,
> if (wake)
> wake_up_all(&ci->i_cap_wq);
> while (put-- > 0)
> - iput(inode);
> + ceph_iput_async(inode);
> }
>
> void ceph_put_cap_refs(struct ceph_inode_info *ci, int had)
> @@ -3419,7 +3419,7 @@ void ceph_put_wrbuffer_cap_refs(struct ceph_inode_info *ci, int nr,
> if (complete_capsnap)
> wake_up_all(&ci->i_cap_wq);
> while (put-- > 0) {
> - iput(inode);
> + ceph_iput_async(inode);
> }
> }
>
> @@ -3917,7 +3917,7 @@ static void handle_cap_flush_ack(struct inode *inode, u64 flush_tid,
> if (wake_mdsc)
> wake_up_all(&mdsc->cap_flushing_wq);
> if (drop)
> - iput(inode);
> + ceph_iput_async(inode);
> }
>
> void __ceph_remove_capsnap(struct inode *inode, struct ceph_cap_snap *capsnap,
> @@ -4008,7 +4008,7 @@ static void handle_cap_flushsnap_ack(struct inode *inode, u64 flush_tid,
> wake_up_all(&ci->i_cap_wq);
> if (wake_mdsc)
> wake_up_all(&mdsc->cap_flushing_wq);
> - iput(inode);
> + ceph_iput_async(inode);
> }
> }
>
> @@ -4557,7 +4557,7 @@ void ceph_handle_caps(struct ceph_mds_session *session,
> done:
> mutex_unlock(&session->s_mutex);
> done_unlocked:
> - iput(inode);
> + ceph_iput_async(inode);
> out:
> ceph_dec_mds_stopping_blocker(mdsc);
>
> @@ -4636,7 +4636,7 @@ unsigned long ceph_check_delayed_caps(struct ceph_mds_client *mdsc)
> doutc(cl, "on %p %llx.%llx\n", inode,
> ceph_vinop(inode));
> ceph_check_caps(ci, 0);
> - iput(inode);
> + ceph_iput_async(inode);
> spin_lock(&mdsc->cap_delay_lock);
> }
>
> @@ -4675,7 +4675,7 @@ static void flush_dirty_session_caps(struct ceph_mds_session *s)
> spin_unlock(&mdsc->cap_dirty_lock);
> ceph_wait_on_async_create(inode);
> ceph_check_caps(ci, CHECK_CAPS_FLUSH);
> - iput(inode);
> + ceph_iput_async(inode);
> spin_lock(&mdsc->cap_dirty_lock);
> }
> spin_unlock(&mdsc->cap_dirty_lock);
> diff --git a/fs/ceph/dir.c b/fs/ceph/dir.c
> index 32973c62c1a2..ec73ed52a227 100644
> --- a/fs/ceph/dir.c
> +++ b/fs/ceph/dir.c
> @@ -1290,7 +1290,7 @@ static void ceph_async_unlink_cb(struct ceph_mds_client *mdsc,
> ceph_mdsc_free_path_info(&path_info);
> }
> out:
> - iput(req->r_old_inode);
> + ceph_iput_async(req->r_old_inode);
> ceph_mdsc_release_dir_caps(req);
> }
>
> diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
> index f67025465de0..d7c0ed82bf62 100644
> --- a/fs/ceph/inode.c
> +++ b/fs/ceph/inode.c
> @@ -2191,6 +2191,40 @@ void ceph_queue_inode_work(struct inode *inode, int work_bit)
> }
> }
>
> +/**
> + * Queue an asynchronous iput() call in a worker thread. Use this
> + * instead of iput() in contexts where evicting the inode is unsafe.
> + * For example, inode eviction may cause deadlocks in
> + * inode_wait_for_writeback() (when called from within writeback) or
> + * in netfs_wait_for_outstanding_io() (when called from within the
> + * Ceph messenger).
> + */
> +void ceph_iput_async(struct inode *inode)
> +{
> + if (unlikely(!inode))
> + return;
> +
> + if (likely(atomic_add_unless(&inode->i_count, -1, 1)))
> + /* somebody else is holding another reference -
> + * nothing left to do for us
> + */
> + return;
> +
> + doutc(ceph_inode_to_fs_client(inode)->client, "%p %llx.%llx\n", inode, ceph_vinop(inode));
What about this?
struct ceph_fs_client *fsc = ceph_inode_to_fs_client(inode);
doutc(fsc->client, "%p %llx.%llx\n", inode, ceph_vinop(inode));
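I.e., the whole function could then look like this (untested, just to
illustrate the idea; the same local can also be reused for the queue_work()
call below instead of calling ceph_inode_to_fs_client() a second time):

void ceph_iput_async(struct inode *inode)
{
	struct ceph_fs_client *fsc;

	if (unlikely(!inode))
		return;

	/* somebody else is holding another reference - nothing left to do */
	if (likely(atomic_add_unless(&inode->i_count, -1, 1)))
		return;

	fsc = ceph_inode_to_fs_client(inode);
	doutc(fsc->client, "%p %llx.%llx\n", inode, ceph_vinop(inode));

	/* donate the remaining reference to ceph_inode_work() */
	WARN_ON_ONCE(!queue_work(fsc->inode_wq, &ceph_inode(inode)->i_work));
}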
> +
> + /* simply queue a ceph_inode_work() (donating the remaining
> + * reference) without setting i_work_mask bit; other than
> + * putting the reference, there is nothing to do
> + */
> + WARN_ON_ONCE(!queue_work(ceph_inode_to_fs_client(inode)->inode_wq,
> + &ceph_inode(inode)->i_work));
This function looks very similar to ceph_queue_inode_work() [1]. Can we reuse
ceph_queue_inode_work() here instead?
[1] https://elixir.bootlin.com/linux/v6.17-rc6/source/fs/ceph/inode.c#L2176
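E.g. a very rough sketch of what I mean (untested; CEPH_I_WORK_IPUT would be a
new, hypothetical work bit that does not exist today, and ceph_inode_work()
would then have to drop one extra reference when that bit is set, since
ceph_queue_inode_work() takes its own reference via ihold()):

void ceph_iput_async(struct inode *inode)
{
	if (unlikely(!inode))
		return;

	/* fast path: somebody else still holds a reference */
	if (likely(atomic_add_unless(&inode->i_count, -1, 1)))
		return;

	/* slow path: defer dropping the last reference to the inode work queue */
	ceph_queue_inode_work(inode, CEPH_I_WORK_IPUT);
}

Maybe that is not worth it compared to the dedicated helper, but it would keep
all of the inode work queueing logic in one place.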
> +
> + /* note: queue_work() cannot fail; it i_work were already
Do you mean "if i_work were already"?

Thanks,
Slava.
> + * queued, then it would be holding another reference, but no
> + * such reference exists
> + */
> +}
> +
> static void ceph_do_invalidate_pages(struct inode *inode)
> {
> struct ceph_client *cl = ceph_inode_to_client(inode);
> diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
> index 3bc72b47fe4d..22d75c3be5a8 100644
> --- a/fs/ceph/mds_client.c
> +++ b/fs/ceph/mds_client.c
> @@ -1097,14 +1097,14 @@ void ceph_mdsc_release_request(struct kref *kref)
> ceph_msg_put(req->r_reply);
> if (req->r_inode) {
> ceph_put_cap_refs(ceph_inode(req->r_inode), CEPH_CAP_PIN);
> - iput(req->r_inode);
> + ceph_iput_async(req->r_inode);
> }
> if (req->r_parent) {
> ceph_put_cap_refs(ceph_inode(req->r_parent), CEPH_CAP_PIN);
> - iput(req->r_parent);
> + ceph_iput_async(req->r_parent);
> }
> - iput(req->r_target_inode);
> - iput(req->r_new_inode);
> + ceph_iput_async(req->r_target_inode);
> + ceph_iput_async(req->r_new_inode);
> if (req->r_dentry)
> dput(req->r_dentry);
> if (req->r_old_dentry)
> @@ -1118,7 +1118,7 @@ void ceph_mdsc_release_request(struct kref *kref)
> */
> ceph_put_cap_refs(ceph_inode(req->r_old_dentry_dir),
> CEPH_CAP_PIN);
> - iput(req->r_old_dentry_dir);
> + ceph_iput_async(req->r_old_dentry_dir);
> }
> kfree(req->r_path1);
> kfree(req->r_path2);
> @@ -1240,7 +1240,7 @@ static void __unregister_request(struct ceph_mds_client *mdsc,
> }
>
> if (req->r_unsafe_dir) {
> - iput(req->r_unsafe_dir);
> + ceph_iput_async(req->r_unsafe_dir);
> req->r_unsafe_dir = NULL;
> }
>
> @@ -1413,7 +1413,7 @@ static int __choose_mds(struct ceph_mds_client *mdsc,
> cap = rb_entry(rb_first(&ci->i_caps), struct ceph_cap, ci_node);
> if (!cap) {
> spin_unlock(&ci->i_ceph_lock);
> - iput(inode);
> + ceph_iput_async(inode);
> goto random;
> }
> mds = cap->session->s_mds;
> @@ -1422,7 +1422,7 @@ static int __choose_mds(struct ceph_mds_client *mdsc,
> cap == ci->i_auth_cap ? "auth " : "", cap);
> spin_unlock(&ci->i_ceph_lock);
> out:
> - iput(inode);
> + ceph_iput_async(inode);
> return mds;
>
> random:
> @@ -1841,7 +1841,7 @@ int ceph_iterate_session_caps(struct ceph_mds_session *session,
> spin_unlock(&session->s_cap_lock);
>
> if (last_inode) {
> - iput(last_inode);
> + ceph_iput_async(last_inode);
> last_inode = NULL;
> }
> if (old_cap) {
> @@ -1874,7 +1874,7 @@ int ceph_iterate_session_caps(struct ceph_mds_session *session,
> session->s_cap_iterator = NULL;
> spin_unlock(&session->s_cap_lock);
>
> - iput(last_inode);
> + ceph_iput_async(last_inode);
> if (old_cap)
> ceph_put_cap(session->s_mdsc, old_cap);
>
> @@ -1904,7 +1904,7 @@ static int remove_session_caps_cb(struct inode *inode, int mds, void *arg)
> if (invalidate)
> ceph_queue_invalidate(inode);
> while (iputs--)
> - iput(inode);
> + ceph_iput_async(inode);
> return 0;
> }
>
> @@ -1944,7 +1944,7 @@ static void remove_session_caps(struct ceph_mds_session *session)
> spin_unlock(&session->s_cap_lock);
>
> inode = ceph_find_inode(sb, vino);
> - iput(inode);
> + ceph_iput_async(inode);
>
> spin_lock(&session->s_cap_lock);
> }
> @@ -2512,7 +2512,7 @@ static void ceph_cap_unlink_work(struct work_struct *work)
> doutc(cl, "on %p %llx.%llx\n", inode,
> ceph_vinop(inode));
> ceph_check_caps(ci, CHECK_CAPS_FLUSH);
> - iput(inode);
> + ceph_iput_async(inode);
> spin_lock(&mdsc->cap_delay_lock);
> }
> }
> @@ -3933,7 +3933,7 @@ static void handle_reply(struct ceph_mds_session *session, struct ceph_msg *msg)
> !req->r_reply_info.has_create_ino) {
> /* This should never happen on an async create */
> WARN_ON_ONCE(req->r_deleg_ino);
> - iput(in);
> + ceph_iput_async(in);
> in = NULL;
> }
>
> @@ -5313,7 +5313,7 @@ static void handle_lease(struct ceph_mds_client *mdsc,
>
> out:
> mutex_unlock(&session->s_mutex);
> - iput(inode);
> + ceph_iput_async(inode);
>
> ceph_dec_mds_stopping_blocker(mdsc);
> return;
> diff --git a/fs/ceph/quota.c b/fs/ceph/quota.c
> index d90eda19bcc4..bba00f8926e6 100644
> --- a/fs/ceph/quota.c
> +++ b/fs/ceph/quota.c
> @@ -76,7 +76,7 @@ void ceph_handle_quota(struct ceph_mds_client *mdsc,
> le64_to_cpu(h->max_files));
> spin_unlock(&ci->i_ceph_lock);
>
> - iput(inode);
> + ceph_iput_async(inode);
> out:
> ceph_dec_mds_stopping_blocker(mdsc);
> }
> @@ -190,7 +190,7 @@ void ceph_cleanup_quotarealms_inodes(struct ceph_mds_client *mdsc)
> node = rb_first(&mdsc->quotarealms_inodes);
> qri = rb_entry(node, struct ceph_quotarealm_inode, node);
> rb_erase(node, &mdsc->quotarealms_inodes);
> - iput(qri->inode);
> + ceph_iput_async(qri->inode);
> kfree(qri);
> }
> mutex_unlock(&mdsc->quotarealms_inodes_mutex);
> diff --git a/fs/ceph/snap.c b/fs/ceph/snap.c
> index c65f2b202b2b..19f097e79b3c 100644
> --- a/fs/ceph/snap.c
> +++ b/fs/ceph/snap.c
> @@ -735,7 +735,7 @@ static void queue_realm_cap_snaps(struct ceph_mds_client *mdsc,
> if (!inode)
> continue;
> spin_unlock(&realm->inodes_with_caps_lock);
> - iput(lastinode);
> + ceph_iput_async(lastinode);
> lastinode = inode;
>
> /*
> @@ -762,7 +762,7 @@ static void queue_realm_cap_snaps(struct ceph_mds_client *mdsc,
> spin_lock(&realm->inodes_with_caps_lock);
> }
> spin_unlock(&realm->inodes_with_caps_lock);
> - iput(lastinode);
> + ceph_iput_async(lastinode);
>
> if (capsnap)
> kmem_cache_free(ceph_cap_snap_cachep, capsnap);
> @@ -955,7 +955,7 @@ static void flush_snaps(struct ceph_mds_client *mdsc)
> ihold(inode);
> spin_unlock(&mdsc->snap_flush_lock);
> ceph_flush_snaps(ci, &session);
> - iput(inode);
> + ceph_iput_async(inode);
> spin_lock(&mdsc->snap_flush_lock);
> }
> spin_unlock(&mdsc->snap_flush_lock);
> @@ -1116,12 +1116,12 @@ void ceph_handle_snap(struct ceph_mds_client *mdsc,
> ceph_get_snap_realm(mdsc, realm);
> ceph_change_snap_realm(inode, realm);
> spin_unlock(&ci->i_ceph_lock);
> - iput(inode);
> + ceph_iput_async(inode);
> continue;
>
> skip_inode:
> spin_unlock(&ci->i_ceph_lock);
> - iput(inode);
> + ceph_iput_async(inode);
> }
>
> /* we may have taken some of the old realm's children. */
> diff --git a/fs/ceph/super.h b/fs/ceph/super.h
> index cf176aab0f82..361a72a67bb8 100644
> --- a/fs/ceph/super.h
> +++ b/fs/ceph/super.h
> @@ -1085,6 +1085,8 @@ static inline void ceph_queue_flush_snaps(struct inode *inode)
> ceph_queue_inode_work(inode, CEPH_I_WORK_FLUSH_SNAPS);
> }
>
> +void ceph_iput_async(struct inode *inode);
> +
> extern int ceph_try_to_choose_auth_mds(struct inode *inode, int mask);
> extern int __ceph_do_getattr(struct inode *inode, struct page *locked_page,
> int mask, bool force);