[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAGudoHHiH+2+LdQGBs8cS4Hr6sDWk6diEG+JQ7HMQbWdiNtKAA@mail.gmail.com>
Date: Wed, 17 Sep 2025 15:22:42 +0200
From: Mateusz Guzik <mjguzik@...il.com>
To: Max Kellermann <max.kellermann@...os.com>
Cc: slava.dubeyko@....com, xiubli@...hat.com, idryomov@...il.com,
amarkuze@...hat.com, ceph-devel@...r.kernel.org, netfs@...ts.linux.dev,
linux-kernel@...r.kernel.org, linux-fsdevel@...r.kernel.org,
stable@...r.kernel.org
Subject: Re: [PATCH] ceph: fix deadlock bugs by making iput() calls asynchronous
On Wed, Sep 17, 2025 at 3:13 PM Mateusz Guzik <mjguzik@...il.com> wrote:
>
> On Wed, Sep 17, 2025 at 2:44 PM Max Kellermann <max.kellermann@...os.com> wrote:
>
> I don't know about ceph internals, so no comment on that front.
>
> > +/**
> > + * Queue an asynchronous iput() call in a worker thread. Use this
> > + * instead of iput() in contexts where evicting the inode is unsafe.
> > + * For example, inode eviction may cause deadlocks in
> > + * inode_wait_for_writeback() (when called from within writeback) or
> > + * in netfs_wait_for_outstanding_io() (when called from within the
> > + * Ceph messenger).
> > + *
> > + * @n: how many references to put
> > + */
> > +void ceph_iput_n_async(struct inode *inode, int n)
> > +{
> > + if (unlikely(!inode))
> > + return;
> > +
> > + if (likely(atomic_sub_return(n, &inode->i_count) > 0))
> > + /* somebody else is holding another reference -
> > + * nothing left to do for us
> > + */
> > + return;
> > +
> > + doutc(ceph_inode_to_fs_client(inode)->client, "%p %llx.%llx\n", inode, ceph_vinop(inode));
> > +
> > + /* the reference counter is now 0, i.e. nobody else is holding
> > + * a reference to this inode; restore it to 1 and donate it to
> > + * ceph_inode_work() which will call iput() at the end
> > + */
> > + atomic_set(&inode->i_count, 1);
> > +
>
> That loop over iput() indeed asks for a variant which grabs an
> explicit count to subtract.
>
> However, you cannot legally transition to 0 without ->i_lock held. By
> API contract the ->drop_inode routine needs to be called when you get
> here and other CPUs are prevented from refing the inode.
>
> While it is true nobody *refs* the inode, it is still hanging out on
> the superblock list where it can get picked up by forced unmount and
> on the inode hash where it can get picked up by lookup. With a
> refcount of 0, ->i_lock not held and no flags added, from their POV
> this is a legally cached inode they can do whatever they want with.
>
> So that force setting of refcount to 1 might be a use-after-free if
> this raced against another iput or it might be losing a reference
> picked up by someone else.
>
> If you got the idea to bring back one frem from iput() in the stock kernel:
>
> if (atomic_dec_and_lock(&inode->i_count, &inode->i_lock)) {
> if (inode->i_nlink && (inode->i_state & I_DIRTY_TIME)) {
> atomic_inc(&inode->i_count);
>
> Note this guy still makes sure to take the lock first. As a side note
> this weird deref to 0 + ref back to 1 business is going away [1].
>
> I don't know what's the handy Linux way to sub an arbitrary amount as
> long as the target is not x, I guess worst case one can just write a
> cmpxchg loop by hand.
>
> Given that this is a reliability fix I would forego optimizations of the sort.
>
> Does the patch convert literally all iput calls within ceph into the
> async variant? I would be worried that mandatory deferral of literally
> all final iputs may be a regression from perf standpoint.
>
> I see you are mentioning another deadlock, perhaps being in danger of
> deadlocking is something you could track with a flag within ceph (just
> like it happens for writeback)? Then the local iput variant could
> check on both. I have no idea if this is feasible at all for the netfs
> thing.
>
> No matter what though, it looks like the way forward concerning
> ->i_count is to make sure it does not drop to 0 within the new
> primitive.
>
That is to say the routine async routine should start with:
if (atomic_add_unless(&inode->i_count, -1, 1))
return;
/* defer to iput here */
this is copy pasted, no credit needed for anyone
As you can see there is some work going on concerning these routines,
I would wager that loop over iput in writeback will go away in
mainline after the dust settles ;)
> [1] https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git/commit/?h=vfs-6.18.inode.refcount.preliminaries&id=9e70e985bdc2[1[1c6fe7a160e4d59ddd7c0a39bc077
>
> > + /* simply queue a ceph_inode_work() without setting
> > + * i_work_mask bit; other than putting the reference, there is
> > + * nothing to do
> > + */
> > + WARN_ON_ONCE(!queue_work(ceph_inode_to_fs_client(inode)->inode_wq,
> > + &ceph_inode(inode)->i_work));
> > +
> > + /* note: queue_work() cannot fail; it i_work were already
> > + * queued, then it would be holding another reference, but no
> > + * such reference exists
> > + */
> > +}
> > +
Powered by blists - more mailing lists