[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20250827-kraut-anekdote-35789fddbb0b@brauner>
Date: Wed, 27 Aug 2025 16:57:26 +0200
From: Christian Brauner <brauner@...nel.org>
To: Mateusz Guzik <mjguzik@...il.com>
Cc: Josef Bacik <josef@...icpanda.com>, linux-fsdevel@...r.kernel.org,
linux-btrfs@...r.kernel.org, kernel-team@...com, linux-ext4@...r.kernel.org,
linux-xfs@...r.kernel.org, viro@...iv.linux.org.uk, amir73il@...il.com
Subject: Re: [PATCH v2 03/54] fs: rework iput logic
On Wed, Aug 27, 2025 at 04:18:55PM +0200, Mateusz Guzik wrote:
> On Wed, Aug 27, 2025 at 02:58:51PM +0200, Mateusz Guzik wrote:
> > On Tue, Aug 26, 2025 at 11:39:03AM -0400, Josef Bacik wrote:
> > > Currently, if we are the last iput, and we have the I_DIRTY_TIME bit
> > > set, we will grab a reference on the inode again and then mark it dirty
> > > and then redo the put. This is to make sure we delay the time update
> > > for as long as possible.
> > >
> > > We can rework this logic to simply dec i_count if it is not 1, and if it
> > > is do the time update while still holding the i_count reference.
> > >
> > > Then we can replace the atomic_dec_and_lock with locking the ->i_lock
> > > and doing atomic_dec_and_test, since we did the atomic_add_unless above.
> > >
> > > Signed-off-by: Josef Bacik <josef@...icpanda.com>
> > > ---
> > > fs/inode.c | 23 ++++++++++++++---------
> > > 1 file changed, 14 insertions(+), 9 deletions(-)
> > >
> > > diff --git a/fs/inode.c b/fs/inode.c
> > > index a3673e1ed157..13e80b434323 100644
> > > --- a/fs/inode.c
> > > +++ b/fs/inode.c
> > > @@ -1911,16 +1911,21 @@ void iput(struct inode *inode)
> > > if (!inode)
> > > return;
> > > BUG_ON(inode->i_state & I_CLEAR);
> > > -retry:
> > > - if (atomic_dec_and_lock(&inode->i_count, &inode->i_lock)) {
> > > - if (inode->i_nlink && (inode->i_state & I_DIRTY_TIME)) {
> > > - atomic_inc(&inode->i_count);
> > > - spin_unlock(&inode->i_lock);
> > > - trace_writeback_lazytime_iput(inode);
> > > - mark_inode_dirty_sync(inode);
> > > - goto retry;
> > > - }
> > > +
> > > + if (atomic_add_unless(&inode->i_count, -1, 1))
> > > + return;
> > > +
> > > + if (inode->i_nlink && (inode->i_state & I_DIRTY_TIME)) {
> > > + trace_writeback_lazytime_iput(inode);
> > > + mark_inode_dirty_sync(inode);
> > > + }
> > > +
> > > + spin_lock(&inode->i_lock);
> > > + if (atomic_dec_and_test(&inode->i_count)) {
> > > + /* iput_final() drops i_lock */
> > > iput_final(inode);
> > > + } else {
> > > + spin_unlock(&inode->i_lock);
> > > }
> > > }
> > > EXPORT_SYMBOL(iput);
> > > --
> > > 2.49.0
> > >
> >
> > This changes semantics though.
> >
> > In the stock kernel the I_DIRTY_TIME business is guaranteed to be sorted
> > out before the call to iput_final().
> >
> > In principle the flag may reappear after mark_inode_dirty_sync() returns
> > and before the retried atomic_dec_and_lock succeeds, in which case it
> > will get cleared again.
> >
> > With your change the flag is only handled once and should it reappear
> > before you take the ->i_lock, it will stay there.
Yeah, good spotting.
> >
> > I agree the stock handling is pretty crap though.
> >
> > Your change should test the flag again after taking the spin lock but
> > before messing with the refcount and if need be unlock + retry.
> >
> > I would not hurt to assert in iput_final that the spin lock held and
> > that this flag is not set.
> >
> > Here is my diff to your diff to illustrate + a cosmetic change, not even
> > compile-tested:
> >
> > diff --git a/fs/inode.c b/fs/inode.c
> > index 421e248b690f..a9ae0c790b5d 100644
> > --- a/fs/inode.c
> > +++ b/fs/inode.c
> > @@ -1911,7 +1911,7 @@ void iput(struct inode *inode)
> > if (!inode)
> > return;
> > BUG_ON(inode->i_state & I_CLEAR);
> > -
> > +retry:
> > if (atomic_add_unless(&inode->i_count, -1, 1))
> > return;
> >
> > @@ -1921,12 +1921,19 @@ void iput(struct inode *inode)
> > }
> >
> > spin_lock(&inode->i_lock);
> > +
> > + if (inode->i_count == 1 && inode->i_nlink && (inode->i_state & I_DIRTY_TIME)) {
> > + spin_unlock(&inode->i_lock);
> > + goto retry;
> > + }
> > +
> > if (atomic_dec_and_test(&inode->i_count)) {
> > - /* iput_final() drops i_lock */
> > - iput_final(inode);
> > - } else {
> > spin_unlock(&inode->i_lock);
> > + return;
> > }
> > +
> > + /* iput_final() drops i_lock */
> > + iput_final(inode);
> > }
> > EXPORT_SYMBOL(iput);
> >
>
> Sorry for spam, but the more I look at this the more fucky the entire
> ordeal appears to me.
>
> Before I get to the crux, as a side note I did a quick check if atomics
> for i_count make any sense to begin with and I think they do, here is a
> sample output from a friend tracing the ref value on iput:
>
> bpftrace -e 'kprobe:iput /arg0 != 0/ { @[((struct inode *)arg0)->i_count.counter] = count(); }'
>
> @[5]: 66
> @[4]: 4625
> @[3]: 11086
> @[2]: 30937
> @[1]: 151785
>
> ... so plenty of non-last refs after all.
>
> I completely agree the mandatory ref trip to handle I_DIRTY_TIME is lame
> and needs to be addressed.
>
> But I'm uneasy about maintaining the invariant that iput_final() does
> not see the flag if i_nlink != 0 and my proposal as pasted is dodgy af
> on this front.
>
> While here some nits:
> 1. it makes sense to try mere atomics just in case someone else messed
> with the count between handling of the dirty flag and taking the spin lock
Which on mainline is a thing for sure.
> 2. according to my quick test with bpftrace the I_DIRTY_TIME flag is
> seen way less frequently than i_nlink != 0, so it makes sense to swap
> the order in which they are checked. Interested parties can try it out
> with:
> bpftrace -e 'kprobe:iput /arg0 != 0/ { @[((struct inode *)arg0)->i_nlink != 0, ((struct inode *)arg0)->i_state & (1 << 11)] = count(); }'
> 3. touch up the iput_final() unlock comment
>
> All that said, how about something like the thing below as the final
> routine building off of your change. I can't submit a proper patch and
> can't even compile-test. I don't need any credit should this get
> grabbed.
>
> void iput(struct inode *inode)
> {
> if (!inode)
> return;
> BUG_ON(inode->i_state & I_CLEAR);
> retry:
> if (atomic_add_unless(&inode->i_count, -1, 1))
> return;
>
> if ((inode->i_state & I_DIRTY_TIME) && inode->i_nlink) {
> trace_writeback_lazytime_iput(inode);
> mark_inode_dirty_sync(inode);
> goto retry;
> }
>
> spin_lock(&inode->i_lock);
> if ((inode->i_state & I_DIRTY_TIME) && inode->i_nlink) {
> spin_unlock(&inode->i_lock);
> goto retry;
> }
>
> if (!atomic_dec_and_test(&inode->i_count)) {
> spin_unlock(&inode->i_lock);
> return;
> }
>
> /*
> * iput_final() drops ->i_lock, we can't assert on it as the inode may
> * be deallocated by the time it returns
> */
> iput_final(inode);
> }
I've taken this. Though I had Josef convince me that the retry is sane
and doesn't end up stealing a ref. Thanks.
Powered by blists - more mailing lists