linux-ext4 - Re: [PATCH v2 03/54] fs: rework iput logic

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <n6z2jkdgmgm2xfxc7y3a2a7psnkeboziffkt6bjoggrff4dlxe@vpsyl3ky6w6v>
Date: Wed, 27 Aug 2025 16:18:55 +0200
From: Mateusz Guzik <mjguzik@...il.com>
To: Josef Bacik <josef@...icpanda.com>
Cc: linux-fsdevel@...r.kernel.org, linux-btrfs@...r.kernel.org, 
	kernel-team@...com, linux-ext4@...r.kernel.org, linux-xfs@...r.kernel.org, 
	brauner@...nel.org, viro@...iv.linux.org.uk, amir73il@...il.com
Subject: Re: [PATCH v2 03/54] fs: rework iput logic

On Wed, Aug 27, 2025 at 02:58:51PM +0200, Mateusz Guzik wrote:
> On Tue, Aug 26, 2025 at 11:39:03AM -0400, Josef Bacik wrote:
> > Currently, if we are the last iput, and we have the I_DIRTY_TIME bit
> > set, we will grab a reference on the inode again and then mark it dirty
> > and then redo the put.  This is to make sure we delay the time update
> > for as long as possible.
> > 
> > We can rework this logic to simply dec i_count if it is not 1, and if it
> > is do the time update while still holding the i_count reference.
> > 
> > Then we can replace the atomic_dec_and_lock with locking the ->i_lock
> > and doing atomic_dec_and_test, since we did the atomic_add_unless above.
> > 
> > Signed-off-by: Josef Bacik <josef@...icpanda.com>
> > ---
> >  fs/inode.c | 23 ++++++++++++++---------
> >  1 file changed, 14 insertions(+), 9 deletions(-)
> > 
> > diff --git a/fs/inode.c b/fs/inode.c
> > index a3673e1ed157..13e80b434323 100644
> > --- a/fs/inode.c
> > +++ b/fs/inode.c
> > @@ -1911,16 +1911,21 @@ void iput(struct inode *inode)
> >  	if (!inode)
> >  		return;
> >  	BUG_ON(inode->i_state & I_CLEAR);
> > -retry:
> > -	if (atomic_dec_and_lock(&inode->i_count, &inode->i_lock)) {
> > -		if (inode->i_nlink && (inode->i_state & I_DIRTY_TIME)) {
> > -			atomic_inc(&inode->i_count);
> > -			spin_unlock(&inode->i_lock);
> > -			trace_writeback_lazytime_iput(inode);
> > -			mark_inode_dirty_sync(inode);
> > -			goto retry;
> > -		}
> > +
> > +	if (atomic_add_unless(&inode->i_count, -1, 1))
> > +		return;
> > +
> > +	if (inode->i_nlink && (inode->i_state & I_DIRTY_TIME)) {
> > +		trace_writeback_lazytime_iput(inode);
> > +		mark_inode_dirty_sync(inode);
> > +	}
> > +
> > +	spin_lock(&inode->i_lock);
> > +	if (atomic_dec_and_test(&inode->i_count)) {
> > +		/* iput_final() drops i_lock */
> >  		iput_final(inode);
> > +	} else {
> > +		spin_unlock(&inode->i_lock);
> >  	}
> >  }
> >  EXPORT_SYMBOL(iput);
> > -- 
> > 2.49.0
> > 
> 
> This changes semantics though.
> 
> In the stock kernel the I_DIRTY_TIME business is guaranteed to be sorted
> out before the call to iput_final().
> 
> In principle the flag may reappear after mark_inode_dirty_sync() returns
> and before the retried atomic_dec_and_lock succeeds, in which case it
> will get cleared again.
> 
> With your change the flag is only handled once and should it reappear
> before you take the ->i_lock, it will stay there.
> 
> I agree the stock handling is pretty crap though.
> 
> Your change should test the flag again after taking the spin lock but
> before messing with the refcount and if need be unlock + retry.
> 
> I would not hurt to assert in iput_final that the spin lock held and
> that this flag is not set.
> 
> Here is my diff to your diff to illustrate + a cosmetic change, not even
> compile-tested:
> 
> diff --git a/fs/inode.c b/fs/inode.c
> index 421e248b690f..a9ae0c790b5d 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -1911,7 +1911,7 @@ void iput(struct inode *inode)
>  	if (!inode)
>  		return;
>  	BUG_ON(inode->i_state & I_CLEAR);
> -
> +retry:
>  	if (atomic_add_unless(&inode->i_count, -1, 1))
>  		return;
>  
> @@ -1921,12 +1921,19 @@ void iput(struct inode *inode)
>  	}
>  
>  	spin_lock(&inode->i_lock);
> +
> +	if (inode->i_count == 1 && inode->i_nlink && (inode->i_state & I_DIRTY_TIME)) {
> +		spin_unlock(&inode->i_lock);
> +		goto retry;
> +	}
> +
>  	if (atomic_dec_and_test(&inode->i_count)) {
> -		/* iput_final() drops i_lock */
> -		iput_final(inode);
> -	} else {
>  		spin_unlock(&inode->i_lock);
> +		return;
>  	}
> +
> +	/* iput_final() drops i_lock */
> +	iput_final(inode);
>  }
>  EXPORT_SYMBOL(iput);
>  

Sorry for spam, but the more I look at this the more fucky the entire
ordeal appears to me.

Before I get to the crux, as a side note I did a quick check if atomics
for i_count make any sense to begin with and I think they do, here is a
sample output from a friend tracing the ref value on iput:

bpftrace -e 'kprobe:iput /arg0 != 0/ { @[((struct inode *)arg0)->i_count.counter] = count(); }'

@[5]: 66
@[4]: 4625
@[3]: 11086
@[2]: 30937
@[1]: 151785

... so plenty of non-last refs after all.

I completely agree the mandatory ref trip to handle I_DIRTY_TIME is lame
and needs to be addressed.

But I'm uneasy about maintaining the invariant that iput_final() does
not see the flag if i_nlink != 0 and my proposal as pasted is dodgy af
on this front.

While here some nits:
1. it makes sense to try mere atomics just in case someone else messed
with the count between handling of the dirty flag and taking the spin lock
2. according to my quick test with bpftrace the I_DIRTY_TIME flag is
seen way less frequently than i_nlink != 0, so it makes sense to swap
the order in which they are checked. Interested parties can try it out
with:
bpftrace -e 'kprobe:iput /arg0 != 0/ { @[((struct inode *)arg0)->i_nlink != 0, ((struct inode *)arg0)->i_state & (1 << 11)] = count(); }'
3. touch up the iput_final() unlock comment

All that said, how about something like the thing below as the final
routine building off of your change. I can't submit a proper patch and
can't even compile-test. I don't need any credit should this get
grabbed.

void iput(struct inode *inode)
{
        if (!inode)
                return;
        BUG_ON(inode->i_state & I_CLEAR);
retry:
        if (atomic_add_unless(&inode->i_count, -1, 1))
                return;

        if ((inode->i_state & I_DIRTY_TIME) && inode->i_nlink) {
                trace_writeback_lazytime_iput(inode);
                mark_inode_dirty_sync(inode);
                goto retry;
        }

        spin_lock(&inode->i_lock);
        if ((inode->i_state & I_DIRTY_TIME) && inode->i_nlink) {
                spin_unlock(&inode->i_lock);
                goto retry;
        }

        if (!atomic_dec_and_test(&inode->i_count)) {
                spin_unlock(&inode->i_lock);
                return;
        }

        /*
         * iput_final() drops ->i_lock, we can't assert on it as the inode may
         * be deallocated by the time it returns
         */
        iput_final(inode);
}