linux-kernel - Re: [PATCH 28/28] xfs: rework unreferenced inode lookups

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20191115172600.GC55854@bfoster>
Date:   Fri, 15 Nov 2019 12:26:00 -0500
From:   Brian Foster <bfoster@...hat.com>
To:     Dave Chinner <david@...morbit.com>
Cc:     linux-xfs@...r.kernel.org, linux-fsdevel@...r.kernel.org,
        linux-mm@...ck.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH 28/28] xfs: rework unreferenced inode lookups

On Fri, Nov 15, 2019 at 09:16:02AM +1100, Dave Chinner wrote:
> On Wed, Nov 06, 2019 at 05:18:46PM -0500, Brian Foster wrote:
> > On Fri, Nov 01, 2019 at 10:46:18AM +1100, Dave Chinner wrote:
> > > From: Dave Chinner <dchinner@...hat.com>
> > > 
> > > Looking up an unreferenced inode in the inode cache is a bit hairy.
> > > We do this for inode invalidation and writeback clustering purposes,
> > > which is all invisible to the VFS. Hence we can't take reference
> > > counts to the inode and so must be very careful how we do it.
> > > 
> > > There are several different places that all do the lookups and
> > > checks slightly differently. Fundamentally, though, they are all
> > > racy and inode reclaim has to block waiting for the inode lock if it
> > > loses the race. This is not very optimal given all the work we;ve
> > > already done to make reclaim non-blocking.
> > > 
> > > We can make the reclaim process nonblocking with a couple of simple
> > > changes. If we define the unreferenced lookup process in a way that
> > > will either always grab an inode in a way that reclaim will notice
> > > and skip, or will notice a reclaim has grabbed the inode so it can
> > > skip the inode, then there is no need for reclaim to need to cycle
> > > the inode ILOCK at all.
> > > 
> > > Selecting an inode for reclaim is already non-blocking, so if the
> > > ILOCK is held the inode will be skipped. If we ensure that reclaim
> > > holds the ILOCK until the inode is freed, then we can do the same
> > > thing in the unreferenced lookup to avoid inodes in reclaim. We can
> > > do this simply by holding the ILOCK until the RCU grace period
> > > expires and the inode freeing callback is run. As all unreferenced
> > > lookups have to hold the rcu_read_lock(), we are guaranteed that
> > > a reclaimed inode will be noticed as the trylock will fail.
> > > 
> > ...
> > > 
> > > Signed-off-by: Dave Chinner <dchinner@...hat.com>
> > > ---
> > >  fs/xfs/mrlock.h     |  27 +++++++++
> > >  fs/xfs/xfs_icache.c |  88 +++++++++++++++++++++--------
> > >  fs/xfs/xfs_inode.c  | 131 +++++++++++++++++++++-----------------------
> > >  3 files changed, 153 insertions(+), 93 deletions(-)
> > > 
> > > diff --git a/fs/xfs/mrlock.h b/fs/xfs/mrlock.h
> > > index 79155eec341b..1752a2592bcc 100644
> > > --- a/fs/xfs/mrlock.h
> > > +++ b/fs/xfs/mrlock.h
> > ...
> > > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> > > index 11bf4768d491..45ee3b5cd873 100644
> > > --- a/fs/xfs/xfs_icache.c
> > > +++ b/fs/xfs/xfs_icache.c
> > > @@ -106,6 +106,7 @@ xfs_inode_free_callback(
> > >  		ip->i_itemp = NULL;
> > >  	}
> > >  
> > > +	mrunlock_excl_non_owner(&ip->i_lock);
> > >  	kmem_zone_free(xfs_inode_zone, ip);
> > >  }
> > >  
> > > @@ -132,6 +133,7 @@ xfs_inode_free(
> > >  	 * free state. The ip->i_flags_lock provides the barrier against lookup
> > >  	 * races.
> > >  	 */
> > > +	mrupdate_non_owner(&ip->i_lock);
> > 
> > Can we tie these into the proper locking interface using flags? For
> > example, something like xfs_ilock(ip, XFS_ILOCK_EXCL|XFS_ILOCK_NONOWNER)
> > or xfs_ilock(ip, XFS_ILOCK_EXCL_NONOWNER) perhaps?
> 
> I'd prefer not to make this part of the common locking interface -
> it's a one off special use case, not something we want to progate
> elsewhere into the code.
> 

What urks me about this is that it obfuscates rather than highlights
that fact because I have no idea what mrtryupdate_non_owner() actually
does without looking it up. We could easily name a flag
XFS_ILOCK_PENDING_RECLAIM or something similarly ridiculous to make it
blindingly obvious it should only be used in a special context.

> Now that I think over it, I probably should have tagged this with
> patch with [RFC]. I think we should just get rid of the mrlock
> wrappers rather than add more, and that would simplify this a lot.
> 

Yeah, FWIW I've been reviewing this patch as a WIP on top of all of the
nonblocking bits as opposed to being some fundamental part of that work.
That aside, I also agree that cleaning up these wrappers might address
that concern because something like:

	/* open code ilock because ... */
	down_write_trylock_non_owner(&ip->i_lock);

... is at least readable code.

> 
> > >  	spin_lock(&ip->i_flags_lock);
> > >  	ip->i_flags = XFS_IRECLAIM;
> > >  	ip->i_ino = 0;
> > > @@ -295,11 +297,24 @@ xfs_iget_cache_hit(
> > >  		}
> > >  
> > >  		/*
> > > -		 * We need to set XFS_IRECLAIM to prevent xfs_reclaim_inode
> > > -		 * from stomping over us while we recycle the inode. Remove it
> > > -		 * from the LRU straight away so we can re-init the VFS inode.
> > > +		 * Before we reinitialise the inode, we need to make sure
> > > +		 * reclaim does not pull it out from underneath us. We already
> > > +		 * hold the i_flags_lock, and because the XFS_IRECLAIM is not
> > > +		 * set we know the inode is still on the LRU. However, the LRU
> > > +		 * code may have just selected this inode to reclaim, so we need
> > > +		 * to ensure we hold the i_flags_lock long enough for the
> > > +		 * trylock in xfs_inode_reclaim_isolate() to fail. We do this by
> > > +		 * removing the inode from the LRU, which will spin on the LRU
> > > +		 * list locks until reclaim stops walking, at which point we
> > > +		 * know there is no possible race between reclaim isolation and
> > > +		 * this lookup.
> > > +		 *
> > 
> > Somewhat related to my question about the lru_lock on the earlier patch.
> 
> *nod*
> 
> The caveat here is that this is the slow path so spinning for a
> while doesn't really matter.
> 
> > > @@ -1022,19 +1076,7 @@ xfs_dispose_inode(
> > >  	spin_unlock(&pag->pag_ici_lock);
> > >  	xfs_perag_put(pag);
> > >  
> > > -	/*
> > > -	 * Here we do an (almost) spurious inode lock in order to coordinate
> > > -	 * with inode cache radix tree lookups.  This is because the lookup
> > > -	 * can reference the inodes in the cache without taking references.
> > > -	 *
> > > -	 * We make that OK here by ensuring that we wait until the inode is
> > > -	 * unlocked after the lookup before we go ahead and free it.
> > > -	 *
> > > -	 * XXX: need to check this is still true. Not sure it is.
> > > -	 */
> > > -	xfs_ilock(ip, XFS_ILOCK_EXCL);
> > >  	xfs_qm_dqdetach(ip);
> > > -	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> > 
> > Ok, so I'm staring at this a bit more and think I'm missing something.
> > If we put aside the change to hold ilock until the inode is freed, we
> > basically have the following (simplified) flow as the inode goes from
> > isolation to disposal:
> > 
> > 	ilock	(isolate)
> > 	iflock
> > 	set XFS_IRECLAIM
> > 	ifunlock (disposal)
> > 	iunlock
> > 	radix delete
> > 	ilock cycle (drain)
> > 	rcu free
> > 
> > What we're trying to eliminate is the ilock cycle to drain any
> > concurrent unreferenced lookups from accessing the inode once it is
> > freed. The free itself is still RCU protected.
> > 
> > Looking over at the ifree path, we now have something like this:
> > 
> > 	rcu_read_lock()
> > 	radix lookup
> > 	check XFS_IRECLAIM
> > 	ilock
> > 	if XFS_ISTALE, skip
> > 	set XFS_ISTALE
> > 	rcu_read_unlock()
> > 	iflock
> > 	/* return locked down inode */
> 
> You missed a lock.
> 
> 	rcu_read_lock()
> 	radix lookup
> >>>	i_flags_lock
> 	check XFS_IRECLAIM
> 	ilock
> 	if XFS_ISTALE, skip
> 	set XFS_ISTALE
> >>>	i_flags_unlock
> 	rcu_read_unlock()
> 	iflock
> 
> > Given that we set XFS_IRECLAIM under ilock, would we still need either
> > the ilock cycle or to hold ilock through the RCU free if the ifree side
> > (re)checked XFS_IRECLAIM after it has the ilock (but before it drops the
> > rcu read lock)?
> 
> We set XFS_IRECLAIM under the i_flags_lock.
> 
> It is the combination of rcu_read_lock() and i_flags_lock() that
> provides the RCU lookup state barriers - the ILOCK is not part of
> that at all.
> 
> The key point here is that once we've validated the inode we found
> in the radix tree under the i_flags_lock, we then take the ILOCK,
> thereby serialising the taking of the ILOCK here with the taking of
> the ILOCK in the reclaim isolation code.
> 
> i.e. all the reclaim state serialisation is actually based around
> holding the i_flags_lock, not the ILOCK. 
> 
> Once we have grabbed the ILOCK under the i_flags_lock, we can
> drop the i_flags_lock knowing that reclaim will not be able isolate
> this inode and set XFS_IRECLAIM.
> 

Hmm, Ok. I knew i_flags_lock was in there when I wrote this up. I
intentionally left it out as a simplification because it wasn't clear to
me that it was a critical part of the lookup. I'll keep this in mind the
next time I walk through this code.

> > ISTM we should either have a non-reclaim inode with
> > ilock protection or a reclaim inode with RCU protection (so we can skip
> > it before it frees), but I could easily be missing something here..
> 
> Heh. Yeah, it's a complex dance, and it's all based around how
> RCU lookups and the i_flags_lock interact to provide coherent
> detection of freed inodes.
> 
> I have a nagging feeling that this whole ILOCK-held-to-rcu-free game
> can be avoided. I need to walk myself through the lookup state
> machine again and determine if ordering the XFS_IRECLAIM flag check
> after greabbing the ILOCK is sufficient to prevent ifree/iflush
> lookups from accessing the inode outside the rcu_read_lock()
> context.
> 

That's pretty much what I was wondering...

> If so, most of this patch will go away....
> 
> > > +	 * attached to the buffer so we don't need to do anything more here.
> > >  	 */
> > > -	if (ip != free_ip) {
> > > -		if (!xfs_ilock_nowait(ip, XFS_ILOCK_EXCL)) {
> > > -			rcu_read_unlock();
> > > -			delay(1);
> > > -			goto retry;
> > > -		}
> > > -
> > > -		/*
> > > -		 * Check the inode number again in case we're racing with
> > > -		 * freeing in xfs_reclaim_inode().  See the comments in that
> > > -		 * function for more information as to why the initial check is
> > > -		 * not sufficient.
> > > -		 */
> > > -		if (ip->i_ino != inum) {
> > > +	if (__xfs_iflags_test(ip, XFS_ISTALE)) {
> > 
> > Is there a correctness reason for why we move the stale check to under
> > ilock (in both iflush/ifree)?
> 
> It's under the i_flags_lock, and so I moved it up under the lookup
> hold of the i_flags_lock so we don't need to cycle it again.
> 

Yeah, but in both cases it looks like it moved to under the ilock as
well, which comes after i_flags_lock. IOW, why grab ilock for stale
inodes when we're just going to skip them?

Brian

> > >  	/*
> > > -	 * We don't need to attach clean inodes or those only with unlogged
> > > -	 * changes (which we throw away, anyway).
> > > +	 * We don't need to attach clean inodes to the buffer - they are marked
> > > +	 * stale in memory now and will need to be re-initialised by inode
> > > +	 * allocation before they can be reused.
> > >  	 */
> > >  	if (!ip->i_itemp || xfs_inode_clean(ip)) {
> > >  		ASSERT(ip != free_ip);
> > >  		xfs_ifunlock(ip);
> > > -		xfs_iunlock(ip, XFS_ILOCK_EXCL);
> > > +		if (ip != free_ip)
> > > +			xfs_iunlock(ip, XFS_ILOCK_EXCL);
> > 
> > There's an assert against this case just above, though I suppose there's
> > nothing wrong with just keeping it and making the functional code more
> > cautious.
> 
> *nod*
> 
> It follows Darrick's lead of making sure that production kernels
> don't do something stupid because of some whacky corruption we
> didn't expect to ever see.
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@...morbit.com
>