Message-ID: <Yecp3DspJOkhaDGV@bfoster>
Date: Tue, 18 Jan 2022 15:58:04 -0500
From: Brian Foster <bfoster@...hat.com>
To: Al Viro <viro@...iv.linux.org.uk>
Cc: Ian Kent <raven@...maw.net>, "Darrick J. Wong" <djwong@...nel.org>,
Christoph Hellwig <hch@....de>,
Miklos Szeredi <miklos@...redi.hu>,
David Howells <dhowells@...hat.com>,
Kernel Mailing List <linux-kernel@...r.kernel.org>,
linux-fsdevel <linux-fsdevel@...r.kernel.org>,
xfs <linux-xfs@...r.kernel.org>,
Linus Torvalds <torvalds@...ux-foundation.org>
Subject: Re: [PATCH] vfs: check dentry is still valid in get_link()
On Tue, Jan 18, 2022 at 07:20:35PM +0000, Al Viro wrote:
> On Tue, Jan 18, 2022 at 01:25:15PM -0500, Brian Foster wrote:
>
> > If I go back to the inactive -> reclaimable grace period variant and
> > also insert a start_poll_synchronize_rcu() and
> > poll_state_synchronize_rcu() pair across the inactive processing
> > sequence, I start seeing numbers closer to ~36k cycles. IOW, the
> > xfs_inodegc_inactivate() helper looks something like this:
> >
> > 	if (poll_state_synchronize_rcu(ip->i_destroy_gp))
> > 		xfs_inodegc_set_reclaimable(ip);
> > 	else
> > 		call_rcu(&VFS_I(ip)->i_rcu, xfs_inodegc_set_reclaimable_callback);
> >
> > ... to skip the rcu grace period if one had already passed while the
> > inode sat on the inactivation queue and was processed.
> >
> > However my box went haywire shortly after with rcu stall reports or
> > something if I let that continue to run longer, so I'll probably have to
> > look into that lockdep splat (complaining about &pag->pag_ici_lock in
> > rcu context, perhaps needs to become irq safe?) or see if something else
> > is busted..
>
> Umm... Could you (or Dave) describe where the mainline does the
> RCU delay mentioned several times in these threads, in case of
> * unlink()
> * overwriting rename()
> * open() + unlink() + close() (that one, obviously, for regular files)
>
If you're referring to the existing RCU delay in XFS, I suspect that's
xfs_reclaim_inode(). The tail end of that function removes the inode
from the perag tree and then calls __xfs_inode_free(), which frees the
inode via an RCU callback.
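
Roughly speaking, the tail end of that path looks like the following
(paraphrased from memory rather than quoted verbatim; error handling and
assertions trimmed):

	spin_lock(&pag->pag_ici_lock);
	radix_tree_delete(&pag->pag_ici_root,
			  XFS_INO_TO_AGINO(ip->i_mount, ino));
	spin_unlock(&pag->pag_ici_lock);
	...
	/*
	 * __xfs_inode_free() ends in:
	 *
	 *	call_rcu(&VFS_I(ip)->i_rcu, xfs_inode_free_callback);
	 *
	 * ... so the memory backing the inode isn't released until a
	 * grace period has elapsed.
	 */
	__xfs_inode_free(ip);

Note that this only covers the final free. An inode sitting in the
reclaimable state can still be recycled via the iget path without an
intervening grace period, which is what the experiments above are trying
to address.
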
BTW, I think the experiment above is going to require either an audit to
make the various _set_reclaimable() locks irq safe or something a bit
uglier, like bouncing the inode back to a workqueue after the RCU delay
to make the state transition. Given the incremental improvement from
using start_poll_synchronize_rcu(), I replaced the destroy side code
above with a cond_synchronize_rcu(ip->i_destroy_gp) call on the
iget/recycle side and see similar results (~36k cycles per 60s, but so
far without any explosions).
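
FWIW, the overall shape of that variant is roughly the following (a
sketch rather than the actual diff; i_destroy_gp is the field added by
the experiment and the exact hook points are approximate):

	/*
	 * Destroy side: snapshot the current grace period state when
	 * the inode is queued for inactivation.
	 */
	ip->i_destroy_gp = start_poll_synchronize_rcu();

	...

	/*
	 * iget/recycle side: only block if a full grace period has not
	 * already elapsed since the snapshot above.
	 */
	cond_synchronize_rcu(ip->i_destroy_gp);

The nice part is that the synchronize only costs anything when an inode
is recycled before a grace period has naturally passed; otherwise it
falls through immediately.
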
Brian
> The thing is, if we already do have an RCU delay in there, there might be
> a better solution - making sure it happens downstream of d_drop() (in case
> when dentry has other references) or dentry going negative (in case we are
> holding the sole reference to it).
>
> If we can do that, RCU'd dcache lookups won't run into inode reuse at all.
> Sure, right now d_delete() comes last *and* we want the target inode to stay
> around past the call of ->unlink(). But if you look at do_unlinkat() you'll
> see an interesting-looking wart with ihold/iput around vfs_unlink(). Not
> sure if you remember the story on that one; basically, it's "we'd rather
> have possible IO on inode freeing to happen outside of the lock on parent".
>
> nfsd and mqueue do the same thing; ksmbd does not. OTOH, ksmbd appears to
> force the "make victim go unhashed, sole reference or not". ecryptfs
> definitely does that forcing (deliberately so).
>
> That area could use some rethinking, and if we can deal with the inode reuse
> delay while we are at it...
>
> Overwriting rename() is also interesting in that respect, of course.
>
> I can go and try to RTFS my way through xfs iget-related code, but I'd
> obviously prefer to do so with at least some overview of that thing
> from the folks familiar with it. Seeing that it's a lockless search
> structure, missing something subtle there is all too easy, especially
> with the lookup-by-fhandle stuff in the mix...
>