linux-kernel - Re: [PATCH 2/2] xfs: make sure link path does not go away at access

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20211117002251.GR449541@dread.disaster.area>
Date:   Wed, 17 Nov 2021 11:22:51 +1100
From:   Dave Chinner <david@...morbit.com>
To:     Brian Foster <bfoster@...hat.com>
Cc:     Ian Kent <raven@...maw.net>, Miklos Szeredi <miklos@...redi.hu>,
        xfs <linux-xfs@...r.kernel.org>,
        "Darrick J. Wong" <djwong@...nel.org>,
        Al Viro <viro@...iv.linux.org.uk>,
        David Howells <dhowells@...hat.com>,
        linux-fsdevel <linux-fsdevel@...r.kernel.org>,
        Kernel Mailing List <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH 2/2] xfs: make sure link path does not go away at access

On Tue, Nov 16, 2021 at 10:59:05AM -0500, Brian Foster wrote:
> On Tue, Nov 16, 2021 at 02:01:20PM +1100, Dave Chinner wrote:
> > On Tue, Nov 16, 2021 at 09:03:31AM +0800, Ian Kent wrote:
> > > On Tue, 2021-11-16 at 09:24 +1100, Dave Chinner wrote:
> > > > If it isn't safe for ext4 to do that, then we have a general
> > > > pathwalk problem, not an XFS issue. But, as you say, it is safe
> > > > to do this zeroing, so the fix to xfs_ifree() is to zero the
> > > > link buffer instead of freeing it, just like ext4 does.
> > > > 
> > > > As a side issue, we really don't want to move what XFS does in
> > > > .destroy_inode to .free_inode because that then means we need to
> > > > add synchronise_rcu() calls everywhere in XFS that might need to
> > > > wait on inodes being inactivated and/or reclaimed. And because
> > > > inode reclaim uses lockless rcu lookups, there's substantial
> > > > danger of adding rcu callback related deadlocks to XFS here.
> > > > That's just not a direction we should be moving in.
> > > 
> > > Another reason I decided to use the ECHILD return instead is that
> > > I thought synchronise_rcu() might add an unexpected delay.
> > 
> > It depends where you put the synchronise_rcu() call. :)
> > 
> > > Since synchronise_rcu() will only wait for processes that
> > > currently have the rcu read lock do you think that could actually
> > > be a problem in this code path?
> > 
> > No, I don't think it will.  The inode recycle case in XFS inode
> > lookup can trigger in two cases:
> > 
> > 1. VFS cache eviction followed by immediate lookup
> > 2. Inode has been unlinked and evicted, then free and reallocated by
> > the filesytsem.
> > 
> > In case #1, that's a cold cache lookup and hence delays are
> > acceptible (e.g. a slightly longer delay might result in having to
> > fetch the inode from disk again). Calling synchronise_rcu() in this
> > case is not going to be any different from having to fetch the inode
> > from disk...
> > 
> > In case #2, there's a *lot* of CPU work being done to modify
> > metadata (inode btree updates, etc), and so the operations can block
> > on journal space, metadata IO, etc. Delays are acceptible, and could
> > be in the order of hundreds of milliseconds if the transaction
> > subsystem is bottlenecked. waiting for an RCU grace period when we
> > reallocate an indoe immediately after freeing it isn't a big deal.
> > 
> > IOWs, if synchronize_rcu() turns out to be a problem, we can
> > optimise that separately - we need to correct the inode reuse
> > behaviour w.r.t. VFS RCU expectations, then we can optimise the
> > result if there are perf problems stemming from correct behaviour.
> > 
> 
> FWIW, with a fairly crude test on a high cpu count system, it's not that
> difficult to reproduce an observable degradation in inode allocation
> rate with a synchronous grace period in the inode reuse path, caused
> purely by a lookup heavy workload on a completely separate filesystem.
>
> The following is a 5m snapshot of the iget stats from a filesystem doing
> allocs/frees with an external/heavy lookup workload (which not included
> in the stats), with and without a sync grace period wait in the reuse
> path:
> 
> baseline:	ig 1337026 1331541 4 5485 0 5541 1337026
> sync_rcu_test:	ig 2955 2588 0 367 0 383 2955

The alloc/free part of the workload is a single threaded
create/unlink in a tight loop, yes?

This smells like a side effect of agressive reallocation of
just-freed XFS_IRECLAIMABLE inodes from the finobt that haven't had
their unlink state written back to disk yet. i.e. this is a corner
case in #2 above where a small set of inodes is being repeated
allocated and freed by userspace and hence being agressively reused
and never needing to wait for IO. i.e. a tempfile workload
optimisation...

> I think this is kind of the nature of RCU and why I'm not sure it's a
> great idea to rely on update side synchronization in a codepath that
> might want to scale/perform in certain workloads.

The problem here is not update side synchronisation. Root cause is
aggressive reallocation of recently freed VFS inodes via physical
inode allocation algorithms. Unfortunately, the RCU grace period
requirements of the VFS inode life cycle dictate that we can't
aggressively re-allocate and reuse freed inodes like this. i.e.
reallocation of a just-freed inode also has to wait for an RCU grace
period to pass before the in memory inode can be re-instantiated as
a newly allocated inode.

(Hmmmm - I wonder if of the other filesystems might have similar
problems with physical inode reallocation inside a RCU grace period?
i.e. without inode instance re-use, the VFS could potentially see
multiple in-memory instances of the same physical inode at the same
time.)

> I'm not totally sure
> if this will be a problem for real users running real workloads or not,
> or if this can be easily mitigated, whether it's all rcu or a cascading
> effect, etc. This is just a quick test so that all probably requires
> more test and analysis to discern.

This looks like a similar problem to what busy extents address - we
can't reuse a newly freed extent until the transaction containing
the EFI/EFD hit stable storage (and the discard operation on the
range is complete). Hence while a newly freed extent is
marked free in the allocbt, they can't be reused until they are
released from the busy extent tree.

I can think of several ways to address this, but let me think on it
a bit more.  I suspect there's a trick we can use to avoid needing
synchronise_rcu() completely by using the spare radix tree tag and
rcu grace period state checks with get_state_synchronize_rcu() and
poll_state_synchronize_rcu() to clear the radix tree tags via a
periodic radix tree tag walk (i.e. allocation side polling for "can
we use this inode" rather than waiting for the grace period to
expire once an inode has been selected and allocated.)

> > > 
> > > Sorry, I don't understand what you mean by the root cause not
> > > being identified?
> > 
> > The whole approach of "we don't know how to fix the inode reuse case
> > so disable it" implies that nobody has understood where in the reuse
> > case the problem lies. i.e. "inode reuse" by itself is not the root
> > cause of the problem.
> > 
> 
> I don't think anybody suggested to disable inode reuse.

Nobody did, so that's not what I was refering to. I was refering to
the patches for and comments advocating disabling .get_link for RCU
pathwalk because of the apparently unsolved problems stemming from
inode reuse...

> > The root cause is "allowing an inode to be reused without waiting
> > for an RCU grace period to expire". This might seem pedantic, but
> > "without waiting for an rcu grace period to expire" is the important
> > part of the problem (i.e. the bug), not the "allowing an inode to be
> > reused" bit.
> > 
> > Once the RCU part of the problem is pointed out, the solution
> > becomes obvious. As nobody had seen the obvious (wait for an RCU
> > grace period when recycling an inode) it stands to reason that
> > nobody really understood what the root cause of the inode reuse
> > problem.
> > 
> 
> The synchronize_rcu() approach was one of the first options discussed in
> the bug report once a reproducer was available.

What bug report would that be? :/

It's not one that I've read, and I don't recall seeing a pointer to
it anywhere in the path posting. IOWs, whatever discussion happened
in a private distro bug report can't be assumed as "general
knowledge" in an upstream discussion...

> AIUI, this is not currently a reproducible problem even before patch 1,
> which reduces the race window even further. Given that and the nak on
> the current patch (the justification for which I don't really
> understand), I'm starting to agree with Ian's earlier statement that
> perhaps it is best to separate this one so we can (hopefully) move patch
> 1 along on its own merit..

*nod*

The problem seems pretty rare, the pathwalk patch makes it
even rarer, so I think they can be separated just fine.

Cheers,

Dave.
-- 
Dave Chinner
david@...morbit.com