linux-kernel - Re: pagecache locking (was: bcachefs status update) merged)

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20190619223756.GC26375@dread.disaster.area>
Date:   Thu, 20 Jun 2019 08:37:56 +1000
From:   Dave Chinner <david@...morbit.com>
To:     Jan Kara <jack@...e.cz>
Cc:     Amir Goldstein <amir73il@...il.com>,
        Linus Torvalds <torvalds@...ux-foundation.org>,
        Kent Overstreet <kent.overstreet@...il.com>,
        Dave Chinner <dchinner@...hat.com>,
        "Darrick J . Wong" <darrick.wong@...cle.com>,
        Christoph Hellwig <hch@....de>,
        Matthew Wilcox <willy@...radead.org>,
        Linux List Kernel Mailing <linux-kernel@...r.kernel.org>,
        linux-xfs <linux-xfs@...r.kernel.org>,
        linux-fsdevel <linux-fsdevel@...r.kernel.org>,
        Josef Bacik <josef@...icpanda.com>,
        Alexander Viro <viro@...iv.linux.org.uk>,
        Andrew Morton <akpm@...ux-foundation.org>
Subject: Re: pagecache locking (was: bcachefs status update) merged)

On Wed, Jun 19, 2019 at 12:38:38PM +0200, Jan Kara wrote:
> On Tue 18-06-19 07:21:56, Amir Goldstein wrote:
> > > > Right, but regardless of the spec we have to consider that the
> > > > behaviour of XFS comes from it's Irix heritage (actually from EFS,
> > > > the predecessor of XFS from the late 1980s)
> > >
> > > Sure. And as I mentioned, I think it's technically the nicer guarantee.
> > >
> > > That said, it's a pretty *expensive* guarantee. It's one that you
> > > yourself are not willing to give for O_DIRECT IO.
> > >
> > > And it's not a guarantee that Linux has ever had. In fact, it's not
> > > even something I've ever seen anybody ever depend on.
> > >
> > > I agree that it's possible that some app out there might depend on
> > > that kind of guarantee, but I also suspect it's much much more likely
> > > that it's the other way around: XFS is being unnecessarily strict,
> > > because everybody is testing against filesystems that don't actually
> > > give the total atomicity guarantees.
> > >
> > > Nobody develops for other unixes any more (and nobody really ever did
> > > it by reading standards papers - even if they had been very explicit).
> > >
> > > And honestly, the only people who really do threaded accesses to the same file
> > >
> > >  (a) don't want that guarantee in the first place
> > >
> > >  (b) are likely to use direct-io that apparently doesn't give that
> > > atomicity guarantee even on xfs
> > >
> > > so I do think it's moot.
> > >
> > > End result: if we had a really cheap range lock, I think it would be a
> > > good idea to use it (for the whole QoI implementation), but for
> > > practical reasons it's likely better to just stick to the current lack
> > > of serialization because it performs better and nobody really seems to
> > > want anything else anyway.
> > >
> > 
> > This is the point in the conversation where somebody usually steps in
> > and says "let the user/distro decide". Distro maintainers are in a much
> > better position to take the risk of breaking hypothetical applications.
> > 
> > I should point out that even if "strict atomic rw" behavior is desired, then
> > page cache warmup [1] significantly improves performance.
> > Having mentioned that, the discussion can now return to what is the
> > preferred way to solve the punch hole vs. page cache add race.
> > 
> > XFS may end up with special tailored range locks, which beings some
> > other benefits to XFS, but all filesystems need the solution for the punch
> > hole vs. page cache add race.
> > Jan recently took a stab at it for ext4 [2], but that didn't work out.
> 
> Yes, but I have idea how to fix it. I just need to push acquiring ext4's
> i_mmap_sem down a bit further so that only page cache filling is protected
> by it but not copying of data out to userspace. But I didn't get to coding
> it last week due to other stuff.
> 
> > So I wonder what everyone thinks about Kent's page add lock as the
> > solution to the problem.
> > Allegedly, all filesystems (XFS included) are potentially exposed to
> > stale data exposure/data corruption.
> 
> When we first realized that hole-punching vs page faults races are an issue I
> was looking into a generic solution but at that time there was a sentiment
> against adding another rwsem to struct address_space (or inode) so we ended
> up with a private lock in ext4 (i_mmap_rwsem), XFS (I_MMAPLOCK), and other
> filesystems these days. If people think that growing struct inode for
> everybody is OK, we can think about lifting private filesystem solutions
> into a generic one. I'm fine with that.

I'd prefer it doesn't get lifted to the VFS because I'm planning on
getting rid of it in XFS with range locks. i.e. the XFS_MMAPLOCK is
likely to go away in the near term because a range lock can be
taken on either side of the mmap_sem in the page fault path.

> That being said as Dave said we use those fs-private locks also for
> serializing against equivalent issues arising for DAX. So the problem is
> not only about page cache but generally about doing IO and caching
> block mapping information for a file range. So the solution should not be
> too tied to page cache.

Yup, that was the point I was trying to make when Linus started
shouting at me about how caches work and how essential they are.  I
guess the fact that DAX doesn't use the page cache isn't as widely
known as I assumed it was...

Cheers,

Dave.
-- 
Dave Chinner
david@...morbit.com