linux-ext4 - Re: Kernel Benchmarking

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20200913234503.GS12096@dread.disaster.area>
Date:   Mon, 14 Sep 2020 09:45:03 +1000
From:   Dave Chinner <david@...morbit.com>
To:     Linus Torvalds <torvalds@...ux-foundation.org>
Cc:     Amir Goldstein <amir73il@...il.com>,
        Hugh Dickins <hughd@...gle.com>,
        Michael Larabel <Michael@...haellarabel.com>,
        Ted Ts'o <tytso@...gle.com>,
        Andreas Dilger <adilger.kernel@...ger.ca>,
        Ext4 Developers List <linux-ext4@...r.kernel.org>,
        Jan Kara <jack@...e.cz>,
        linux-fsdevel <linux-fsdevel@...r.kernel.org>
Subject: Re: Kernel Benchmarking

On Sat, Sep 12, 2020 at 07:39:31PM -0700, Linus Torvalds wrote:
> On Sat, Sep 12, 2020 at 5:41 PM Dave Chinner <david@...morbit.com> wrote:
> >
> > Hmmmm. So this is a typically a truncate race check, but this isn't
> > sufficient to protect the fault against all page invalidation races
> > as the page can be re-inserted into the same mapping at a different
> > page->index now within EOF.
> 
> Using some "move" ioctl or similar and using a "invalidate page
> mapping, then move it to a different point" model?

Right, that's the sort of optimisation we could do inside a
FALLOC_FL_{COLLAPSE,INSERT}_RANGE operation if we wanted to preserve
the page cache contents instead of invalidating it.

> Yeah. I think that ends up being basically an extended special case of
> the truncate thing (for the invalidate), and would require the
> filesystem to serialize externally to the page anyway.

*nod*

> Which they would presumably already do with the MMAPLOCK or similar,
> so I guess that's not a huge deal.
> 
> The real worry with (d) is that we are using the page lock for other
> things too, not *just* the truncate check. Things where the inode lock
> wouldn't be helping, like locking against throwing pages out of the
> page cache entirely, or the hugepage splitting/merging etc. It's not
> being truncated, it's just the VM shrinking the cache or modifying
> things in other ways.

Yes, that is a problem, and us FS people don't know/see all the
places this can occur. We generally find out about them when one of
our regression stress tests trips over a data corruption. :(

I have my doubts that complex page cache manipulation operations
like ->migrate_page that rely exclusively on page and internal mm
serialisation are really safe against ->fallocate based invalidation
races.  I think they probably also need to be wrapped in the
MMAPLOCK, but I don't understand all the locking and constraints
that ->migrate_page has and there's been no evidence yet that it's a
problem so I've kinda left that alone. I suspect that "no evidence"
thing comes from "filesystem people are largely unable to induce
page migrations in regression testing" so it has pretty much zero
test coverage....

Stuff like THP splitting hasn't been an issue for us because the
file-backed page cache does not support THP (yet!). That's
something I'll be looking closely at in Willy's upcoming patchset.

> So I do worry a bit about trying to make things per-inode (or even
> some per-range thing with a smarter lock) for those reasons. We use
> the page lock not just for synchronizing with filesystem operations,
> but for other page state synchronization too.

Right, I'm not suggesting the page lock goes away, just saying that
we actually need two levels of locking for file-backed pages - one
filesystem, one page level - and that carefully selecting where we
"aggregate" the locking for complex multi-object operations might
make the overall locking simpler.

> In many ways I think keeping it as a page-lock, and making the
> filesystem operations just act on the range of pages would be safer.

Possibly, but that "range of pages" lock still doesn't really solve
the filesystem level serialisation problem.  We have to prevent page
faults from running over a range even when there aren't pages in the
page cache over that range (i.e. after we invalidate the range).
Hence we cannot rely on anything struct page related - the
serialisation mechanism has to be external to the cached pages
themselves, but it also has to integrate cleanly into the existing
locking and transaction ordering constraints we have.

> But the page locking code does have some extreme downsides, exactly
> because there are so _many_ pages and we end up having to play some
> extreme size games due to that (ie the whole external hashing, but
> also just not being able to use any debug locks etc, because we just
> don't have the resources to do debugging locks at that kind of
> granularity).

*nod*

The other issue here is that serialisation via individual cache
object locking just doesn't scale in any way to the sizes of
operations that fallocate() can run. fallocate() has 64 bit
operands, so a user could ask us to lock down a full 8EB range of
file. Locking that page by page, even using 1GB huge page Xarray
slot entries, is just not practical... :/

Cheers,

Dave.
-- 
Dave Chinner
david@...morbit.com