linux-ext4 - Re: Kernel Benchmarking

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <CAHk-=wh5Lyr9Tr8wpNDXKeNt=Ngc3jwWaOsN_WbQr+1dAuhJSQ@mail.gmail.com>
Date:   Sat, 12 Sep 2020 19:39:31 -0700
From:   Linus Torvalds <torvalds@...ux-foundation.org>
To:     Dave Chinner <david@...morbit.com>
Cc:     Amir Goldstein <amir73il@...il.com>,
        Hugh Dickins <hughd@...gle.com>,
        Michael Larabel <Michael@...haellarabel.com>,
        "Ted Ts'o" <tytso@...gle.com>,
        Andreas Dilger <adilger.kernel@...ger.ca>,
        Ext4 Developers List <linux-ext4@...r.kernel.org>,
        Jan Kara <jack@...e.cz>,
        linux-fsdevel <linux-fsdevel@...r.kernel.org>
Subject: Re: Kernel Benchmarking

On Sat, Sep 12, 2020 at 5:41 PM Dave Chinner <david@...morbit.com> wrote:
>
> Hmmmm. So this is a typically a truncate race check, but this isn't
> sufficient to protect the fault against all page invalidation races
> as the page can be re-inserted into the same mapping at a different
> page->index now within EOF.

Using some "move" ioctl or similar and using a "invalidate page
mapping, then move it to a different point" model?

Yeah. I think that ends up being basically an extended special case of
the truncate thing (for the invalidate), and would require the
filesystem to serialize externally to the page anyway.

Which they would presumably already do with the MMAPLOCK or similar,
so I guess that's not a huge deal.

The real worry with (d) is that we are using the page lock for other
things too, not *just* the truncate check. Things where the inode lock
wouldn't be helping, like locking against throwing pages out of the
page cache entirely, or the hugepage splitting/merging etc. It's not
being truncated, it's just the VM shrinking the cache or modifying
things in other ways.

So I do worry a bit about trying to make things per-inode (or even
some per-range thing with a smarter lock) for those reasons. We use
the page lock not just for synchronizing with filesystem operations,
but for other page state synchronization too.

In many ways I think keeping it as a page-lock, and making the
filesystem operations just act on the range of pages would be safer.

But the page locking code does have some extreme downsides, exactly
because there are so _many_ pages and we end up having to play some
extreme size games due to that (ie the whole external hashing, but
also just not being able to use any debug locks etc, because we just
don't have the resources to do debugging locks at that kind of
granularity).

That's somewhat more longer-term. I'll try to do another version of
the "hybrid fairness" page lock (and/or just try some limited
optimistic spinning) to see if I can at least avoid the nasty
regression. Admittedly it really probably only happens for these kinds
of microbenchmarks that just hammer on one page over and over again,
but it's a big enough regression for a "real enough" load that I
really don't like it.

                 Linus