linux-ext4 - Re: Kernel Benchmarking

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAHk-=whjhYa3ig0U_mtpoU5Zok_2Y5zTCw8f-THkf1vHRBDNuA@mail.gmail.com>
Date:   Sat, 12 Sep 2020 10:59:40 -0700
From:   Linus Torvalds <torvalds@...ux-foundation.org>
To:     Amir Goldstein <amir73il@...il.com>,
        Hugh Dickins <hughd@...gle.com>
Cc:     Michael Larabel <Michael@...haellarabel.com>,
        "Ted Ts'o" <tytso@...gle.com>,
        Andreas Dilger <adilger.kernel@...ger.ca>,
        Ext4 Developers List <linux-ext4@...r.kernel.org>,
        Jan Kara <jack@...e.cz>,
        linux-fsdevel <linux-fsdevel@...r.kernel.org>
Subject: Re: Kernel Benchmarking

Sorry, I should have put much more background when I started cc'ing
people and lists..

On Sat, Sep 12, 2020 at 12:28 AM Amir Goldstein <amir73il@...il.com> wrote:
>
> Can you please add a reference to the original problem report and
> to the offending commit? This conversation appeared on the list without
> this information.
>
> Are filesystems other than ext4 also affected by this performance
> regression?

So let me expand on this, because it actually comes from a really old
problem that has been around for ages, and that I think I finally
understand.

And sadly, the regression comes from the fix to that old problem.

So I think the VM people (but perhaps not necessarily all filesystem
people) have been aware of a long-time problem with certain loads
causing huge latencies, up to and including watchdogs firing because
processes wouldn't make progress for over half a minute (or whatever
the default blocking watchdog timeout is - I would like to say that
it's some odd number like 22 seconds, but maybe that was RCU).

We've known it's related to long queues for the page lock, and about
three years ago now we added a "bookmark" entry to the page wakeup
queues, because those queues got so long that even just traversing the
wakeup queue was a big latency hit. But it's generally been some heavy
private load on a customer machine, and nobody ever really had a good
test-case for it.

We've actually had tons of different page lockers involved. One of the
suspects (and in fact I think it really was one of the causes, just
not the only one) was the NUMA migration, where under certain loads
with lots and lots of threads, the kernel would decide to try to
migrate a hot page, and lots of threads would come in and all
NUMA-fault on it (because it was some core page everything used), and
as part of the fault they would get the page lock to serialize, and
you'd end up with wait queues that were multiple _thousands_ of
entries long.

So the reports of watchdogs firing go back many many years, and over
the years we've had various band-aid fixes - things that really do
help the symptoms a lot, but really seem to be fixes for the symptoms
rather than something fundamental. That "let's break up the wait queue
with a bookmark so that we can at least enable interrupts" is perhaps
the best example of code that just shouldn't exist, but comes about
because there's been incredible contention on the page lock.

See commits 2554db916586 ("sched/wait: Break up long wake list walk")
and 11a19c7b099f ("sched/wait: Introduce wakeup bookmark in
wake_up_page_bit") for that bookmark thing and some of the list
numbers.

There's been a few actual fixes too - I think Hugh Dickins really
ended up fixing at least part of the NUMA balancing case by changing
some of the reference counting. So I don't think it's _all_ been
band-aids, but the page lock has been a thing that has come up
multiple times over the years.

See for example commit 9a1ea439b16b ("mm:
put_and_wait_on_page_locked() while page is migrated") for a patch
that ended up hopefully fixing at least one of the causes of the long
queues during migration. I say "hopefully", because (again) the loads
that cause these things were those "internal customer load" things
that we don't really have a lot of insight into. Hugh has been
involved over the years presumably exactly because google has been one
of those customers, although not the only one by far.

But the point here is that the page lock has been problematic for
years - with those reports of watchdogs (after tens of seconds!)
firing going back long before the fixes above. It's definitely not a
new thing, although I think it has perhaps become more common due to
"bigger machines running more complex loads becoming more common", but
who knows..

Anyway, for various reasons I was looking at this again a couple of
months ago: we had _yet_ another report of softlockups:

  https://lore.kernel.org/lkml/20200721063258.17140-1-mhocko@kernel.org/

and we had an unrelated thread about a low-level race in page wakeup
(the original report was wrong, but it led to figuring out another
race):

  https://lore.kernel.org/lkml/20200624161142.GA12184@redhat.com/

and there was something else going on too that I can't recall, that
had made me look at the page locking.

And while there, I realized that the simplest explanation for all
those years of softlockups was simply that the page waiting is very
very unfair indeed.

While somebody is patiently waiting for a page, another process can
(and will) come in and get the page lock from under it, and the
original waiter will end up just re-queueing - at the end of the list.
Which explains how you can get those half-minute latencies - not
because any page lock holder really holds the lock for very long
(almost all of them are CPU-bound, not IO bound), but because under
heavy load and contention, you end up with the poor waiters scheduling
away and *new* page lockers end up being treated very preferentially,
with absolutely nothing keeping them from taking the lock while
somebody else is waiting for it.

ANYWAY. That's a long email of background for the commit that I then
put in the tree this merge window:

  2a9127fcf229 ("mm: rewrite wait_on_page_bit_common() logic")

which actually makes the page locking truly fair (well, there's a
small window that allows new lockers to come in, but now it's about
CPU instructions rather than "scheduling a sleeping process to run",
so it's essentially gone as a major and recurring unfairness issue).

And that commit actually had a fair number of positive reports: even
before merging it, Hugh had tested this on yet another "company
internal load" (*cough*google*cough*) that used to show watchdog
failures, and it did seem to solve those, and some of the kernel test
robot benchmarks improved by tens of percent, in one case by 160%.

So everything looked fine. It solves a long-running problem, and for
once I think it really _solves_ it by fixing a fundamental problem,
rather than papering over the symptoms.

But there were a couple of nagging concerns. Hackbench showed wildly
fluctuating performance: some tests improved by a lot, others
regressed by a lot. Not a huge deal, I felt: hackbench isn't a great
benchmark, and the performance fluctuations really seemed to be going
both ways and be very dependent on exact test and machine. So it was a
slight concern, but on the whole not really worth worrying about.

But the reason Michal is on the Cc is because the Phoronix benchmark
suite showed a rather marked decrease in the apache test. Unlike
hackbench, I think that's much more of a "real" test, and it seemed to
be a lot more consistent too. So I asked for profiles (and eventually
just recreated the test locally), and I think I understand what's
going on.

It's the fairness.

Fairness is good, but fairness is usually bad for performance even if
it does get rid of the worst-case issues. In this case, it's _really_
bad for performance, because that page lock has always been unfair,
and we have a lot of patterns that have basically come to
(unintentionally) depend on that unfairness.

In particular, the page locking is often used for just verifying
simple things, with the most common example being "lock page, check
that the mapping is still valid, insert page into page tables, unlock
page".

The reason the apache benchmark regresses is that it basically does a
web server test with a single file ("test.html") that gets served by
just mmap'ing it, and sending it out that way. Using lots of threads,
and using lots of different mappings. So they *all* fault on the read
of that page, and they *all* do that "lock page, check that the
mapping is valid, insert page" dance.

That actually worked ok - not great, but ok - when the page lock was
unfair, and anybody runnable would basically just get it. Yes, people
would occasionally get put on the wait-queue, but those waiting
lockers wouldn't really affect the other ones that are going through
that dance since they would just take the lock anyway. VERY unfair,
but hey, very nice for that load.

It works much less well when the page lock is suddenly fair, and if
anybody starts waiting for it, gets the lock handed to it when the
page is unlocked. Now the page is owned by the next waiter in line,
and they're sleeping, and new page lockers don't magically and
unfairly get to just bypass the older waiter.

This is not a new issue. We've had exactly the same thing happen when
we made spinlocks, semaphores, and rwlocks be fair.

And like those other times, we had to make them fair because *not*
making them fair caused those unacceptable outliers under contention,
to the point of starvation and watchdogs firing.

Anyway, I don't have a great solution. I have a few options (roughly
ordered by "simplest to most complex"):

 (a) just revert
 (b) add some busy-spinning
 (c) reader-writer page lock
 (d) try to de-emphasize the page lock

but I'd love to hear comments.

Honestly, (a) is trivial to do. We've had the problem for years, the
really *bad* cases are fairly rare, and the workarounds mostly work.
Yeah, you get watchdogs firing, but it's not exactly _common_.

But equally honestly, I hate (a). I feel like this change really fixed
a fundamental issue, and after looking at the apache benchmark, in
many ways it's not a great benchmark. The reason it shows such a
(relatively) huge regression is that it hammers on just a single small
file. So my inclination is to say "we know how to fix the performance
regression, even if we may not be able to do so for 5.9, and this
benchmark behavior is very unlikely to actually hit a real load".

Option (b) is just because right now the page lock is very much a
black-and-white "try to lock once or sleep". Where most lockers (the
initial actual IO to fill the page being the main exception) are
CPU-bound, not IO bound. So spinning is the usual simplistic fix for
locking behavior like that. It doesn't really "fix" anything, but it
helps the bad contended performance case and we wouldn't get the
scheduling and sleeping behavior.

I can imagine coming up with a ten-liner patch to add some spinning
that claws back much of the performance on that benchmark. Maybe.

I don't like option (b) very much, but it might be the band-aid for
5.9 if we feel that the benchmark results _might_ translate to real
loads.

Option (c) is, I feel, the best one. Reader-writer locks aren't
wonderful, but the page lock really tends to have two very distinct
uses: exclusive for the initial IO and for the (very very unlikely)
truncate and hole punching issues, and then the above kind of "lock to
check that it's still valid" use, which is very very common and
happens on every page fault and then some. And it would be very
natural to make the latter be a read-lock (or even just a sequence
counting one with retry rather than a real lock).

Option (d) is "we already have a locking in many filesystems that give
us exclusion between faulting in a page, and the truncate/hole punch,
so we shouldn't use the page lock at all".

I do think that the locking that filesystems do is in many ways
inferior - it's done on a per-inode basis rather than on a per-page
basis. But if the filesystems end up doing that *anyway*, what's the
advantage of the finer granularity one? And *because* the common case
is all about the reading case, the bigger granularity tends to work
very well in practice, and basically never sees contention.

So I think option (c) is potentially technically better because it has
smaller locking granularity, but in practice (d) might be easier and
we already effectively do it for several filesystems.

Also, making the page lock be a rw-lock may be "easy" in theory, but
in practice we have the usual "uhhuh, 'struct page' is very crowded,
and finding even just one more bit in the flags to use as a read bit
is not great, and finding a whole reader _count_ would likely require
us to go to that hashed queue, which we know has horrendous cache
behavior from past experience".

This turned out to be a very long email, and probably most people
didn't get this far. But if you did, comments, opinions, suggestions?

Any other suggestions than those (a)-(d) ones above?

               Linus