Date:   Mon, 11 Jan 2021 11:03:27 -0800
From:   Linus Torvalds <torvalds@...ux-foundation.org>
To:     Hugh Dickins <hughd@...gle.com>
Cc:     Greg Kroah-Hartman <gregkh@...uxfoundation.org>,
        Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
        stable <stable@...r.kernel.org>,
        syzbot <syzbot+2fc0712f8f8b8b8fa0ef@...kaller.appspotmail.com>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Matthew Wilcox <willy@...radead.org>, stable@...nel.org
Subject: Re: [PATCH 5.10 109/145] mm: make wait_on_page_writeback() wait for
 multiple pending writebacks

On Mon, Jan 11, 2021 at 9:55 AM Hugh Dickins <hughd@...gle.com> wrote:
>
> I think it's too early to push this one through to stable:
> Linus mentioned on Friday that Michael Larabel of Phoronix
> has observed a performance regression from this commit.

That turned out to be a red herring. Yes, Michael saw a performance
regression on some machines, but the change to go back to the old
model (where the repeat was basically at wakeup time rather than in
the waiter) didn't actually make any difference.

And the issue only showed on a couple of machines, and only with
certain configurations (ie apparently switching to the performance
governor made it go away).

So it seems to have been some modal behavior caused by random timing
(possibly just interaction with cpufreq) rather than a real
regression.

I think the real issue is simply that some loads are very sensitive to
the exact writeback timing patterns. And I think we're making things
worse by having some of the patterns be very non-deterministic indeed.

For example, long before we actually take the page lock and then wait
for (and set) the page writeback bit, look at how we first use the
Xarray tags to turn the "page dirty" tag into the "page needs
writeback" tag, and then look up an array of such writeback pages: all
without any real locking at all (apart from the xas lock itself for
the tagging op).
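
For reference, the pattern looks roughly like this - a simplified
sketch modeled on tag_pages_for_writeback() and
find_get_pages_range_tag(); the helper name is made up, the batching,
retry and refcounting details are left out, and mapping/start/end are
just the usual address_space range:

	static void tag_and_collect(struct address_space *mapping,
				    pgoff_t start, pgoff_t end)
	{
		XA_STATE(xas, &mapping->i_pages, start);
		struct page *page;

		/* Step 1: DIRTY -> TOWRITE, under the xas lock only */
		xas_lock_irq(&xas);
		xas_for_each_marked(&xas, page, end, PAGECACHE_TAG_DIRTY)
			xas_set_mark(&xas, PAGECACHE_TAG_TOWRITE);
		xas_unlock_irq(&xas);

		/* Step 2: look the tagged pages up again - no page lock at all */
		xas_set(&xas, start);
		rcu_read_lock();
		xas_for_each_marked(&xas, page, end, PAGECACHE_TAG_TOWRITE) {
			if (xas_retry(&xas, page))
				continue;
			if (!page_cache_get_speculative(page))
				continue;
			/* collect the page for writeback ... */
			put_page(page);	/* real code keeps the ref in a pagevec */
		}
		rcu_read_unlock();
	}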

Making things even less deterministic, the code that doesn't do
writeback - but just _waits_ for writeback - doesn't actually
serialize with the page lock at all. It _just_ does that
"wait_on_page_writeback()", which is ambiguous when there are
consecutive writebacks happening. That's actually the case that I
think Michael would have seen - because he obviously never saw the
(very very rare) BUG_ON.
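
For context, the code before this patch is (roughly, from memory) just
this, and the "if" is exactly where back-to-back writebacks get
confused - one wait_on_page_bit() wakeup only tells you that *some*
writeback finished, not that the page isn't already under the next
one:

	void wait_on_page_writeback(struct page *page)
	{
		if (PageWriteback(page)) {
			trace_wait_on_page_writeback(page, page_mapping(page));
			wait_on_page_bit(page, PG_writeback);
		}
	}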

The BUG_ON() in page writeback itself is serialized by the page lock,
so there aren't really many possibilities for it to see contention or
other odd behavior (the wakeup race being the one very very unlikely
notable one). In contrast, the "wait for writeback" path isn't
serialized by anything else, so that one is literally "if the page was
under writeback at some point, wait for it to no longer be", and the
aggressive wakeup was good for that case, while it caused problems for
the writeback case.
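
That serialization is visible in write_cache_pages(), where the whole
sequence runs with the page lock held (roughly, details elided):

	lock_page(page);

	/* ... mapping and dirty re-checks, all under the page lock ... */

	if (PageWriteback(page)) {
		if (wbc->sync_mode != WB_SYNC_NONE)
			wait_on_page_writeback(page);
		else
			goto continue_unlock;
	}

	BUG_ON(PageWriteback(page));	/* <- the (very rarely) triggering one */
	if (!clear_page_dirty_for_io(page))
		goto continue_unlock;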

Anyway, the numbers are all ambiguous, the one-liner fix is not
horrible, and the take-away from all of this is mostly that it would
be good to have some more clarity about the whole writeback and
wait-for-writeback thing.
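
(The one-liner in question being just the "if" -> "while" in
wait_on_page_writeback():

	-	if (PageWriteback(page)) {
	+	while (PageWriteback(page)) {

so a waiter re-checks the bit after every wakeup instead of assuming
that one wakeup means "no longer under writeback".)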

In many ways it would be really nice to have a sequence count rather
than just a single bit. But obviously that does not work for 'struct
page'.
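
(Purely as an illustration of the idea - hypothetical, since there is
no spare field in 'struct page' for it - a waiter would sample a
per-page writeback sequence count and wait for it to advance, instead
of waiting on a single bit that may have been cleared and set again
behind its back:

	/* 'writeback_seq' and 'page_wb_wq' do not exist; sketch only */
	unsigned int seq = smp_load_acquire(&page->writeback_seq);

	if (PageWriteback(page))
		wait_event(page_wb_wq,
			   smp_load_acquire(&page->writeback_seq) != seq);
)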

Anyway, don't hold up this "get rid of BUG_ON() in writeback" patch for this.

              Linus
