Message-ID: <CAHk-=wjYHvbOs9i39EnUsC6VEJiuJ2e_5gZB5-J5CRKxq80B_Q@mail.gmail.com>
Date: Fri, 24 Jul 2020 16:25:56 -0700
From: Linus Torvalds <torvalds@...ux-foundation.org>
To: Oleg Nesterov <oleg@...hat.com>
Cc: Hugh Dickins <hughd@...gle.com>, Michal Hocko <mhocko@...nel.org>,
Linux-MM <linux-mm@...ck.org>,
LKML <linux-kernel@...r.kernel.org>,
Andrew Morton <akpm@...ux-foundation.org>,
Tim Chen <tim.c.chen@...ux.intel.com>,
Michal Hocko <mhocko@...e.com>
Subject: Re: [RFC PATCH] mm: silence soft lockups from unlock_page
On Fri, Jul 24, 2020 at 10:32 AM Linus Torvalds
<torvalds@...ux-foundation.org> wrote:
> Ok, that makes sense. Except you did it on top of the original patch
> without the fix to set WQ_FLAG_WOKEN for the non-wakeup case.
Hmm.
I just realized that one thing we could do is to not even test the
page bit for the shared case in the wakeup path.
Because anybody who uses the non-exclusive "wait_on_page_locked()" or
"wait_on_page_writeback()" isn't actually interested in the bit state
any more at that point. All they care about is that somebody cleared
it - not whether it was then re-taken again.
So instead of keeping them on the list - or stopping the waitqueue
walk because somebody else got the bit - we could just mark them
successfully done, wake them up, and remove those entries from the
list.
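The proposed wakeup walk can be modeled in a few lines of userspace C. This is only an illustrative sketch, not the kernel code: the `struct waiter` fields and the `wake_walk()` name are made up for this model, standing in for the real wait-queue entries and the `wake_page_function()` callback.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified userspace model of the proposed wake_up_page() walk.
 * Names ("waiter", "wake_walk") are hypothetical, for illustration. */
struct waiter {
	bool exclusive;        /* lock_page()-style waiter */
	bool done;             /* woken and unlinked from the queue */
	struct waiter *next;
};

/* Walk the wait list when PG_locked is cleared.  Non-exclusive
 * waiters (wait_on_page_locked() callers) are always marked done
 * and unlinked: they only care that the bit was cleared once, not
 * whether somebody re-took it.  The first exclusive waiter gets
 * the bit handed to it and ends the walk. */
static void wake_walk(struct waiter **list)
{
	struct waiter **pp = list;

	while (*pp) {
		struct waiter *w = *pp;

		if (!w->exclusive) {
			w->done = true;   /* wake unconditionally */
			*pp = w->next;    /* drop from the queue */
			continue;
		}
		/* Exclusive waiter: hand over the bit, stop walking. */
		w->done = true;
		*pp = w->next;
		break;
	}
}
```

With a queue of one non-exclusive waiter followed by two exclusive ones, the walk wakes and unlinks the first two and leaves the last exclusive waiter queued, which is the "fewer entries on the wait queues" effect described above.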
That would be better for everybody - less pointless waiting for a new
lock or writeback event, but also fewer entries on the wait queues as
we get rid of them more quickly instead of walking them over and over
just because somebody else re-took the page lock.
Generally, "wait_on_page_locked()" is used for two things:
- either wait for the IO to finish, then check if the page is now uptodate
- throttle things that can't afford to lock the page (eg page faults
that dropped the mm lock, and as such need to go through the whole
restart logic, so it's too late to lock the page anyway - but also
the page migration code)
In the first case, waiting to actually see the locked bit clear is
pointless - the code only cared about the "wait for IO in progress",
not about the lock bit itself.
And that second case generally might want to retry, but doesn't want
to busy-loop.
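That first case boils down to "wait, then look at the result bit". A minimal userspace model of the pattern (the `struct page_model` type and `read_result()` are invented for this sketch; in the kernel this is wait_on_page_locked() followed by a PageUptodate() check):

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Toy model: page flags reduced to two booleans. */
struct page_model {
	bool locked;    /* PG_locked: IO (or somebody else) holds it */
	bool uptodate;  /* PG_uptodate: the read completed successfully */
};

/* The waiter never needed the lock itself.  Once PG_locked has been
 * cleared at least once, the IO is over and only uptodate matters;
 * whether another task re-locked the page afterwards is irrelevant. */
static int read_result(const struct page_model *page)
{
	/* wait_on_page_locked(page) would sleep here until the bit
	 * was cleared once; this model assumes the IO already ended. */
	return page->uptodate ? 0 : -EIO;
}
```

Note that the page being locked again at the time of the check does not change the answer - which is exactly why waking such waiters without re-testing the bit is safe.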
And "wait_on_page_writeback()" is basically used for similar reasons
(ie check if there were IO errors, but also possibly to throttle
writeback traffic).
Saying "stop walking, keep it on the list" seems wrong. It makes IO
error handling and retries much worse, for example.
So it turns out that the wakeup logic and the initial wait logic don't
have so much in common after all, and there is a fundamental
conceptual difference between the "check the bit one last time" case
and the "we got woken up, now what" case.
End result: one final (yes, hopefully - I think I'm done) version of
this patch-series.
This not only makes the code cleaner (the generated code for
wake_up_page() is really quite nice now), but getting rid of extra
waiting might help the load that Michal reported.
Because a lot of page waiting might be that non-exclusive
"wait_on_page_locked()" kind, particularly in the thundering herd kind
of situation where one process starts IO, and then other processes
wait for it to finish.
Those users don't even care if somebody else then did a "lock_page()"
for some other reason (maybe for writeback). They are generally
perfectly happy with a locked page, as long as it's now up-to-date.
So this not only simplifies the code, it really might avoid some problems too.
Linus
View attachment "0001-mm-rewrite-wait_on_page_bit_common-logic.patch" of type "text/x-patch" (6940 bytes)
View attachment "0002-list-add-list_del_init_careful-to-go-with-list_empty.patch" of type "text/x-patch" (2902 bytes)