linux-kernel - Re: [PATCH 1/2] sched/wait: Break up long wake list walk

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CA+55aFy=4y0fq9nL2WR1x8vwzJrDOdv++r036LXpR=6Jx8jpzg@mail.gmail.com>
Date:   Tue, 22 Aug 2017 11:19:12 -0700
From:   Linus Torvalds <torvalds@...ux-foundation.org>
To:     "Liang, Kan" <kan.liang@...el.com>
Cc:     Mel Gorman <mgorman@...hsingularity.net>,
        Mel Gorman <mgorman@...e.de>,
        "Kirill A. Shutemov" <kirill.shutemov@...ux.intel.com>,
        Tim Chen <tim.c.chen@...ux.intel.com>,
        Peter Zijlstra <peterz@...radead.org>,
        Ingo Molnar <mingo@...e.hu>, Andi Kleen <ak@...ux.intel.com>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Johannes Weiner <hannes@...xchg.org>, Jan Kara <jack@...e.cz>,
        linux-mm <linux-mm@...ck.org>,
        Linux Kernel Mailing List <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH 1/2] sched/wait: Break up long wake list walk

On Tue, Aug 22, 2017 at 10:23 AM, Liang, Kan <kan.liang@...el.com> wrote:
>
> Although the patch doesn't trigger watchdog, the spin lock wait time
> is not small (0.45s).
> It may get worse again on larger systems.

Yeah, I don't think Mel's patch is great - because I think we could do
so much better.

What I like about Mel's patch is that it recognizes that
"wait_on_page_locked()" there is special, and replaces it with
something else. I think that "something else" is worse than my
"yield()" call, though.

In particular, it wastes CPU time even in the good case, and the
process that will unlock the page may actually be waiting for us to
reschedule. It may be CPU bound, but it might well have just been
preempted out.

So if we do busy loops, I really think we should also make sure that
the thing we're waiting for is not preempted.

HOWEVER, I'm actually starting to think that there is perhaps
something else going on.

Let me walk you through my thinking:

This is the migration logic:

 (a) migration locks the page

 (b) migration is supposedly CPU-limited

 (c) migration then unlocks the page.

Ignore all the details, that's the 10.000 ft view. Right?

Now, if the above is right, then I have a question for people:

  HOW IN THE HELL DO WE HAVE TIME FOR THOUSANDS OF THREADS TO HIT THAT ONE PAGE?

That just sounds really sketchy to me. Even if all those thousands of
threads are runnable, we need to schedule into them just to get them
to wait on that one page.

So that sounds really quite odd when migration is supposed to hold the
page lock for a relatively short time and get out. Don't you agree?

Which is why I started thinking of what the hell could go on for that
long wait-queue to happen.

One thing that strikes me is that the way wait_on_page_bit() works is
that it will NOT wait until the next bit clearing, it will wait until
it actively *sees* the page bit being clear.

Now, work with me on that. What's the difference?

What we could have is some bad NUMA balancing pattern that actually
has a page that everybody touches.

And hey, we pretty much know that everybody touches that page, since
people get stuck on that wait-queue, right?

And since everybody touches it, as a result everybody eventually
thinks that page should be migrated to their NUMA node.

But for all we know, the migration keeps on failing, because one of
the points of that "lock page - try to move - unlock page" is that
*TRY* in "try to move". There's a number of things that makes it not
actually migrate. Like not being movable, or failing to isolate the
page, or whatever.

So we could have some situation where we end up locking and unlocking
the page over and over again (which admittedly is already a sign of
something wrong in the NUMA balancing, but that's a separate issue).

And if we get into that situation, where everybody wants that one hot
page, what happens to the waiters?

One of the thousands of waiters is unlucky (remember, this argument
started with the whole "you shouldn't get that many waiters on one
single page that isn't even locked for that long"), and goes:

 (a) Oh, the page is locked, I will wait for the lock bit to clear

 (b) go to sleep

 (c) the migration fails, the lock bit is cleared, the waiter is woken
up but doesn't get the CPU immediately, and one of the other
*thousands* of threads decides to also try to migrate (see above),

 (d) the guy waiting for the lock bit to clear will see the page
"still" locked (really just "locked again") and continue to wait.

In the meantime, one of the other threads happens to be unlucky, also
hits the race, and now we have one more thread waiting for that page
lock. It keeps getting unlocked, but it also keeps on getting locked,
and so the queue can keep growing.

See where I'm going here? I think it's really odd how *thousands* of
threads can hit that locked window that is supposed to be pretty
small. But I think it's much more likely if we have some kind of
repeated event going on.

So I'm starting to think that part of the problem may be how stupid
that "wait_for_page_bit_common()" code is. It really shouldn't wait
until it sees that the bit is clear. It could have been cleared and
then re-taken.

And honestly, we actually have extra code for that "let's go round
again". That seems pointless. If the bit has been cleared, we've been
woken up, and nothing else would have done so anyway, so if we're not
interested in locking, we're simply *done* after we've done the
"io_scheduler()".

So I propose testing the attached trivial patch. It may not do
anything at all. But the existing code is actually doing extra work
just to be fragile, in case the scenario above can happen.

Comments?

                Linus

View attachment "patch.diff" of type "text/plain" (572 bytes)