linux-kernel - Re: [PATCH 1/2] sched/wait: Break up long wake list walk


Message-ID: <CA+55aFy_RNx5TQ8esjPPOKuW-o+fXbZgWapau2MHyexcAZtqsw@mail.gmail.com>
Date:   Thu, 17 Aug 2017 13:44:40 -0700
From:   Linus Torvalds <torvalds@...ux-foundation.org>
To:     "Liang, Kan" <kan.liang@...el.com>, Mel Gorman <mgorman@...e.de>,
        "Kirill A. Shutemov" <kirill.shutemov@...ux.intel.com>
Cc:     Tim Chen <tim.c.chen@...ux.intel.com>,
        Peter Zijlstra <peterz@...radead.org>,
        Ingo Molnar <mingo@...e.hu>, Andi Kleen <ak@...ux.intel.com>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Johannes Weiner <hannes@...xchg.org>, Jan Kara <jack@...e.cz>,
        linux-mm <linux-mm@...ck.org>,
        Linux Kernel Mailing List <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH 1/2] sched/wait: Break up long wake list walk

On Thu, Aug 17, 2017 at 1:18 PM, Liang, Kan <kan.liang@...el.com> wrote:
>
> Here is the call stack of wait_on_page_bit_common
> when the queue is long (entries >1000).
>
> # Overhead  Trace output
> # ........  ..................
> #
>    100.00%  (ffffffff931aefca)
>             |
>             ---wait_on_page_bit
>                __migration_entry_wait
>                migration_entry_wait
>                do_swap_page
>                __handle_mm_fault
>                handle_mm_fault
>                __do_page_fault
>                do_page_fault
>                page_fault

Hmm. Ok, so it does seem to very much be related to migration. Your
wake_up_page_bit() profile made me suspect that, but this one seems to
pretty much confirm it.

So it looks like it's that wait_on_page_locked() call in
__migration_entry_wait(), and what probably happens is that your load
ends up triggering a lot of migration (or just migration of a very hot
page), and then *every* thread ends up waiting for whatever page is
being migrated.

And so the wait queue for that page grows hugely long.
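
To make the cost concrete, here is a toy userspace sketch of why a long
queue hurts: the waker has to visit every queued entry, so the walk is
linear in the number of waiters. All names below (toy_waiter, toy_queue,
toy_wait_on_page, toy_wake_page) are hypothetical and for illustration
only. This is not the kernel's wait queue code, and it leaves out locking
and the hashed wait table entirely.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a queued waiter; NOT struct wait_queue_entry. */
struct toy_waiter {
    int page_id;                 /* which page this waiter sleeps on */
    struct toy_waiter *next;
};

/* Hypothetical queue: a singly linked list plus a length counter. */
struct toy_queue {
    struct toy_waiter *head;
    size_t len;
};

/* Queue one waiter for page_id. Returns 0 on success, -1 if malloc fails. */
int toy_wait_on_page(struct toy_queue *q, int page_id)
{
    struct toy_waiter *w = malloc(sizeof(*w));

    if (!w)
        return -1;
    w->page_id = page_id;
    w->next = q->head;
    q->head = w;
    q->len++;
    return 0;
}

/*
 * "Wake" (unlink and free) every waiter for page_id.
 * Returns the number of entries visited; *woken gets the number removed.
 * The walk always touches every entry in the queue.
 */
size_t toy_wake_page(struct toy_queue *q, int page_id, size_t *woken)
{
    struct toy_waiter **pp = &q->head;
    size_t visited = 0;

    *woken = 0;
    while (*pp) {
        struct toy_waiter *w = *pp;

        visited++;
        if (w->page_id == page_id) {
            *pp = w->next;       /* unlink the matching waiter */
            free(w);
            q->len--;
            (*woken)++;
        } else {
            pp = &w->next;       /* skip waiters for other pages */
        }
    }
    return visited;
}
```

With 1000 threads queued on one hot page, a single toy_wake_page() call
visits all 1000 entries, which matches the "entries >1000" situation in
the report above.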

Looking at the other profile, the thing that is locking the page (that
everybody then ends up waiting on) would seem to be
migrate_misplaced_transhuge_page(), so this is _presumably_ due to
NUMA balancing.

Does the problem go away if you disable the NUMA balancing code?
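
For anyone wanting to try that experiment: automatic NUMA balancing is
normally toggled at runtime through the kernel.numa_balancing sysctl,
i.e. the file /proc/sys/kernel/numa_balancing (writing 0 disables it).
A minimal sketch in C follows, assuming the kernel was built with
CONFIG_NUMA_BALANCING and the caller is root. The helper names are made
up for this example, and the path is passed in as a parameter so the
helpers can be tried against an ordinary file first.

```c
#include <stdio.h>

/*
 * Read an integer setting from a sysctl-style file.
 * Returns the value, or -1 if the file is missing or unparsable
 * (e.g. a kernel without NUMA balancing support).
 */
int read_numa_balancing(const char *path)
{
    FILE *f = fopen(path, "r");
    int value;

    if (!f)
        return -1;
    if (fscanf(f, "%d", &value) != 1)
        value = -1;
    fclose(f);
    return value;
}

/*
 * Write an integer setting to a sysctl-style file.
 * Returns 0 on success, -1 on failure (typically a permission error).
 */
int write_numa_balancing(const char *path, int value)
{
    FILE *f = fopen(path, "w");
    int ok;

    if (!f)
        return -1;
    ok = fprintf(f, "%d\n", value) > 0;
    /* Buffered write errors only show up when the stream is flushed. */
    if (fclose(f) != 0)
        ok = 0;
    return ok ? 0 : -1;
}
```

Calling write_numa_balancing("/proc/sys/kernel/numa_balancing", 0) and
then rerunning the workload would be one way to answer the question
above; writing 1 afterwards should turn balancing back on.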

Adding Mel and Kirill to the participants, just to make them aware of
the issue, and just because their names show up when I look at blame.

              Linus