Date:   Tue, 22 Aug 2017 15:52:17 -0700
From:   Linus Torvalds <torvalds@...ux-foundation.org>
To:     Andi Kleen <ak@...ux.intel.com>
Cc:     Christopher Lameter <cl@...ux.com>,
        Peter Zijlstra <peterz@...radead.org>,
        "Liang, Kan" <kan.liang@...el.com>,
        Mel Gorman <mgorman@...hsingularity.net>,
        Mel Gorman <mgorman@...e.de>,
        "Kirill A. Shutemov" <kirill.shutemov@...ux.intel.com>,
        Tim Chen <tim.c.chen@...ux.intel.com>,
        Ingo Molnar <mingo@...e.hu>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Johannes Weiner <hannes@...xchg.org>, Jan Kara <jack@...e.cz>,
        linux-mm <linux-mm@...ck.org>,
        Linux Kernel Mailing List <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH 1/2] sched/wait: Break up long wake list walk

On Tue, Aug 22, 2017 at 2:24 PM, Andi Kleen <ak@...ux.intel.com> wrote:
>
> I believe in this case it's used by threads, so a reference count limit
> wouldn't help.

For the first migration try, yes. But if it's some kind of "try and
try again" pattern, the second time you try and there are people
waiting for the page, the page count (not the map count) would be
elevated.

So it's possible that depending on exactly what the deeper problem is,
the "this page is very busy, don't migrate" case might be
discoverable, and the page count might be part of it.
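
Just to make that concrete, the kind of heuristic I'm thinking of would
look something like this (entirely untested sketch - the helper name and
the exact accounting are made up, and a real check would have to mirror
whatever the migration code counts as "expected" references):

	#include <linux/mm.h>
	#include <linux/page-flags.h>

	/*
	 * Hypothetical helper: if the page has more references than its
	 * mappings (plus our own reference, plus the page cache for
	 * file-backed pages) can explain, other CPUs are actively using
	 * or waiting on it, and migrating it is likely just a waste.
	 */
	static bool page_looks_contended(struct page *page)
	{
		int expected = page_mapcount(page) + 1 + !PageAnon(page);

		return page_count(page) > expected;
	}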

However, after PeterZ's comment that page migration should go through
that should_numa_migrate_memory() filter, I am now looking at the
mpol_misplaced() code.

And honestly, that MPOL_PREFERRED / MPOL_F_LOCAL case really looks
like complete garbage to me.

It looks like garbage exactly because it says "always migrate to the
current node", but that's crazy - if it's a group of threads all
running together on the same VM, that obviously will just bounce the
page around for absolutely zero good reason.
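
For reference, the logic I'm looking at is roughly this (paraphrasing
mm/mempolicy.c from memory, so the details may be slightly off):

	int curnid = page_to_nid(page);
	int thiscpu = raw_smp_processor_id();
	int thisnid = cpu_to_node(thiscpu);
	int polnid = -1;
	int ret = -1;			/* default: don't migrate */
	...
	switch (pol->mode) {
	...
	case MPOL_PREFERRED:
		if (pol->flags & MPOL_F_LOCAL)
			polnid = thisnid;	/* "local is always right" */
		else
			polnid = pol->v.preferred_node;
		break;
	...
	}

	/* Migrate the page towards the node whose CPU is referencing it */
	if (pol->flags & MPOL_F_MORON) {
		polnid = thisnid;
		if (!should_numa_migrate_memory(current, page, curnid, thiscpu))
			goto out;
	}

	if (curnid != polnid)
		ret = polnid;
out:
	return ret;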

The *other* memory policies look fairly sane. They basically have a
fairly well-defined preferred node for the policy (although the
"MPOL_INTERLEAVE" looks wrong for a hugepage).  But
MPOL_PREFERRED/MPOL_F_LOCAL really looks completely broken.

Maybe people expected that anybody who uses MPOL_F_LOCAL will also
bind all threads to one single node?

Could we perhaps make that "MPOL_PREFERRED / MPOL_F_LOCAL" case just
do the MPOL_F_MORON policy, which *does* use that "should I migrate to
the local node" filter?
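
Something along these lines, I mean (just a sketch of the direction to
show what I'm talking about - the attached patch is the actual thing):

	case MPOL_PREFERRED:
		if (!(pol->flags & MPOL_F_LOCAL)) {
			polnid = pol->v.preferred_node;
			break;
		}
		/*
		 * MPOL_F_LOCAL: don't blindly claim that the local node
		 * is always right - run it through the same filter the
		 * MPOL_F_MORON case uses before bouncing the page here.
		 */
		polnid = thisnid;
		if (!should_numa_migrate_memory(current, page, curnid, thiscpu))
			goto out;
		break;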

IOW, we've been looking at the waiters (because the problem shows up
due to the excessive wait queues), but maybe the source of the problem
comes from the numa balancing code just insanely bouncing pages
back-and-forth if you use that "always balance to local node" thing.

Untested (as always) patch attached.

              Linus

[Attachment: patch.diff (text/plain, 840 bytes)]
