Message-ID: <37D7C6CF3E00A74B8858931C1DB2F0775378A8BB@SHSMSX103.ccr.corp.intel.com>
Date: Wed, 23 Aug 2017 14:51:13 +0000
From: "Liang, Kan" <kan.liang@...el.com>
To: Linus Torvalds <torvalds@...ux-foundation.org>,
Andi Kleen <ak@...ux.intel.com>
CC: Christopher Lameter <cl@...ux.com>,
Peter Zijlstra <peterz@...radead.org>,
Mel Gorman <mgorman@...hsingularity.net>,
Mel Gorman <mgorman@...e.de>,
"Kirill A. Shutemov" <kirill.shutemov@...ux.intel.com>,
Tim Chen <tim.c.chen@...ux.intel.com>,
Ingo Molnar <mingo@...e.hu>,
Andrew Morton <akpm@...ux-foundation.org>,
Johannes Weiner <hannes@...xchg.org>, Jan Kara <jack@...e.cz>,
linux-mm <linux-mm@...ck.org>,
Linux Kernel Mailing List <linux-kernel@...r.kernel.org>
Subject: RE: [PATCH 1/2] sched/wait: Break up long wake list walk
> Subject: Re: [PATCH 1/2] sched/wait: Break up long wake list walk
>
> On Tue, Aug 22, 2017 at 2:24 PM, Andi Kleen <ak@...ux.intel.com> wrote:
> >
> > I believe in this case it's used by threads, so a reference count
> > limit wouldn't help.
>
> For the first migration attempt, yes. But if it's some kind of "try and
> try again" pattern, then the second time you try, while people are
> waiting for the page, the page count (not the map count) would be
> elevated.
>
> So it's possible that depending on exactly what the deeper problem is, the
> "this page is very busy, don't migrate" case might be discoverable, and the
> page count might be part of it.
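>
> Something like this hypothetical filter is what I mean (a sketch only,
> not existing kernel code; the expected-count math would need the same
> care the real migration path takes, and BUSY_SLACK is a made-up
> tunable):
>
>         /* Skip migrating a page that lots of people are waiting on. */
>         static bool page_too_busy_to_migrate(struct page *page)
>         {
>                 /*
>                  * Waiters that took a page reference before sleeping
>                  * push page_count() above what the mapping and the
>                  * page tables account for, so a large gap suggests
>                  * that migrating will just make everybody block.
>                  */
>                 int expected = page_mapcount(page) + 1 +
>                                (page_mapping(page) ? 1 : 0);
>
>                 return page_count(page) > expected + BUSY_SLACK;
>         }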
>
> However, after PeterZ commented that page migration should go through
> that should_numa_migrate_memory() filter, I am looking at the
> mpol_misplaced() code.
>
> And honestly, that MPOL_PREFERRED / MPOL_F_LOCAL case really looks like
> complete garbage to me.
>
> It looks like garbage exactly because it says "always migrate to the
> current node", but that's crazy - if it's a group of threads all
> running together in the same VM, that obviously will just bounce the
> page around for absolutely no good reason.
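>
> For reference, the case I'm complaining about looks roughly like this
> (paraphrased from mm/mempolicy.c, so possibly not verbatim):
>
>         case MPOL_PREFERRED:
>                 if (pol->flags & MPOL_F_LOCAL)
>                         /* "preferred" node is whatever node we fault on */
>                         polnid = numa_node_id();
>                 else
>                         polnid = pol->v.preferred_node;
>                 break;
>
> Every thread that touches the page from a remote node nominates its
> own node as the target, so a page shared by threads spread across the
> machine keeps getting re-nominated for migration.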
>
> The *other* memory policies look fairly sane. They basically have a fairly
> well-defined preferred node for the policy (although the
> "MPOL_INTERLEAVE" looks wrong for a hugepage). But
> MPOL_PREFERRED/MPOL_F_LOCAL really looks completely broken.
>
> Maybe people expected that anybody who uses MPOL_F_LOCAL would also
> bind all their threads to one single node?
>
> Could we perhaps make that "MPOL_PREFERRED / MPOL_F_LOCAL" case just
> do the MPOL_F_MORON policy, which *does* use that "should I migrate to
> the local node" filter?
>
> IOW, we've been looking at the waiters (because the problem shows up due
> to the excessive wait queues), but maybe the source of the problem comes
> from the numa balancing code just insanely bouncing pages back-and-forth if
> you use that "always balance to local node" thing.
>
> Untested (as always) patch attached.
The patch doesn’t work.
Thanks,
Kan