linux-kernel - MADV_HUGEPAGE vs. NUMA semantic (was: Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression)

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20181206091405.GD1286@dhcp22.suse.cz>
Date:   Thu, 6 Dec 2018 10:14:06 +0100
From:   Michal Hocko <mhocko@...nel.org>
To:     Linus Torvalds <torvalds@...ux-foundation.org>
Cc:     Andrea Arcangeli <aarcange@...hat.com>,
        mgorman@...hsingularity.net, Vlastimil Babka <vbabka@...e.cz>,
        ying.huang@...el.com, s.priebe@...fihost.ag,
        Linux List Kernel Mailing <linux-kernel@...r.kernel.org>,
        alex.williamson@...hat.com, lkp@...org,
        David Rientjes <rientjes@...gle.com>, kirill@...temov.name,
        Andrew Morton <akpm@...ux-foundation.org>,
        zi.yan@...rutgers.edu
Subject: MADV_HUGEPAGE vs. NUMA semantic (was: Re: [LKP] [mm] ac5b2c1891:
 vm-scalability.throughput -61.3% regression)

On Wed 05-12-18 16:58:02, Linus Torvalds wrote:
[...]
> I realize that we probably do want to just have explicit policies that
> do not exist right now, but what are (a) sane defaults, and (b) sane
> policies?

I would focus on the current default first (which is defrag=madvise).
This means that we only try the cheapest possible THP without
MADV_HUGEPAGE. If there is none we simply fallback. We do restrict to
the local node. I guess there is a general agreement that this is a sane
default.

MADV_HUGEPAGE changes the picture because the caller expressed a need
for THP and is willing to go extra mile to get it. That involves
allocation latency and as of now also a potential remote access. We do
not have complete agreement on the later but the prevailing argument is
that any strong NUMA locality is just reinventing node-reclaim story
again or makes THP success rate down the toilet (to quote Mel). I agree
that we do not want to fallback to a remote node overeagerly. I believe
that something like the below would be sensible
	1) THP on a local node with compaction not giving up too early
	2) THP on a remote node in NOWAIT mode - so no direct
	   compaction/reclaim (trigger kswapd/kcompactd only for
	   defrag=defer+madvise)
	3) fallback to the base page allocation

This would allow both full memory utilization and try to be as local as
possible. Whoever strongly prefers NUMA locality should be using
MPOL_NODE_RECLAIM (or similar) and that would skip 2 and make 1) and 2)
use more aggressive compaction and reclaim.

This will also fit into our existing NUMA api. MPOL_NODE_RECLAIM
wouldn't be restricted to THP obviously. It would act on base pages as
well and it would basically use the same implementation as we have for
the global node_reclaim and make it usable again.

Does this sound at least remotely sane?
-- 
Michal Hocko
SUSE Labs