Message-ID: <4e1011d9-aef3-5cd7-1424-b81aa79128cb@scylladb.com>
Date:   Thu, 16 Mar 2017 15:26:54 +0200
From:   Avi Kivity <avi@...lladb.com>
To:     Michal Hocko <mhocko@...nel.org>
Cc:     linux-kernel@...r.kernel.org, linux-mm@...ck.org
Subject: Re: MAP_POPULATE vs. MADV_HUGEPAGES

On 03/16/2017 02:34 PM, Michal Hocko wrote:
> On Wed 15-03-17 18:50:32, Avi Kivity wrote:
>> A user is trying to allocate 1TB of anonymous memory in parallel on 48 cores
>> (4 NUMA nodes).  The kernel ends up spinning in isolate_freepages_block().
> Which kernel version is that?

A good question; it was 3.10.something-el.something.  The user mentioned
above updated to 4.4, and the problem was gone, so it looks like it is a
Red Hat-specific problem.  I would really like the 3.10.something kernel
to handle this workload well, but I understand that's not this list's 
concern.

> What is the THP defrag mode
> (/sys/kernel/mm/transparent_hugepage/defrag)?

The default (always).
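
For reference, a minimal sketch of how the active defrag mode could be read
programmatically. The sysfs file marks the active mode with square brackets
(for example "[always] madvise never"). The helper names thp_active_mode()
and thp_defrag_mode() are hypothetical, made up for this illustration.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: copy the bracketed (active) token from a sysfs
 * line such as "[always] madvise never" into out.
 * Returns 0 on success, -1 if no bracketed token fits in out. */
static int thp_active_mode(const char *line, char *out, size_t outlen)
{
    const char *l = strchr(line, '[');
    const char *r = l ? strchr(l, ']') : NULL;

    if (!l || !r || outlen == 0)
        return -1;

    size_t n = (size_t)(r - l - 1);
    if (n >= outlen)
        return -1;

    memcpy(out, l + 1, n);
    out[n] = '\0';
    return 0;
}

/* Hypothetical helper: read the THP defrag file and extract the active
 * mode. Returns -1 if the file is missing (e.g. THP not built in). */
static int thp_defrag_mode(char *out, size_t outlen)
{
    char line[128];
    FILE *f = fopen("/sys/kernel/mm/transparent_hugepage/defrag", "r");

    if (!f)
        return -1;

    char *got = fgets(line, sizeof line, f);
    fclose(f);
    return got ? thp_active_mode(line, out, outlen) : -1;
}
```

Calling thp_defrag_mode(buf, sizeof buf) on the machine described here would
be expected to leave "always" in buf, matching the answer above.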

>   
>> I thought to help it along by using MAP_POPULATE, but then my MADV_HUGEPAGE
>> won't be seen until after mmap() completes, with pages already populated.
>> Are MAP_POPULATE and MADV_HUGEPAGE mutually exclusive?
> Why do you need MADV_HUGEPAGE?

So that I get huge pages even if transparent_hugepage/enabled=madvise.  
I'm allocating almost all of the memory of that machine to be used as a 
giant cache, so I want it backed by hugepages.

>   
>> Is my only option to serialize those memory allocations, and fault in those
>> pages manually?  Or perhaps use mlock()?
> I am still not 100% sure I see what you are trying to achieve, though.
> So you do not want all those processes to contend inside the compaction
> while still allocate as many huge pages as possible?

Since the process starts with all of that memory free, there should not 
be any compaction going on (or perhaps very minimal eviction/movement of 
a few pages here and there).  And since it's fixed in later kernels, it 
looks like the contention was not really mandated by the workload, just 
an artifact of the implementation.

To explain the workload again, the process starts, clones as many 
threads as there are logical processors, and each of those threads 
mmap()s (and mbind()s) a chunk of memory and then proceeds to touch it.
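
A rough sketch of that pattern, not the actual application code: each thread
maps its own anonymous chunk without MAP_POPULATE, applies MADV_HUGEPAGE
before the first touch, and then faults the pages in by hand.  The names
struct chunk, worker() and run_workload() are made up for illustration, the
mbind() step is left out to avoid a libnuma dependency, and a failed
madvise() is treated as non-fatal.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <pthread.h>
#include <stddef.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical per-thread state for this sketch. */
struct chunk {
    size_t len;   /* bytes to map */
    void  *addr;  /* mapping, or NULL on failure */
    int    ok;    /* 1 once the chunk has been mapped and touched */
};

static void *worker(void *arg)
{
    struct chunk *c = arg;

    c->ok = 0;
    c->addr = mmap(NULL, c->len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (c->addr == MAP_FAILED) {
        c->addr = NULL;
        return NULL;
    }

    /* An mbind() call would go here in the real workload (omitted). */

    /* The hint has to be in place before the pages are faulted in,
     * which is why MAP_POPULATE is not used above.  The hint is only
     * advisory, so a failure is ignored. */
    (void)madvise(c->addr, c->len, MADV_HUGEPAGE);

    /* Touch one byte per base page to fault the memory in. */
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    volatile char *p = c->addr;
    for (size_t off = 0; off < c->len; off += page)
        p[off] = 1;

    c->ok = 1;
    return NULL;
}

/* Start nthreads workers, each with its own chunk of chunk_len bytes.
 * Returns how many chunks were mapped and touched, or -1 on setup
 * failure.  The chunks are unmapped again before returning. */
static int run_workload(int nthreads, size_t chunk_len)
{
    if (nthreads <= 0 || chunk_len == 0)
        return -1;

    pthread_t *tids = calloc((size_t)nthreads, sizeof *tids);
    struct chunk *chunks = calloc((size_t)nthreads, sizeof *chunks);
    if (!tids || !chunks) {
        free(tids);
        free(chunks);
        return -1;
    }

    int started = 0;
    for (int i = 0; i < nthreads; i++) {
        chunks[i].len = chunk_len;
        if (pthread_create(&tids[i], NULL, worker, &chunks[i]) != 0)
            break;
        started++;
    }

    int ok = 0;
    for (int i = 0; i < started; i++) {
        pthread_join(tids[i], NULL);
        if (chunks[i].ok) {
            ok++;
            munmap(chunks[i].addr, chunks[i].len);
        }
    }

    free(tids);
    free(chunks);
    return ok;
}
```

To mirror the description, nthreads would be sysconf(_SC_NPROCESSORS_ONLN)
and chunk_len the per-thread share of the cache.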