linux-kernel - Re: [PATCH v2] mm: Reduce memory bloat with THP

Message-ID: <ce7c1498-9f28-2eb0-67b7-ade9b04b8e2b@oracle.com>
Date:   Fri, 19 Jan 2018 12:59:17 -0800
From:   Nitin Gupta <nitin.m.gupta@...cle.com>
To:     Michal Hocko <mhocko@...nel.org>,
        Nitin Gupta <nitingupta910@...il.com>
Cc:     steven.sistare@...cle.com,
        Andrew Morton <akpm@...ux-foundation.org>,
        Ingo Molnar <mingo@...nel.org>, Mel Gorman <mgorman@...e.de>,
        Nadav Amit <namit@...are.com>,
        Minchan Kim <minchan@...nel.org>,
        "Kirill A. Shutemov" <kirill.shutemov@...ux.intel.com>,
        Peter Zijlstra <peterz@...radead.org>,
        Vegard Nossum <vegard.nossum@...cle.com>,
        "Levin, Alexander (Sasha Levin)" <alexander.levin@...izon.com>,
        Mike Rapoport <rppt@...ux.vnet.ibm.com>,
        Hillf Danton <hillf.zj@...baba-inc.com>,
        Shaohua Li <shli@...com>,
        Anshuman Khandual <khandual@...ux.vnet.ibm.com>,
        Andrea Arcangeli <aarcange@...hat.com>,
        David Rientjes <rientjes@...gle.com>,
        Rik van Riel <riel@...hat.com>, Jan Kara <jack@...e.cz>,
        Dave Jiang <dave.jiang@...el.com>,
        Jérôme Glisse <jglisse@...hat.com>,
        Matthew Wilcox <willy@...ux.intel.com>,
        Ross Zwisler <ross.zwisler@...ux.intel.com>,
        Hugh Dickins <hughd@...gle.com>, Tobin C Harding <me@...in.cc>,
        linux-kernel@...r.kernel.org, linux-mm@...ck.org
Subject: Re: [PATCH v2] mm: Reduce memory bloat with THP

On 1/19/18 4:49 AM, Michal Hocko wrote:
> On Thu 18-01-18 15:33:16, Nitin Gupta wrote:
>> From: Nitin Gupta <nitin.m.gupta@...cle.com>
>>
>> Currently, if the THP enabled policy is "always", or the mode
>> is "madvise" and a region is marked as MADV_HUGEPAGE, a hugepage
>> is allocated on a page fault if the pud or pmd is empty.  This
>> yields the best VA translation performance, but increases memory
>> consumption if some small page ranges within the huge page are
>> never accessed.
> 
> Yes, this is true but hardly unexpected for MADV_HUGEPAGE or THP always
> users.
>  

Yes, allocating a hugepage on first touch is the current behavior for
the above two cases. However, I see issues with this behavior.
Firstly, THP=always mode is often too aggressive/wasteful to be useful
for any realistic workloads. For THP=madvise, users may want to back
only the active parts of a memory region with hugepages while avoiding
aggressive hugepage allocation on first touch. Or, they may really want
the current behavior.

With this patch, users would have the option to pick what behavior they
want by passing hints to the kernel in the form of MADV_HUGEPAGE and
MADV_DONTNEED madvise calls.
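
As a rough illustration of the hint sequence described above, a minimal
sketch follows. The helper name thp_hint_space_efficient() is
hypothetical and not part of the patch; only the two madvise() advice
values come from the description. Note that on kernels without this
patch, MADV_DONTNEED only discards the pages backing the range and
carries no "space-efficient" meaning.

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <sys/mman.h>

/*
 * Hypothetical helper (not from the patch): opt a region in to THP, then
 * issue MADV_DONTNEED, which under the proposed patch would also mark the
 * region "space-efficient". MADV_DONTNEED drops the current contents of
 * the range, so call this before the region holds any data.
 * addr must be page-aligned. Returns 0 on success, -1 with errno set.
 */
int thp_hint_space_efficient(void *addr, size_t len)
{
    if (madvise(addr, len, MADV_HUGEPAGE) != 0)
        return -1;      /* e.g. EINVAL if the kernel lacks THP support */
    if (madvise(addr, len, MADV_DONTNEED) != 0)
        return -1;
    return 0;
}
```

An application wanting the current behavior would issue only the
MADV_HUGEPAGE call and skip the MADV_DONTNEED hint.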


>> An alternate behavior for such page faults is to install a
>> hugepage only when a region is actually found to be (almost)
>> fully mapped and active.  This is a compromise between
>> translation performance and memory consumption.  Currently there
>> is no way for an application to choose this compromise for the
>> page fault conditions above.
> 
> Is that really true? We have /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none
> This is not reflected during the PF of course but you can control the
> behavior there as well. Either by the global setting or a per process
> prctl.
> 

I think this part of the patch description needs some rewording. This
patch changes *only* the page fault behavior.

Once pages are installed, khugepaged does its job as usual, using
max_ptes_none and other config values. I'm not trying to change any
khugepaged behavior here.
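
For reference, max_ptes_none limits how many unmapped (none) PTEs
khugepaged will tolerate in a range when collapsing it into a hugepage.
A small sketch for reading the current value is below; the function name
is hypothetical, and the sysfs file exists only on kernels built with
THP support.

```c
#include <stdio.h>

/* Path quoted above; present only when the kernel is built with THP. */
#define MAX_PTES_NONE_PATH \
    "/sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none"

/* Returns the current max_ptes_none value, or -1 if it cannot be read. */
long read_max_ptes_none(void)
{
    long val = -1;
    FILE *f = fopen(MAX_PTES_NONE_PATH, "r");

    if (!f)
        return -1;
    if (fscanf(f, "%ld", &val) != 1)
        val = -1;
    fclose(f);
    return val;
}
```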


>> With this change, whenever an application issues MADV_DONTNEED on a
>> memory region, the region is marked as "space-efficient". For such
>> regions, a hugepage is not immediately allocated on first write.
> 
> Kirill didn't like it in the previous version and I do not like this
> either. You are adding a very subtle side effect which might be
> completely unexpected. Consider userspace memory allocator which uses
> MADV_DONTNEED to free up unused memory. Now you have put it out of THP
> usage basically.
>

Userspace may want a region to be considered by khugepaged while opting
out of hugepage allocation on first touch. Asking userspace memory
allocators to track and reclaim unused parts of a THP-allocated
hugepage does not seem right, as the kernel can use simple userspace
hints to avoid allocating extra memory in the first place.
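
To give a sense of what such tracking would involve, here is a minimal
sketch that counts how many base pages of a range are resident, using
mincore(). The helper name is hypothetical and this is only one possible
way to observe the over-allocation, not something the patch provides.

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Count resident base pages in [addr, addr + len).
 * addr must be page-aligned. Returns the count, or -1 on error.
 */
long count_resident_pages(void *addr, size_t len)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t npages = (len + page - 1) / page;
    unsigned char *vec = malloc(npages);
    long resident = 0;

    if (!vec)
        return -1;
    if (mincore(addr, len, vec) != 0) {
        free(vec);
        return -1;
    }
    for (size_t i = 0; i < npages; i++)
        resident += vec[i] & 1;     /* bit 0 set => page is resident */
    free(vec);
    return resident;
}
```

Touching a single byte of a fresh anonymous mapping and then calling
this helper shows how much memory the fault actually brought in. When
the fault is served with a hugepage, every base page covered by that
hugepage is reported as resident even though only one was accessed.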

I agree that this patch adds a subtle side-effect which may take
some applications by surprise. However, I often see the opposite too:
for many workloads, disabling THP is the first piece of advice given,
as this aggressive allocation of hugepages on first touch is unexpected
and too wasteful. For example:

1) Disabling THP for TokuDB (Storage engine for MySQL, MariaDB)
http://www.chriscalender.com/disabling-transparent-hugepages-for-tokudb/

2) Disable THP on MongoDB
https://docs.mongodb.com/manual/tutorial/transparent-huge-pages/

3) Disable THP for Couchbase Server
https://blog.couchbase.com/often-overlooked-linux-os-tweaks/

4) Redis
http://antirez.com/news/84
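
Guides like these typically start by checking the system-wide THP mode.
A minimal sketch of doing that from C is below. The sysfs file
/sys/kernel/mm/transparent_hugepage/enabled is not mentioned elsewhere
in this message; it lists the available modes with the active one in
brackets, e.g. "always [madvise] never". Both function names are
hypothetical.

```c
#include <stdio.h>
#include <string.h>

/*
 * Copy the bracketed token from a line such as "always [madvise] never"
 * into out. Returns 0 on success, -1 if there is no bracketed token or
 * it does not fit in out.
 */
int thp_parse_active_mode(const char *line, char *out, size_t outlen)
{
    const char *lb = strchr(line, '[');
    const char *rb = lb ? strchr(lb, ']') : NULL;
    size_t n;

    if (!rb)
        return -1;
    n = (size_t)(rb - lb - 1);
    if (n + 1 > outlen)
        return -1;
    memcpy(out, lb + 1, n);
    out[n] = '\0';
    return 0;
}

/* Read the system-wide THP mode; returns 0 on success, -1 otherwise. */
int thp_read_active_mode(char *out, size_t outlen)
{
    char line[128];
    FILE *f = fopen("/sys/kernel/mm/transparent_hugepage/enabled", "r");
    int ok;

    if (!f)
        return -1;      /* kernel built without THP, or sysfs missing */
    ok = fgets(line, sizeof(line), f) != NULL;
    fclose(f);
    return ok ? thp_parse_active_mode(line, out, outlen) : -1;
}
```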


> If the memory is used really scarce then we have MADV_NOHUGEPAGE.
> 

It's not really about memory scarcity but about more efficient use of
memory. Applications may want hugepage benefits without requiring any
changes to app code, which is what THP is supposed to provide, while
still avoiding memory bloat.

-Nitin