linux-kernel - Re: [PATCH] mm: Reduce memory bloat with THP

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <d3e77b2c-2164-743d-4f88-527091790006@oracle.com>
Date:   Fri, 15 Dec 2017 23:04:03 -0800
From:   Nitin Gupta <nitin.m.gupta@...cle.com>
To:     "Kirill A. Shutemov" <kirill@...temov.name>
Cc:     linux-mm@...ck.org, steven.sistare@...cle.com,
        Andrew Morton <akpm@...ux-foundation.org>,
        Ingo Molnar <mingo@...nel.org>, Mel Gorman <mgorman@...e.de>,
        Nadav Amit <namit@...are.com>,
        Minchan Kim <minchan@...nel.org>,
        "Kirill A. Shutemov" <kirill.shutemov@...ux.intel.com>,
        Peter Zijlstra <peterz@...radead.org>,
        Vegard Nossum <vegard.nossum@...cle.com>,
        "Levin, Alexander (Sasha Levin)" <alexander.levin@...izon.com>,
        Michal Hocko <mhocko@...e.com>,
        David Rientjes <rientjes@...gle.com>,
        Vlastimil Babka <vbabka@...e.cz>,
        SeongJae Park <sj38.park@...il.com>, Shaohua Li <shli@...com>,
        "Aneesh Kumar K.V" <aneesh.kumar@...ux.vnet.ibm.com>,
        Andrea Arcangeli <aarcange@...hat.com>,
        Mike Rapoport <rppt@...ux.vnet.ibm.com>,
        Anshuman Khandual <khandual@...ux.vnet.ibm.com>,
        Rik van Riel <riel@...hat.com>,
        Ross Zwisler <ross.zwisler@...ux.intel.com>,
        Jan Kara <jack@...e.cz>, Dave Jiang <dave.jiang@...el.com>,
        Jérôme Glisse <jglisse@...hat.com>,
        Matthew Wilcox <willy@...ux.intel.com>,
        Hugh Dickins <hughd@...gle.com>, Tobin C Harding <me@...in.cc>,
        open list <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH] mm: Reduce memory bloat with THP

On 12/15/17 2:00 AM, Kirill A. Shutemov wrote:
> On Thu, Dec 14, 2017 at 05:28:52PM -0800, Nitin Gupta wrote:
>> Currently, if the THP enabled policy is "always", or the mode
>> is "madvise" and a region is marked as MADV_HUGEPAGE, a hugepage
>> is allocated on a page fault if the pud or pmd is empty.  This
>> yields the best VA translation performance, but increases memory
>> consumption if some small page ranges within the huge page are
>> never accessed.
>>
>> An alternate behavior for such page faults is to install a
>> hugepage only when a region is actually found to be (almost)
>> fully mapped and active.  This is a compromise between
>> translation performance and memory consumption.  Currently there
>> is no way for an application to choose this compromise for the
>> page fault conditions above.
>>
>> With this change, when an application issues MADV_DONTNEED on a
>> memory region, the region is marked as "space-efficient". For
>> such regions, a hugepage is not immediately allocated on first
>> write.  Instead, it is left to the khugepaged thread to do
>> delayed hugepage promotion depending on whether the region is
>> actually mapped and active. When application issues
>> MADV_HUGEPAGE, the region is marked again as non-space-efficient
>> wherein hugepage is allocated on first touch.
> 
> I think this would be NAK. At least in this form.
> 
> What performance testing have you done? Any numbers?
> 

I wrote a throw-away code which mmaps 128G area and writes to a random
address in a loop. Together with writes, madvise(MADV_DONTNEED) are
issued at another random addresses. Writes are issued with 70%
probability and DONTNEED with 30%. With this test, I'm trying to emulate
workload of a large in-memory hash-table.

With the patch, I see that memory bloat is much less severe.
I've uploaded the test program with the memory usage plot here:

https://gist.github.com/nitingupta910/42ddf969e17556d74a14fbd84640ddb3

THP was set to 'always' mode in both cases but the result would be the
same if madvise mode was used instead.

> Making whole vma "space_efficient" just because somebody freed one page
> from it is just wrong. And there's no way back after this.
>

I'm using MADV_DONTNEED as a hint that although user wants to
transparently use hugepages but at the same time wants to be more
conservative with respect to memory usage. If a MADV_HUGEPAGE is issued
for a VMA range after any DONTNEEDs then the space_efficient bit is
again cleared, so we revert back to allocating hugepage on fault on
empty pud/pmd.

>>
>> Orabug: 26910556
> 
> Wat?
> 

It's oracle internal identifier used to track this work.

Thanks,
Nitin