linux-kernel - Re: [PATCH v2] mm: Reduce memory bloat with THP

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20180125211303.rbfeg7ultwr6hpd3@suse.de>
Date:   Thu, 25 Jan 2018 21:13:03 +0000
From:   Mel Gorman <mgorman@...e.de>
To:     Nitin Gupta <nitin.m.gupta@...cle.com>
Cc:     Zi Yan <zi.yan@...rutgers.edu>, Michal Hocko <mhocko@...nel.org>,
        Nitin Gupta <nitingupta910@...il.com>,
        steven.sistare@...cle.com,
        Andrew Morton <akpm@...ux-foundation.org>,
        Ingo Molnar <mingo@...nel.org>, Nadav Amit <namit@...are.com>,
        Minchan Kim <minchan@...nel.org>,
        "Kirill A. Shutemov" <kirill.shutemov@...ux.intel.com>,
        Peter Zijlstra <peterz@...radead.org>,
        Vegard Nossum <vegard.nossum@...cle.com>,
        "Levin, Alexander" <alexander.levin@...izon.com>,
        Mike Rapoport <rppt@...ux.vnet.ibm.com>,
        Hillf Danton <hillf.zj@...baba-inc.com>,
        Shaohua Li <shli@...com>,
        Anshuman Khandual <khandual@...ux.vnet.ibm.com>,
        Andrea Arcangeli <aarcange@...hat.com>,
        David Rientjes <rientjes@...gle.com>,
        Rik van Riel <riel@...hat.com>, Jan Kara <jack@...e.cz>,
        Dave Jiang <dave.jiang@...el.com>,
        J?r?me Glisse <jglisse@...hat.com>,
        Matthew Wilcox <willy@...ux.intel.com>,
        Ross Zwisler <ross.zwisler@...ux.intel.com>,
        Hugh Dickins <hughd@...gle.com>, Tobin C Harding <me@...in.cc>,
        linux-kernel@...r.kernel.org, linux-mm@...ck.org
Subject: Re: [PATCH v2] mm: Reduce memory bloat with THP

On Thu, Jan 25, 2018 at 11:41:03AM -0800, Nitin Gupta wrote:
> >> It's not really about memory scarcity but a more efficient use of it.
> >> Applications may want hugepage benefits without requiring any changes to
> >> app code which is what THP is supposed to provide, while still avoiding
> >> memory bloat.
> >>
> > I read these links and find that there are mainly two complains:
> > 1. THP causes latency spikes, because direction compaction slows down THP allocation,
> > 2. THP bloats memory footprint when jemalloc uses MADV_DONTNEED to return memory ranges smaller than
> >    THP size and fails because of THP.
> >
> > The first complain is not related to this patch.
> 
> I'm trying to address many different THP issues and memory bloat is
> first among them.

Expecting userspace to get this right is probably going to go sideways.
It'll be screwed up and be sub-optimal or have odd semantics for existing
madvise flags. The fact is that an application may not even know if it's
going to be sparsely using memory in advance if it's a computation load
modelling from unknown input data.

I suggest you read the old Talluri paper "Superpassing the TLB Performance
of Superpages with Less Operating System Support" and pay attention to
Section 4. There it discusses a page reservation scheme whereby on fault
a naturally aligned set of base pages are reserved and only one correctly
placed base page is inserted into the faulting address. It was tied into
a hypothetical piece of hardware that doesn't exist to give best-effort
support for superpages so it does not directly help you but the initial
idea is sound. There are holes in the paper from todays perspective but
it was written in the 90's.

>From there, read "Transparent operating system support for superpages"
by Navarro, particularly chapter 4 paying attention to the parts where
it talks about opportunism and promotion threshold.

Superficially, it goes like this

1. On fault, reserve a THP in the allocator and use one base page that
   is correctly-aligned for the faulting addresses. By correctly-aligned,
   I mean that you use base page whose offset would be naturally contiguous
   if it ever was part of a huge page.
2. On subsequent faults, attempt to use a base page that is naturally
   aligned to be a THP
3. When a "threshold" of base pages are inserted, allocate the remaining
   pages and promote it to a THP
4. If there is memory pressure, spill "reserved" pages into the main
   allocation pool and lose the opportunity to promote (which will need
   khugepaged to recover)

By definition, a promotion threshold of 1 would be the existing scheme
of allocation a THP on the first fault and some users will want that. It
also should be the default to avoid unexpected overhead.  For workloads
where memory is being sparsely addressed and the increased overhead of
THP is unwelcome then the threshold should be tuned higher with a maximum
possible value of HPAGE_PMD_NR.

It's non-trivial to do this because at minimum a page fault has to check
if there is a potential promotion candidate by checking the PTEs around
the faulting address searching for a correctly-aligned base page that is
already inserted. If there is, then check if the correctly aligned base
page for the current faulting address is free and if so use it. It'll
also then need to check the remaining PTEs to see if both the promotion
threshold has been reached and if so, promote it to a THP (or else teach
khugepaged to do an in-place promotion if possible). In other words,
implementing the promotion threshold is both hard and it's not free.

However, if it did exist then the only tunable would be the "promotion
threshold" and applications would not need any special awareness of their
address space.

-- 
Mel Gorman
SUSE Labs