linux-kernel - Re: [RFC] mm: Proactive compaction

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <4b8b0cd5d7a246e9db1e1dd9b3bae7860d7ca2c0.camel@nvidia.com>
Date:   Mon, 16 Sep 2019 20:50:43 +0000
From:   Nitin Gupta <nigupta@...dia.com>
To:     "rientjes@...gle.com" <rientjes@...gle.com>
CC:     "keescook@...omium.org" <keescook@...omium.org>,
        "willy@...radead.org" <willy@...radead.org>,
        "vbabka@...e.cz" <vbabka@...e.cz>,
        "aryabinin@...tuozzo.com" <aryabinin@...tuozzo.com>,
        "akpm@...ux-foundation.org" <akpm@...ux-foundation.org>,
        "hannes@...xchg.org" <hannes@...xchg.org>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        "linux-mm@...ck.org" <linux-mm@...ck.org>,
        "cai@....pw" <cai@....pw>,
        "arunks@...eaurora.org" <arunks@...eaurora.org>,
        "janne.huttunen@...ia.com" <janne.huttunen@...ia.com>,
        "jannh@...gle.com" <jannh@...gle.com>,
        "yuzhao@...gle.com" <yuzhao@...gle.com>,
        "mhocko@...e.com" <mhocko@...e.com>,
        "gregkh@...uxfoundation.org" <gregkh@...uxfoundation.org>,
        "guro@...com" <guro@...com>,
        "mgorman@...hsingularity.net" <mgorman@...hsingularity.net>,
        "dan.j.williams@...el.com" <dan.j.williams@...el.com>,
        "khlebnikov@...dex-team.ru" <khlebnikov@...dex-team.ru>
Subject: Re: [RFC] mm: Proactive compaction

On Mon, 2019-09-16 at 13:16 -0700, David Rientjes wrote:
> On Fri, 16 Aug 2019, Nitin Gupta wrote:
> 
> > For some applications we need to allocate almost all memory as
> > hugepages. However, on a running system, higher order allocations can
> > fail if the memory is fragmented. Linux kernel currently does
> > on-demand compaction as we request more hugepages but this style of
> > compaction incurs very high latency. Experiments with one-time full
> > memory compaction (followed by hugepage allocations) shows that kernel
> > is able to restore a highly fragmented memory state to a fairly
> > compacted memory state within <1 sec for a 32G system. Such data
> > suggests that a more proactive compaction can help us allocate a large
> > fraction of memory as hugepages keeping allocation latencies low.
> > 
> > For a more proactive compaction, the approach taken here is to define
> > per page-order external fragmentation thresholds and let kcompactd
> > threads act on these thresholds.
> > 
> > The low and high thresholds are defined per page-order and exposed
> > through sysfs:
> > 
> >   /sys/kernel/mm/compaction/order-[1..MAX_ORDER]/extfrag_{low,high}
> > 
> > Per-node kcompactd thread is woken up every few seconds to check if
> > any zone on its node has extfrag above the extfrag_high threshold for
> > any order, in which case the thread starts compaction in the backgrond
> > till all zones are below extfrag_low level for all orders. By default
> > both these thresolds are set to 100 for all orders which essentially
> > disables kcompactd.
> > 
> > To avoid wasting CPU cycles when compaction cannot help, such as when
> > memory is full, we check both, extfrag > extfrag_high and
> > compaction_suitable(zone). This allows kcomapctd thread to stays inactive
> > even if extfrag thresholds are not met.
> > 
> > This patch is largely based on ideas from Michal Hocko posted here:
> > https://lore.kernel.org/linux-mm/20161230131412.GI13301@dhcp22.suse.cz/
> > 
> > Testing done (on x86):
> >  - Set /sys/kernel/mm/compaction/order-9/extfrag_{low,high} = {25, 30}
> >  respectively.
> >  - Use a test program to fragment memory: the program allocates all memory
> >  and then for each 2M aligned section, frees 3/4 of base pages using
> >  munmap.
> >  - kcompactd0 detects fragmentation for order-9 > extfrag_high and starts
> >  compaction till extfrag < extfrag_low for order-9.
> > 
> > The patch has plenty of rough edges but posting it early to see if I'm
> > going in the right direction and to get some early feedback.
> > 
> 
> Is there an update to this proposal or non-RFC patch that has been posted 
> for proactive compaction?
> 
> We've had good success with periodically compacting memory on a regular 
> cadence on systems with hugepages enabled.  The cadence itself is defined 
> by the admin but it causes khugepaged[*] to periodically wakeup and invoke 
> compaction in an attempt to keep zones as defragmented as possible 
> (perhaps more "proactive" than what is proposed here in an attempt to keep 
> all memory as unfragmented as possible regardless of extfrag thresholds).  
> It also avoids corner-cases where kcompactd could become more expensive 
> than what is anticipated because it is unsuccessful at compacting memory 
> yet the extfrag threshold is still exceeded.
> 
>  [*] Khugepaged instead of kcompactd only because this is only enabled
>      for systems where transparent hugepages are enabled, probably better
>      off in kcompactd to avoid duplicating work between two kthreads if
>      there is already a need for background compaction.
> 


Discussion on this RFC patch revolved around the issue of exposing too
many tunables (per-node, per-order, [low-high] extfrag thresholds). It
was sort-of concluded that no admin will get these tunables right for
a variety of workloads.

To eliminate the need for tunables, I proposed another patch:

https://patchwork.kernel.org/patch/11140067/

which does not add any tunables but extends and exports an existing
function (compact_zone_order). In summary, this new patch adds a
callback function which allows any driver to implement ad-hoc
compaction policies. There is also a sample driver which makes use
of this interface to keep hugepage external fragmentation within
specified range (exposed through debugfs):

https://gitlab.com/nigupta/linux/snippets/1894161

-Nitin