Message-ID: <4b8b0cd5d7a246e9db1e1dd9b3bae7860d7ca2c0.camel@nvidia.com>
Date: Mon, 16 Sep 2019 20:50:43 +0000
From: Nitin Gupta <nigupta@...dia.com>
To: "rientjes@...gle.com" <rientjes@...gle.com>
CC: "keescook@...omium.org" <keescook@...omium.org>,
"willy@...radead.org" <willy@...radead.org>,
"vbabka@...e.cz" <vbabka@...e.cz>,
"aryabinin@...tuozzo.com" <aryabinin@...tuozzo.com>,
"akpm@...ux-foundation.org" <akpm@...ux-foundation.org>,
"hannes@...xchg.org" <hannes@...xchg.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"linux-mm@...ck.org" <linux-mm@...ck.org>,
"cai@....pw" <cai@....pw>,
"arunks@...eaurora.org" <arunks@...eaurora.org>,
"janne.huttunen@...ia.com" <janne.huttunen@...ia.com>,
"jannh@...gle.com" <jannh@...gle.com>,
"yuzhao@...gle.com" <yuzhao@...gle.com>,
"mhocko@...e.com" <mhocko@...e.com>,
"gregkh@...uxfoundation.org" <gregkh@...uxfoundation.org>,
"guro@...com" <guro@...com>,
"mgorman@...hsingularity.net" <mgorman@...hsingularity.net>,
"dan.j.williams@...el.com" <dan.j.williams@...el.com>,
"khlebnikov@...dex-team.ru" <khlebnikov@...dex-team.ru>
Subject: Re: [RFC] mm: Proactive compaction
On Mon, 2019-09-16 at 13:16 -0700, David Rientjes wrote:
> On Fri, 16 Aug 2019, Nitin Gupta wrote:
>
> > For some applications we need to allocate almost all memory as
> > hugepages. However, on a running system, higher order allocations can
> > fail if the memory is fragmented. Linux kernel currently does
> > on-demand compaction as we request more hugepages but this style of
> > compaction incurs very high latency. Experiments with one-time full
> > memory compaction (followed by hugepage allocations) show that the kernel
> > is able to restore a highly fragmented memory state to a fairly
> > compacted memory state within <1 sec for a 32G system. Such data
> > suggests that a more proactive compaction can help us allocate a large
> > fraction of memory as hugepages keeping allocation latencies low.
> >
> > For a more proactive compaction, the approach taken here is to define
> > per page-order external fragmentation thresholds and let kcompactd
> > threads act on these thresholds.
> >
> > The low and high thresholds are defined per page-order and exposed
> > through sysfs:
> >
> > /sys/kernel/mm/compaction/order-[1..MAX_ORDER]/extfrag_{low,high}
> >
> > Per-node kcompactd thread is woken up every few seconds to check if
> > any zone on its node has extfrag above the extfrag_high threshold for
> > any order, in which case the thread starts compaction in the background
> > till all zones are below extfrag_low level for all orders. By default
> > both these thresholds are set to 100 for all orders which essentially
> > disables kcompactd.
> >
> > To avoid wasting CPU cycles when compaction cannot help, such as when
> > memory is full, we check both, extfrag > extfrag_high and
> > compaction_suitable(zone). This allows the kcompactd thread to stay inactive
> > even if extfrag thresholds are not met.
> >
> > This patch is largely based on ideas from Michal Hocko posted here:
> > https://lore.kernel.org/linux-mm/20161230131412.GI13301@dhcp22.suse.cz/
> >
> > Testing done (on x86):
> > - Set /sys/kernel/mm/compaction/order-9/extfrag_{low,high} = {25, 30}
> > respectively.
> > - Use a test program to fragment memory: the program allocates all memory
> > and then for each 2M aligned section, frees 3/4 of base pages using
> > munmap.
> > - kcompactd0 detects fragmentation for order-9 > extfrag_high and starts
> > compaction till extfrag < extfrag_low for order-9.
> >
> > The patch has plenty of rough edges but posting it early to see if I'm
> > going in the right direction and to get some early feedback.
> >
>
> Is there an update to this proposal or non-RFC patch that has been posted
> for proactive compaction?
>
> We've had good success with periodically compacting memory on a regular
> cadence on systems with hugepages enabled. The cadence itself is defined
> by the admin but it causes khugepaged[*] to periodically wake up and invoke
> compaction in an attempt to keep zones as defragmented as possible
> (perhaps more "proactive" than what is proposed here in an attempt to keep
> all memory as unfragmented as possible regardless of extfrag thresholds).
> It also avoids corner-cases where kcompactd could become more expensive
> than what is anticipated because it is unsuccessful at compacting memory
> yet the extfrag threshold is still exceeded.
>
> [*] Khugepaged instead of kcompactd only because this is only enabled
> for systems where transparent hugepages are enabled, probably better
> off in kcompactd to avoid duplicating work between two kthreads if
> there is already a need for background compaction.
>
Discussion on this RFC patch revolved around the issue of exposing too
many tunables (per-node, per-order, [low-high] extfrag thresholds). The
rough conclusion was that no admin would get these tunables right across
a variety of workloads.
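
To make the tunables concrete: for a single order, the low/high pair
amounts to a hysteresis loop. Below is a minimal userspace sketch, not
code from the RFC. The names (struct extfrag_thresholds,
extfrag_for_order, should_compact), the NR_ORDERS value and the
percentage definition of extfrag are all assumptions made for
illustration.

```c
#include <stdbool.h>

#define NR_ORDERS 11 /* assumption: orders 0..10 */

/* Hypothetical per-order tunables mirroring extfrag_{low,high} (0..100). */
struct extfrag_thresholds {
	unsigned int low;
	unsigned int high;
};

/*
 * Assumed metric: percentage of free pages that sit in free blocks
 * smaller than 'order' and so cannot satisfy an allocation of that order.
 * free_blocks[i] is the number of free blocks of order i.
 */
static unsigned int extfrag_for_order(const unsigned long free_blocks[NR_ORDERS],
				      unsigned int order)
{
	unsigned long total = 0, unusable = 0;

	for (unsigned int i = 0; i < NR_ORDERS; i++) {
		unsigned long pages = free_blocks[i] << i;

		total += pages;
		if (i < order)
			unusable += pages;
	}
	return total ? (unsigned int)(unusable * 100 / total) : 0;
}

/*
 * Hysteresis: start once extfrag rises above 'high', then keep going
 * until it drops below 'low'. 'active' carries the state from the
 * previous periodic check.
 */
static bool should_compact(const struct extfrag_thresholds *t,
			   unsigned int extfrag, bool active)
{
	if (active)
		return extfrag >= t->low;
	return extfrag > t->high;
}
```

With low = high = 100, should_compact() never returns true from the
idle state, which matches the "disabled by default" behaviour described
in the RFC. Multiply this pair by every order and every node and the
size of the tuning problem becomes clear.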
To eliminate the need for tunables, I proposed another patch:
https://patchwork.kernel.org/patch/11140067/
which does not add any tunables but extends and exports an existing
function (compact_zone_order). In summary, this new patch adds a
callback function which allows any driver to implement ad-hoc
compaction policies. There is also a sample driver which makes use
of this interface to keep hugepage external fragmentation within
a specified range (exposed through debugfs):
https://gitlab.com/nigupta/linux/snippets/1894161
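
The general shape of a callback-driven policy can be sketched in
userspace as follows. This is only an illustration of the idea, not the
interface from the patch or the sample driver: struct mock_zone,
compact_policy_fn, range_policy and mock_compact_zone are hypothetical
names, and the "each pass lowers extfrag by a fixed step" model of
progress is made up.

```c
#include <stdbool.h>

/* Hypothetical stand-in for a zone; it only tracks a fragmentation score. */
struct mock_zone {
	unsigned int extfrag; /* 0..100, assumed metric */
};

/* Hypothetical callback type: return true to keep compacting. */
typedef bool (*compact_policy_fn)(const struct mock_zone *zone, void *data);

/* Example policy state: stop once extfrag is at or below the target. */
struct range_policy {
	unsigned int target;
};

static bool range_policy_continue(const struct mock_zone *zone, void *data)
{
	const struct range_policy *p = data;

	return zone->extfrag > p->target;
}

/*
 * Mock compaction loop: each pass lowers extfrag by 'step' (a made-up
 * model of progress) and asks the callback whether to go on.
 * Returns the number of passes performed.
 */
static unsigned int mock_compact_zone(struct mock_zone *zone, unsigned int step,
				      compact_policy_fn keep_going, void *data)
{
	unsigned int passes = 0;

	if (step == 0)
		return 0; /* no progress possible in this model */

	while (zone->extfrag > 0 && keep_going(zone, data)) {
		zone->extfrag = zone->extfrag > step ? zone->extfrag - step : 0;
		passes++;
	}
	return passes;
}
```

The point of this arrangement is that the decision of when to stop
lives in the caller's callback rather than in a fixed set of sysfs
knobs, so different drivers can apply different policies to the same
compaction routine.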
-Nitin