Message-ID: <alpine.DEB.2.21.1810221346130.120157@chino.kir.corp.google.com>
Date: Mon, 22 Oct 2018 13:54:33 -0700 (PDT)
From: David Rientjes <rientjes@...gle.com>
To: Andrea Arcangeli <aarcange@...hat.com>
cc: Andrew Morton <akpm@...ux-foundation.org>,
Michal Hocko <mhocko@...nel.org>, Mel Gorman <mgorman@...e.de>,
Vlastimil Babka <vbabka@...e.cz>,
Andrea Argangeli <andrea@...nel.org>,
Zi Yan <zi.yan@...rutgers.edu>,
Stefan Priebe - Profihost AG <s.priebe@...fihost.ag>,
"Kirill A. Shutemov" <kirill@...temov.name>, linux-mm@...ck.org,
LKML <linux-kernel@...r.kernel.org>,
Stable tree <stable@...r.kernel.org>
Subject: Re: [PATCH 1/2] mm: thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings

On Mon, 15 Oct 2018, Andrea Arcangeli wrote:
> > On Mon, 15 Oct 2018 15:30:17 -0700 (PDT) David Rientjes <rientjes@...gle.com> wrote:
> > > Would it be possible to test with my
> > > patch[*] that does not try reclaim to address the thrashing issue?
> >
> > Yes please.
>
> It'd also be great if a testcase reproducing the 40% higher access
> latency (with the one liner original fix) was available.
>
I never said 40% higher access latency, I said 40% higher fault latency.
The higher access latency is 13.9% as measured on Haswell.

The test case is rather trivial: fragment all memory with order-4 memory
to replicate a fragmented local zone, use sched_setaffinity() to bind to
that node, and fault a reasonable number of hugepages (128MB, 256MB,
whatever). The cost of faulting remotely in this case was measured to be
40% higher than falling back to local small pages. This occurs quite
obviously because you are thrashing the remote node trying to allocate
thp.
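
For illustration only, the binding and faulting half of such a test might
be sketched in C as below. This is not the test that produced the 40%
figure. The helper names (bind_to_cpu, time_thp_faults) and the 2MB
hugepage size are assumptions, and the step that fragments the local zone
is left out entirely, so on an unfragmented or non-NUMA machine the number
it reports says nothing about remote-fault cost.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <time.h>

#define HPAGE_SIZE (2UL << 20) /* assumption: 2MB THP, as on x86-64 */

/* Hypothetical helper: pin the calling thread to one CPU so that its
 * page faults prefer that CPU's local node. Returns 0 on success. */
int bind_to_cpu(int cpu)
{
    cpu_set_t set;

    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return sched_setaffinity(0, sizeof(set), &set);
}

/* Hypothetical helper: map len bytes (a multiple of HPAGE_SIZE), hint
 * MADV_HUGEPAGE, and touch each hugepage-sized chunk once.
 * Returns the elapsed time in nanoseconds, or -1 if mmap() fails. */
long long time_thp_faults(size_t len)
{
    struct timespec t0, t1;
    char *raw, *start;
    size_t off;

    /* Over-allocate by one hugepage so the range can be aligned. */
    raw = mmap(NULL, len + HPAGE_SIZE, PROT_READ | PROT_WRITE,
               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (raw == MAP_FAILED)
        return -1;
    start = (char *)(((uintptr_t)raw + HPAGE_SIZE - 1) &
                     ~(uintptr_t)(HPAGE_SIZE - 1));

    /* Advisory only: the return value is ignored because the kernel
     * may lack THP support or fall back to small pages. */
    madvise(start, len, MADV_HUGEPAGE);

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (off = 0; off < len; off += HPAGE_SIZE)
        start[off] = 1; /* first touch triggers the page fault */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    munmap(raw, len + HPAGE_SIZE);
    return (t1.tv_sec - t0.tv_sec) * 1000000000LL +
           (t1.tv_nsec - t0.tv_nsec);
}
```

A caller would invoke bind_to_cpu() with a CPU on the fragmented node and
then time_thp_faults(128UL << 20), comparing the result between kernels
with and without __GFP_THISNODE for the MADV_HUGEPAGE fault path.
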
> We don't have a testcase for David's 40% latency increase problem, but
> that's likely to only happen when the system is somewhat low on memory
> globally.
Well, yes, but that's most of our systems. We can't keep around gigabytes
of memory free just to work around this patch. Removing __GFP_THISNODE to
avoid thrashing the local node obviously will incur a substantial
performance degradation if you thrash the remote node as well. This
should be rather straightforward.
> When there's 75% or more of the RAM free (not even allocated as easily
> reclaimable pagecache) globally, you don't expect to hit heavy
> swapping.
>
I agree there is no regression introduced by your patch when 75% of memory
is free.
> The 40% THP allocation latency increase if you use MADV_HUGEPAGE in
> such window where all remote zones are fully fragmented is somehow
> lesser of a concern in my view (plus there's the compact deferred
> logic that should mitigate that scenario). Furthermore it is only a
> concern for page faults in MADV_HUGEPAGE ranges. If MADV_HUGEPAGE is
> set the userland allocation is long lived, so such higher allocation
> latency won't risk to hit short lived allocations that don't set
> MADV_HUGEPAGE (unless madvise=always, but that's not the default
> precisely because not all allocations are long lived).
>
> If the MADV_HUGEPAGE using library was freely available it'd also be
> nice.
>
You scan your mappings for .text segments, map a hugepage-aligned region
sufficient in size, mremap() to that region, and do MADV_HUGEPAGE.
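
A minimal C sketch of those steps follows, under loudly stated
assumptions: it is not the library in question, the function names are
hypothetical, a 2MB hugepage size is assumed, and it moves an ordinary
page-aligned mapping rather than the live .text of the running process
(relocating code that is currently executing needs extra care that is not
shown here).

```c
#define _GNU_SOURCE
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#define HPAGE_SIZE (2UL << 20) /* assumption: 2MB transparent hugepages */

/* Count private executable mappings ("r-xp") in /proc/self/maps.
 * Returns -1 if the file cannot be opened. */
int count_exec_mappings(void)
{
    FILE *f = fopen("/proc/self/maps", "r");
    char line[512], perms[5];
    unsigned long start, end;
    int n = 0;

    if (!f)
        return -1;
    while (fgets(line, sizeof(line), f))
        if (sscanf(line, "%lx-%lx %4s", &start, &end, perms) == 3 &&
            strcmp(perms, "r-xp") == 0)
            n++;
    fclose(f);
    return n;
}

/* Reserve a hugepage-aligned, inaccessible address range of len bytes
 * (len must be a multiple of the base page size). NULL on failure. */
void *reserve_aligned(size_t len)
{
    char *raw, *aligned;

    raw = mmap(NULL, len + HPAGE_SIZE, PROT_NONE,
               MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    if (raw == MAP_FAILED)
        return NULL;
    aligned = (char *)(((uintptr_t)raw + HPAGE_SIZE - 1) &
                       ~(uintptr_t)(HPAGE_SIZE - 1));
    /* Trim the unaligned head and the unused tail of the reservation. */
    if (aligned > raw)
        munmap(raw, aligned - raw);
    munmap(aligned + len, (raw + HPAGE_SIZE) - aligned);
    return aligned;
}

/* Move the page-aligned mapping [old, old + len) onto a hugepage-aligned
 * range and hint MADV_HUGEPAGE. Returns the new address, NULL on error. */
void *move_to_hugepage_region(void *old, size_t len)
{
    void *dst, *p;

    dst = reserve_aligned(len);
    if (!dst)
        return NULL;
    /* MREMAP_FIXED replaces the reservation at dst with old's pages. */
    p = mremap(old, len, len, MREMAP_MAYMOVE | MREMAP_FIXED, dst);
    if (p == MAP_FAILED) {
        munmap(dst, len);
        return NULL;
    }
    madvise(p, len, MADV_HUGEPAGE); /* advisory; result ignored */
    return p;
}
```

The over-allocate-and-trim approach in reserve_aligned() is one common
way to obtain an aligned range from mmap(); a real implementation would
additionally record the start and end of each executable mapping instead
of only counting them.
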