Message-ID: <20190201141733.GC4926@suse.de>
Date: Fri, 1 Feb 2019 14:17:33 +0000
From: Mel Gorman <mgorman@...e.de>
To: Andrea Arcangeli <aarcange@...hat.com>
Cc: lsf-pc@...ts.linux-foundation.org, linux-mm@...ck.org,
linux-kernel@...r.kernel.org, Peter Xu <peterx@...hat.com>,
Blake Caldwell <blake.caldwell@...orado.edu>,
Mike Rapoport <rppt@...ux.vnet.ibm.com>,
Mike Kravetz <mike.kravetz@...cle.com>,
Michal Hocko <mhocko@...nel.org>,
Vlastimil Babka <vbabka@...e.cz>,
David Rientjes <rientjes@...gle.com>
Subject: Re: [LSF/MM TOPIC] NUMA remote THP vs NUMA local non-THP under
 MADV_HUGEPAGE

On Tue, Jan 29, 2019 at 06:40:58PM -0500, Andrea Arcangeli wrote:
> I posted some benchmark results showing that for tasks without strong
> NUMA locality the __GFP_THISNODE logic is not guaranteed to be optimal
> (and that is even if we ignore the large slowdown from the swap storms
> that __GFP_THISNODE can cause at allocation time). The results also
> show that NUMA-remote THPs help both intra-socket and inter-socket.
>
> https://lkml.kernel.org/r/20181210044916.GC24097@redhat.com
> https://lkml.kernel.org/r/20181212104418.GE1130@redhat.com
>
> The following seems to be the interim conclusion, where I happen to be
> in agreement with Michal and Mel:
>
> https://lkml.kernel.org/r/20181212095051.GO1286@dhcp22.suse.cz
> https://lkml.kernel.org/r/20181212170016.GG1130@redhat.com
>
> Hopefully this specific issue will be hot-fixed before April (as we
> had to hot-fix it in the enterprise kernels to avoid the 3-year-old
> regression breaking large workloads that can't fit in a single NUMA
> node, and I assume other enterprise distributions will follow suit),
> but whatever hot-fix lands will likely leave ample margin for
> discussing what we can do better to optimize the decision between
> local non-THP and remote THP under MADV_HUGEPAGE.
>
> It is clear that forcing __GFP_THISNODE in the current code provides
> some minor advantage to apps using MADV_HUGEPAGE that can fit in a
> single NUMA node, but we should try to achieve that without major
> disadvantages to apps that can't fit in a single NUMA node.
>
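
To make the behaviour in question concrete, a minimal sketch
(illustrative only; thp_gfp_mask_sketch() is a made-up helper, not the
actual mm/huge_memory.c code):

static gfp_t thp_gfp_mask_sketch(bool madv_hugepage)
{
        gfp_t gfp = GFP_TRANSHUGE;

        /*
         * The behaviour under discussion: pin the THP fault to the
         * local node, so a failed local compaction degenerates into
         * local reclaim/swap instead of trying a remote node.
         */
        if (madv_hugepage)
                gfp |= __GFP_THISNODE;

        return gfp;
}
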
> For example it was mentioned that we could allocate readily available,
> already-free local 4k pages if local compaction fails and the
> watermarks still allow local 4k allocations without invoking reclaim,
> before invoking compaction on remote nodes. The same can be repeated
> at a second level with intra-socket non-THP memory before invoking
> compaction inter-socket. However we can't do things like that with
> the current page allocator workflow. It's possible that a larger
> change is required than just sending a single gfp bitflag down to the
> page allocator that creates an implicit MPOL_LOCAL binding, which
> would make it behave like the obsolete numa/zone reclaim behavior but
> weirdly applied only to THP allocations.
>
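
To make the proposed ordering concrete, a rough sketch with entirely
hypothetical helper names (none of these exist in the allocator today,
which is exactly the problem):

struct page *thp_fault_alloc_sketch(int local_nid)
{
        struct page *page;
        int nid;

        /* 1) try a local THP via compaction, without invoking reclaim */
        page = try_thp_compaction(local_nid);
        if (page)
                return page;

        /*
         * 2) take an already-free local 4k page if the watermarks
         *    allow it without invoking reclaim
         */
        if (node_4k_above_watermark(local_nid))
                return alloc_4k_nowait(local_nid);

        /*
         * 3) only now compact on intra-socket remote nodes; the same
         *    two steps would then repeat at the inter-socket level
         *    before falling back to 4k anywhere
         */
        for_each_intra_socket_node(nid, local_nid) {
                page = try_thp_compaction(nid);
                if (page)
                        return page;
        }

        return NULL;
}
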
I would also be interested in discussing this topic. My activity is
mostly compaction-related, but I believe it will evolve into something
that returns saner data to the page allocator. That should make it a
bit easier to detect when local compaction fails, and easier to improve
the page allocator workflow without throwing another workload under
the bus.
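
As a rough illustration of what saner data could mean, something along
these lines (made-up enum and helper; the real interface is still to be
worked out):

enum compact_feedback {
        COMPACT_SKETCH_SUCCESS,     /* the high-order page is there */
        COMPACT_SKETCH_RETRY_LOCAL, /* transient: contended or deferred */
        COMPACT_SKETCH_HOPELESS,    /* not enough free/movable memory */
};

static bool should_try_remote_thp(enum compact_feedback fb)
{
        /*
         * Go remote only when the local node genuinely cannot produce
         * a huge page, not when it is merely contended.
         */
        return fb == COMPACT_SKETCH_HOPELESS;
}
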
--
Mel Gorman
SUSE Labs