Message-ID: <20131213192014.GL11295@suse.de>
Date: Fri, 13 Dec 2013 19:20:14 +0000
From: Mel Gorman <mgorman@...e.de>
To: Johannes Weiner <hannes@...xchg.org>
Cc: Andrew Morton <akpm@...ux-foundation.org>,
Dave Hansen <dave.hansen@...el.com>,
Rik van Riel <riel@...hat.com>,
Linux-MM <linux-mm@...ck.org>,
LKML <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH 7/7] mm: page_alloc: Default allow file pages to use
remote nodes for fair allocation policy
On Fri, Dec 13, 2013 at 12:04:43PM -0500, Johannes Weiner wrote:
> On Fri, Dec 13, 2013 at 02:10:07PM +0000, Mel Gorman wrote:
> > Indications from Johannes that he wanted this. Needs some data and/or justification why
> > thrash protection needs it plus docs describing how MPOL_LOCAL is now different before
> > it should be considered finished. I do not necessarily agree this patch is necessary
> > but it's worth punting it out there for discussion and testing.
>
> I demonstrated enormous gains in the original submission of the fair
> allocation patch and
And the same test missed that it broke MPOL_DEFAULT and regressed any workload
that does not hit reclaim by incurring remote accesses unnecessarily. With
this patch applied, MPOL_DEFAULT again does not act as documented by
Documentation/vm/numa_memory_policy.txt and that file has been around a
long time. It also does not match the documented behaviour of mbind
where it says

    The system-wide default policy allocates pages on the node of
    the CPU that triggers the allocation. For MPOL_DEFAULT, the nodemask
    and maxnode arguments must be specify the empty set of nodes.

That said, that documentation is also strictly wrong as MPOL_DEFAULT *may*
allocate on remote nodes.
> your tests haven't really shown downsides to the
> cache-over-nodes portion of it.
> the cache-over-nodes fairness without any supporting data.
>
It breaks MPOL_LOCAL for file-backed mappings in a manner that cannot be
overridden by policies and it is not even documented. The same effect
could have been achieved for workloads that repeatedly read files by
running the processes with the MPOL_INTERLEAVE policy. There was also no convenient
way for a user to override that behaviour. Hard-binding to a node would
work but tough luck if the process needs more than one node of memory.
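
For reference, a minimal sketch of how a process could opt itself into
MPOL_INTERLEAVE before it starts reading files. It is Linux-only and
illustrative: the ex_* names are hypothetical helpers, the mode value is
copied from <linux/mempolicy.h> so that libnuma's <numaif.h> is not
needed, and the mask must only name nodes that exist on the machine or
the call will fail.

```c
#include <assert.h>
#include <limits.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Value of MPOL_INTERLEAVE in <linux/mempolicy.h>, defined locally so the
 * sketch does not depend on libnuma's <numaif.h>. */
#define EX_MPOL_INTERLEAVE 3

/* Number of node bits that fit in the single-word mask used here. */
#define EX_MASK_BITS ((unsigned long)(sizeof(unsigned long) * CHAR_BIT))

/* Hypothetical helper: mask with bits 0..nnodes-1 set. */
unsigned long ex_first_nodes_mask(unsigned int nnodes)
{
	assert(nnodes <= EX_MASK_BITS);
	if (nnodes == EX_MASK_BITS)
		return ~0UL;
	return (1UL << nnodes) - 1;
}

/* Hypothetical helper: ask the kernel to interleave this task's future
 * allocations over the nodes in nodemask. Returns 0 on success, or -1
 * with errno set (for example on a non-NUMA kernel or for a node that
 * does not exist). maxnode is passed as bits + 1, following the common
 * convention used by callers of set_mempolicy(). */
long ex_set_interleave(unsigned long nodemask)
{
	return syscall(SYS_set_mempolicy, EX_MPOL_INTERLEAVE,
		       &nodemask, EX_MASK_BITS + 1);
}
```

On a two-node machine that would be ex_set_interleave(ex_first_nodes_mask(2))
early in the process. Running the unmodified program under
"numactl --interleave=all" has the same effect without code changes.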
What I will admit is that I doubt anyone cares that file-backed pages
are not node-local as documented, as the cost of the IO itself probably
dominates, but just because something does not make sense does not mean
that nobody is depending on the behaviour.
That alone is pretty heavy justification even in the absence of supporting
data showing a workload that depends on file pages being node-local that
is not hidden by the cost of the IO itself.
> Reverting cross-node fairness for anon and slab is a good idea. It
> was always about cache and the original patch was too broad stroked,
> but it doesn't invalidate everything it was about.
>
No, it doesn't, but it should at least have been documented.
> I can see, however, that we might want to make this configurable, but
> I'm not eager on exporting user interfaces unless we have to. As the
> node-local fairness was never questioned by anybody, is it necessary
> to make it configurable?
It has only been there since 3.12 and it takes a long time for people to
notice NUMA regressions, especially ones that would just be within a few
percent, like this was, unless they were specifically looking for it.
> Shouldn't we be okay with just a single
> vm.pagecache_interleave (name by Rik) sysctl that defaults to 1 but
> allows users to go back to pagecache obeying mempolicy?
>
That can be done. I can put together a patch that defaults it to 0 and
sets the DISTRIBUTE_REMOTE_FILE flag if someone writes to it. That's a
crude hack but many people will be ok with it.
Making it the default should require more work, though.
Create an MPOL_DISTRIB_PAGECACHE memory policy (so named because it
is not strictly interleave). Abstract MPOL_DEFAULT to be either
MPOL_LOCAL or MPOL_DISTRIB_PAGECACHE depending on the value of
vm.pagecache_interleave. Update the manual pages and Documentation/, then
set the default of vm.pagecache_interleave to 1.
That would allow more sane defaults and also allow users to override it
on a per task and per VMA basis as they can for any other type of memory
policy.
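
As a rough userspace model of what I mean (not kernel code, and every
ex_* name here is made up for illustration), MPOL_DEFAULT would resolve
through the sysctl while an explicit task or VMA policy still wins. The
sketch assumes the usual precedence of VMA policy over task policy over
the system default:

```c
#include <assert.h>

/* Hypothetical policy identifiers; they do not match kernel values. */
enum ex_policy {
	EX_MPOL_DEFAULT,		/* "not set, fall back to the next level" */
	EX_MPOL_LOCAL,
	EX_MPOL_DISTRIB_PAGECACHE,	/* proposed, does not exist today */
	EX_MPOL_INTERLEAVE
};

/* Stand-in for the proposed vm.pagecache_interleave sysctl. */
int ex_sysctl_pagecache_interleave = 0;

/* What MPOL_DEFAULT would mean system-wide under the proposal. */
enum ex_policy ex_system_default(void)
{
	return ex_sysctl_pagecache_interleave ? EX_MPOL_DISTRIB_PAGECACHE
					      : EX_MPOL_LOCAL;
}

/* Resolve the policy that applies to an allocation: an explicit VMA
 * policy first, then an explicit task policy, then the system default. */
enum ex_policy ex_effective_policy(enum ex_policy task_policy,
				   enum ex_policy vma_policy)
{
	if (vma_policy != EX_MPOL_DEFAULT)
		return vma_policy;
	if (task_policy != EX_MPOL_DEFAULT)
		return task_policy;
	return ex_system_default();
}

/* Only file pages under the distributing policy may go to remote nodes
 * for fairness; anon and slab stay local in this model. */
int ex_may_distribute(enum ex_policy effective, int is_file_page)
{
	assert(effective != EX_MPOL_DEFAULT);	/* must be resolved first */
	return is_file_page && effective == EX_MPOL_DISTRIB_PAGECACHE;
}
```

With the sysctl at 1, a task that wants the old behaviour back would set
MPOL_LOCAL on itself or on a mapping, and ex_effective_policy() would
return that instead of the distributing default.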
--
Mel Gorman
SUSE Labs