[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <20141022135131.GB30802@dhcp22.suse.cz>
Date: Wed, 22 Oct 2014 15:51:31 +0200
From: Michal Hocko <mhocko@...e.cz>
To: Johannes Weiner <hannes@...xchg.org>
Cc: Vladimir Davydov <vdavydov@...allels.com>,
Andrew Morton <akpm@...ux-foundation.org>, linux-mm@...ck.org,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH] memcg: remove mem_cgroup_reclaimable check from soft
reclaim
On Wed 22-10-14 08:40:25, Johannes Weiner wrote:
> On Wed, Oct 22, 2014 at 01:21:16PM +0200, Michal Hocko wrote:
> > On Tue 21-10-14 14:22:39, Johannes Weiner wrote:
> > [...]
> > > From 27bd24b00433d9f6c8d60ba2b13dbff158b06c13 Mon Sep 17 00:00:00 2001
> > > From: Johannes Weiner <hannes@...xchg.org>
> > > Date: Tue, 21 Oct 2014 09:53:54 -0400
> > > Subject: [patch] mm: memcontrol: do not filter reclaimable nodes in NUMA
> > > round-robin
> > >
> > > The round-robin node reclaim currently tries to include only nodes
> > > that have memory of the memcg in question, which is quite elaborate.
> > >
> > > Just use plain round-robin over the nodes that are allowed by the
> > > task's cpuset, which are the most likely to contain that memcg's
> > > memory. But even if zones without memcg memory are encountered,
> > > direct reclaim will skip over them without too much hassle.
> >
> > I do not think that using the current's node mask is correct. Different
> > tasks in the same memcg might be bound to different nodes and then a set
> > of nodes might be reclaimed much more if a particular task hits limit
> > more often. It also doesn't make much sense from semantical POV, we are
> > reclaiming memcg so the mask should be union of all tasks allowed nodes.
>
> Unless the cpuset hierarchy is separate from the memcg hierarchy, all
> tasks in the memcg belong to the same cpuset. And the whole point of
> cpusets is that a group of tasks has the same nodemask, no?
Memory limit and memory placement are orthogonal configurations and they
might be stacked one on top of other in both directions.
> Sure, there are *possible* configurations for which this assumption
> breaks, like multiple hierarchies, but are they sensible? Do we care?
Why wouldn't they be sensible? What is wrong about limiting memory of
a load which internally uses node placement for its components?
> > What we do currently is overly complicated though and I agree that there
> > is no good reason for it.
> > Let's just s@...set_current_mems_allowed@...e_online_map@ and round
> > robin over all nodes. As you said we do not have to optimize for empty
> > zones.
>
> That was what I first had. And cpuset_current_mems_allowed defaults
> to node_online_map, but once the user sets up cpusets in conjunction
> with memcgs, it seems to be the preferred value.
>
> The other end of this is that if you have 16 nodes and use cpuset to
> bind the task to node 14 and 15, round-robin iterations of node 1-13
> will reclaim the group's memory on 14 and only the 15 iteration will
> actually look at memory from node 15 first.
mem_cgroup_select_victim_node can check reclaimability of the memcg
(hierarchy) and skip nodes without pages. Or would that be too
expensive? We are in the slow path already.
> It seems using the cpuset bindings, while theoretically independent,
> would do the right thing for all intents and purposes.
Only if cpuset is on top of memcg. Not the other way around as mentioned
above (possible node over-reclaim).
--
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Powered by blists - more mailing lists