Date:	Tue, 6 May 2014 15:55:31 -0400
From:	Johannes Weiner <hannes@...xchg.org>
To:	Michal Hocko <mhocko@...e.cz>
Cc:	Andrew Morton <akpm@...ux-foundation.org>,
	KAMEZAWA Hiroyuki <kamezawa.hiroyu@...fujitsu.com>,
	KOSAKI Motohiro <kosaki.motohiro@...fujitsu.com>,
	Greg Thelen <gthelen@...gle.com>,
	Michel Lespinasse <walken@...gle.com>,
	Tejun Heo <tj@...nel.org>, Hugh Dickins <hughd@...gle.com>,
	Roman Gushchin <klamm@...dex-team.ru>,
	LKML <linux-kernel@...r.kernel.org>, linux-mm@...ck.org,
	Rik van Riel <riel@...hat.com>
Subject: Re: [PATCH 1/4] memcg, mm: introduce lowlimit reclaim

On Tue, May 06, 2014 at 08:30:01PM +0200, Michal Hocko wrote:
> On Tue 06-05-14 12:51:50, Johannes Weiner wrote:
> > On Tue, May 06, 2014 at 06:12:56PM +0200, Michal Hocko wrote:
> > > On Tue 06-05-14 11:21:12, Johannes Weiner wrote:
> > > > On Tue, May 06, 2014 at 04:32:42PM +0200, Michal Hocko wrote:
> [...]
> > > > > The strongest point was made by Rik when he claimed that memcg is not
> > > > > aware of memory zones and so one memcg with lowlimit larger than the
> > > > > size of a zone can eat up that zone without any way to free it.
> > > > 
> > > > But who actually cares if an individual zone can be reclaimed?
> > > > 
> > > > Userspace allocations can fall back to any other zone.  Unless there
> > > > are hard bindings, but hopefully nobody binds a memcg to a node that
> > > > is smaller than that memcg's guarantee. 
> > > 
> > > The protected group might spill over onto another group's node and eat
> > > it up, while that other group would simply be pushed out from the node
> > > it is bound to.
> > 
> > I don't really understand the point you're trying to make.
> 
> I was just trying to show a case where an individual zone matters. To
> make it more specific, consider two groups: A (with a low limit of 60%
> of RAM) and B (say with a low limit of 10% of RAM), where B is bound to
> a node X (25% of RAM). Now, having 70% of RAM reserved for guarantees
> makes some sense, right? B is not over-committing the node it is bound
> to. Yet A's allocations might put pressure on X even though the whole
> system is still doing fine. This can lead to a situation where X gets
> depleted and nothing on it is reclaimable, leading to an OOM condition.
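
To put rough numbers on the scenario above: the percentages come from the
message, while the 64 GiB machine size and the reading that only B is bound
to X are assumptions made for illustration. A trivial userspace sketch of
the arithmetic:

/* Hypothetical figures for the A/B/node-X scenario; only the percentages
 * come from the thread.  Build with: gcc scenario.c
 */
#include <stdio.h>

int main(void)
{
	const double ram_gib     = 64.0;            /* total RAM (assumed)  */
	const double node_x      = 0.25 * ram_gib;  /* node X: 25% of RAM   */
	const double a_guarantee = 0.60 * ram_gib;  /* A's low limit: 60%   */
	const double b_guarantee = 0.10 * ram_gib;  /* B's low limit: 10%   */

	printf("node X size:   %5.1f GiB\n", node_x);
	printf("A's guarantee: %5.1f GiB (A not bound to any node)\n",
	       a_guarantee);
	printf("B's guarantee: %5.1f GiB (B bound to node X)\n", b_guarantee);

	/*
	 * A's guarantee alone exceeds the whole of node X, so nothing stops
	 * A's protected pages from filling X.  B, which may only allocate
	 * from X, can then hit OOM even though only 70% of system RAM is
	 * guaranteed overall.
	 */
	if (a_guarantee > node_x)
		printf("A's protected pages alone can fill node X.\n");
	return 0;
}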

Once you assume control of memory *placement* in the system like this,
you cannot also pretend to be clueless and let unreclaimable memory of
this magnitude spread into nodes used by other bound tasks.

If we were to actively support such configurations, we should be doing
direct NUMA balancing and migrate these pages out of node X when B
needs to allocate.  That would fix the problem for all unevictable
memory, not just memcg guarantees, and would prefer node-offloading
over swapping in cases where swap is available.
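
Mechanically, "migrate these pages out of node X" means page migration
rather than reclaim. The in-kernel balancing described above is a proposal,
not existing behavior; as a rough userspace illustration that movable (even
unreclaimable) pages can be relocated to another node without being freed,
here is a sketch using the move_pages(2) syscall via libnuma. The target
node number is an assumption; build with -lnuma.

#include <numaif.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	long page_size = sysconf(_SC_PAGESIZE);
	char *buf = aligned_alloc(page_size, page_size);
	if (!buf)
		return 1;
	memset(buf, 0, page_size);	/* fault the page in somewhere */

	void *pages[1]  = { buf };
	int   nodes[1]  = { 1 };	/* assumed target: "not node X" */
	int   status[1] = { -1 };

	/* Ask the kernel to migrate the page.  After the call, status[0]
	 * holds the node the page now lives on, or a negative errno if it
	 * could not be moved. */
	if (move_pages(0 /* self */, 1, pages, nodes, status, MPOL_MF_MOVE) < 0)
		perror("move_pages");
	else
		printf("status[0] = %d (node it now lives on, or -errno)\n",
		       status[0]);

	free(buf);
	return 0;
}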

But really, this whole scenario sounds contrived to me.  And there is
nothing specific about memcg guarantees in there.

> I can imagine that most people would rather see the low limit broken
> than hit OOM. And if there is somebody who really wants OOM even under
> such a condition, then why not; I would be happy to add a knob which
> would allow that. But I feel that the default behavior should be the
> least explosive one...

Memcgs being node-agnostic is a reason *for* doing hard guarantees,
not against it.  If I set up guarantees on a NUMA system balanced by
the kernel, I want them to be honored, and not have my guaranteed
memory reclaimed randomly due to kernel-internal placement decisions.

> > > > And while the pages are not
> > > > reclaimable, they are still movable, so the NUMA balancer is free to
> > > > correct any allocation mistakes later on.
> > > 
> > > Do we want to depend on NUMA balancer, though?
> > 
> > You're missing my point.
> > 
> > This is about which functionality of the system is actually impeded by
> > having large portions of a zone unreclaimable.  Freeing pages in a
> > zone is a means to an end, not an end in itself.
> > 
> > We wouldn't depend on the NUMA balancer to "free" a zone, I'm just
> > saying that the NUMA balancer would be unaffected by a zone full of
> > unreclaimable pages, as long as they are movable.
> 
> Agreed. I wasn't objecting to that part. I was merely noting that we do
> not want to depend on the NUMA balancer to fix up placements later just
> because pages are unreclaimable due to restrictions defined outside of
> the NUMA scope.

Again, this is not a new problem.  Solve it if you want to, but don't
design a new userspace ABI around a limitation in NUMA node reclaim.

> > So who exactly cares about the ability to reclaim individual zones and
> > how is it a new type of problem compared to existing unreclaimable but
> > movable memory?
> 
> The low limit makes the current situation different. The page allocator
> simply cannot make the best placement decisions because it has no idea
> which group the page will get charged to, and therefore whether it will
> be protected or not. NUMA balancing can help reduce these issues, but I
> do not think it can handle the problem by itself.
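
A toy userspace model of the ordering being described (not kernel code;
every identifier below is invented for illustration): placement consults
only per-zone state such as free pages, and the memcg charge, and with it
any low-limit protection, attaches only after the page has been placed.

#include <stdio.h>

struct zone  { int node; long free_pages; };
struct memcg { int id; long usage; long low_limit; };
struct page  { int node; int memcg; };

/* Placement: consults only per-zone state, knows nothing about low limits. */
static struct page alloc_page(struct zone *zones, int nr_zones)
{
	int best = 0;
	for (int i = 1; i < nr_zones; i++)
		if (zones[i].free_pages > zones[best].free_pages)
			best = i;
	zones[best].free_pages--;
	return (struct page){ .node = zones[best].node, .memcg = -1 };
}

/* Charging: happens after placement, against the allocating task's memcg. */
static void charge(struct page *p, struct memcg *cg)
{
	p->memcg = cg->id;
	cg->usage++;
}

int main(void)
{
	struct zone zones[2] = { { .node = 0, .free_pages = 10 },
				 { .node = 1, .free_pages = 50 } };
	struct memcg a = { .id = 0, .usage = 0, .low_limit = 1000 };

	struct page p = alloc_page(zones, 2);	/* node chosen, memcg unknown */
	charge(&p, &a);				/* protection decided here    */

	printf("page placed on node %d, then charged to memcg %d\n",
	       p.node, p.memcg);
	return 0;
}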

It depends on the task, not on the group.

You can turn your argument upside down: if you fail guarantees just
because a single zone is otherwise unreclaimable, then page allocator
placement ends up dictating which page is guaranteed memory and which
is not.  This really makes no sense to me.