linux-kernel - Re: [PATCH v2 0/4] memcg: Low-limit reclaim

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20140605154328.GX2878@cmpxchg.org>
Date:	Thu, 5 Jun 2014 11:43:28 -0400
From:	Johannes Weiner <hannes@...xchg.org>
To:	Michal Hocko <mhocko@...e.cz>
Cc:	Andrew Morton <akpm@...ux-foundation.org>,
	KAMEZAWA Hiroyuki <kamezawa.hiroyu@...fujitsu.com>,
	KOSAKI Motohiro <kosaki.motohiro@...fujitsu.com>,
	Greg Thelen <gthelen@...gle.com>,
	Michel Lespinasse <walken@...gle.com>,
	Tejun Heo <tj@...nel.org>, Hugh Dickins <hughd@...gle.com>,
	Roman Gushchin <klamm@...dex-team.ru>,
	LKML <linux-kernel@...r.kernel.org>, linux-mm@...ck.org,
	Rik van Riel <riel@...hat.com>
Subject: Re: [PATCH v2 0/4] memcg: Low-limit reclaim

On Thu, Jun 05, 2014 at 04:32:35PM +0200, Michal Hocko wrote:
> On Wed 04-06-14 11:44:08, Johannes Weiner wrote:
> > On Wed, Jun 04, 2014 at 04:46:58PM +0200, Michal Hocko wrote:
> > > On Tue 03-06-14 10:22:49, Johannes Weiner wrote:
> > > > On Tue, Jun 03, 2014 at 01:07:43PM +0200, Michal Hocko wrote:
> > > [...]
> > > > > If we consider that memcg and its limits are not zone aware while the
> > > > > page allocator and reclaim are zone oriented then I can see a problem
> > > > > of unexpected reclaim failure although there is no over commit on the
> > > > > low_limit globally. And we do not have in-kernel effective measures to
> > > > > mitigate this inherent problem. At least not now and I am afraid it is
> > > > > a long route to have something that would work reasonably well in such
> > > > > cases.
> > > > 
> > > > Which "inherent problem"?
> > > 
> > > zone unawareness of the limit vs. allocation/reclaim which are zone
> > > oriented.
> > 
> > This is a quote from another subthread where you haven't responded:
> > 
> > ---
> > 
> > > > > > But who actually cares if an individual zone can be reclaimed?
> > > > > > 
> > > > > > Userspace allocations can fall back to any other zone.  Unless there
> > > > > > are hard bindings, but hopefully nobody binds a memcg to a node that
> > > > > > is smaller than that memcg's guarantee. 
> > > > > 
> > > > > The protected group might spill over to another group and eat it when
> > > > > another group would be simply pushed out from the node it is bound to.
> > > > 
> > > > I don't really understand the point you're trying to make.
> > > 
> > > I was just trying to show a case where individual zone matters. To make
> > > it more specific consider 2 groups A (with low-limit 60% RAM) and B
> > > (say with low-limit 10% RAM) and bound to a node X (25% of RAM). Now
> > > having 70% of RAM reserved for guarantee makes some sense, right? B is
> > > not over-committing the node it is bound to. Yet the A's allocations
> > > might make pressure on X regardless that the whole system is still doing
> > > good. This can lead to a situation where X gets depleted and nothing
> > > would be reclaimable leading to an OOM condition.
> > 
> > Once you assume control of memory *placement* in the system like this,
> > you can not also pretend to be clueless and have unreclaimable memory
> > of this magnitude spread around into nodes used by other bound tasks.
> 
> You are still assuming that the administrator controls the placement.
> The load running in your memcg might be a black box for admin. E.g. a
> container which pays $$ to get a priority and not get reclaimed if that
> is possible. Admin can make sure that the cumulative low_limits for
> containers are sane but he doesn't have any control over what the loads
> inside are doing and potential OOM when one tries to DOS the other is
> definitely not welcome.

This is completely backwards, though: if you pay for guaranteed
memory, you don't want to get reclaimed just because some other task
that might not even have guarantees starts allocating with a
restricted node mask.  This breaks isolation.

For one, this can be used maliciously by intentionally binding a
low-priority task to a node with guaranteed memory and starting to
allocate.  Even with a small hard limit, you can just plow through
files to push guaranteed cache of the other group out of memory.

But even if it's not malicious, in such a scenario I'd still prefer
OOMing the task with the more restrictive node mask over reclaiming
guaranteed memory.

Then, on top of that, we can use direct migration to mitigate OOMs in
these scenarios (should we sufficiently care about them), but I'd much
prefer OOMs over breaking isolation and the possible priority
inversion that is inherent in the fallback on NUMA setups.

> > If we were to actively support such configurations, we should be doing
> > direct NUMA balancing and migrate these pages out of node X when B
> > needs to allocate. 
> 
> Migration is certainly a way how to reduce the risk. It is a question
> whether this is something to be done by the kernel implicitly or by
> administrator.

As long as the kernel is responsible for *any* placement - i.e. unless
you bind everything - it might as well be the kernel that fixes it up.

> > That would fix the problem for all unevictable
> > memory, not just memcg guarantees, and would prefer node-offloading
> > over swapping in cases where swap is available.
> 
> That would certainly lower the risk. But there still might be unmovable
> memory sitting on the node so this will never be 100%.

Yes, and as per above, I think in most cases it's actually preferable
to kill the bound task (or direct-migrate) over violating guarantees
of another task.

> > > > > So to me it sounds more responsible to promise only as much as we can
> > > > > handle. I think that fallback mode is not crippling the semantic of
> > > > > the knob as it triggers only for limit overcommit or strange corner
> > > > > cases. We have agreed that we do not care about the first one and
> > > > > handling the later one by potentially fatal action doesn't sounds very
> > > > > user friendly to me.
> > > > 
> > > > It *absolutely* cripples the semantics.  Think about the security use
> > > > cases of mlock for example, where certain memory may never hit the
> > > > platter.  This wouldn't be possible with your watered down guarantees.
> > > 
> > > Is this really a use case? It sounds like a weak one to me. Because
> > > any sudden memory consumption above the limit can reclaim your
> > > to-protect-page it will hit the platter and you cannot do anything about
> > > this. So yeah, this is not mlock.
> > 
> > You are right, that is a weak usecase.
> > 
> > It doesn't change the fact that it does severely weaken the semantics
> > and turns it into another best-effort mechanism that the user can't
> > count on.  This sucks.  It sucked with soft limits and it will suck
> > again.  The irony is that Greg even pointed out you should be doing
> > soft limits if you want this sort of behavior.
> 
> The question is whether we really _need_ hard guarantees. I came with
> the low_limit as a replacement for soft_limit which really sucks. But it
> sucks not because you cannot count on it. It is the way how it has
> opposite semantic which sucks - and the implementation of course. I have
> tried to fix it and that route was a no-go.

We need hard guarantees for actual isolation.  Otherwise you can't
charge for guarantees.

> I think the hard guarantee makes some sense when we allow to overcommit
> the limit. Somebody might really want to setup lowlimit == hardlimit
> because reclaim would be more harmful than restart of the application.
> I would however expect that this would be more of an exception rather
> than regular use. Most users I can think of will set low_limit to an
> effective working set size to be isolated from other loads and ephemeral
> reclaim will not hurt them. OOM would on other hand would be really
> harmful.

It's not an either-or because OOM would happen to one group, and
guaranteed memory reclaim would happen to another.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/