linux-kernel - Re: [patch 1/2] mm, memcg: avoid oom notification when current needs access to memory reserves

Message-ID: <20131219144134.GH10855@dhcp22.suse.cz>
Date:	Thu, 19 Dec 2013 15:41:34 +0100
From:	Michal Hocko <mhocko@...e.cz>
To:	David Rientjes <rientjes@...gle.com>
Cc:	Johannes Weiner <hannes@...xchg.org>,
	Andrew Morton <akpm@...ux-foundation.org>,
	KAMEZAWA Hiroyuki <kamezawa.hiroyu@...fujitsu.com>,
	linux-kernel@...r.kernel.org, linux-mm@...ck.org,
	cgroups@...r.kernel.org
Subject: Re: [patch 1/2] mm, memcg: avoid oom notification when current needs
 access to memory reserves

On Wed 18-12-13 22:09:12, David Rientjes wrote:
> On Wed, 18 Dec 2013, Michal Hocko wrote:
> 
> > > For memory isolation, we'd only want to bypass memcg charges when 
> > > absolutely necessary and it seems like TIF_MEMDIE is the only case where 
> > > that's required.  We don't give processes with pending SIGKILLs or those 
> > > in the exit() path access to memory reserves in the page allocator without 
> > > first determining that reclaim can't make any progress for the same reason 
> > > and then we only do so by setting TIF_MEMDIE when calling the oom killer.  
> > 
> > While I do understand arguments about isolation I would also like to be
> > practical here. How many charges are we talking about? Dozen pages? Much
> > more?
> 
> The PF_EXITING bypass is indeed much less concerning than the 
> fatal_signal_pending() bypass.

OK, so can we at least agree on the patch posted here:
https://lkml.org/lkml/2013/12/12/129. This is a real bug and definitely
worth fixing.

> > Besides that all of those should be very short lived because the task
> > is going to die very soon and so the memory will be freed.
> > 
> 
> We don't know how much memory is being allocated while 
> fatal_signal_pending() is true before the process can handle the SIGKILL, 
> so this could potentially bypass a significant amount of memory. 

The question is: does it in _practice_?

We have had this behavior since 867578cbccb08, which is 2.6.34, and we
haven't seen a single report where a killed task would break over the
limit too much. This would suggest that such a case doesn't happen very
often. If it happens, or it is easily triggerable, then I am all for
reverting that check, but that would require a proper justification
rather than speculation.

> If we are to have a configuration such as what Tejun recommended for
> oom handling:
> 
> 			 _____root______
> 			/		\
> 		    user		 oom
> 		   /    \		/   \
> 		  A	 B	       a     b
> 
> where the limit of A + B can be greater than the limit of user for 
> overcommit, and the limit of user is the amount of RAM minus whatever is 
> reserved for the oom hierarchy, then significant bypass to the root memcg 
> will cause memcgs in the oom hierarchy to actually not be able to allocate 
> memory from the page allocator.
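
To make the arithmetic behind the quoted concern concrete, here is a
minimal userspace sketch. It is not kernel code: the struct, the field
names and both helpers are hypothetical, and it assumes the oom
hierarchy is capped at exactly the reserved amount.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of the quoted hierarchy; not a kernel structure. */
struct hierarchy_model {
	size_t ram_pages;     /* total RAM, in pages */
	size_t oom_reserve;   /* pages set aside for the oom hierarchy */
	size_t user_usage;    /* pages charged to user (A + B) */
	size_t oom_usage;     /* pages charged to oom (a + b) */
	size_t root_bypassed; /* pages that bypassed user and landed in root */
};

/* Limit of "user": RAM minus whatever is reserved for the oom hierarchy. */
static size_t user_limit(const struct hierarchy_model *h)
{
	assert(h->oom_reserve <= h->ram_pages);
	return h->ram_pages - h->oom_reserve;
}

/* Pages the oom hierarchy can still get from the page allocator. */
static size_t oom_allocatable(const struct hierarchy_model *h)
{
	size_t used = h->user_usage + h->oom_usage + h->root_bypassed;
	size_t free_pages = used < h->ram_pages ? h->ram_pages - used : 0;
	size_t reserve_left = h->oom_usage < h->oom_reserve ?
			      h->oom_reserve - h->oom_usage : 0;

	/* Bounded by both the remaining reserve and the free memory. */
	return free_pages < reserve_left ? free_pages : reserve_left;
}
```

With 1000 pages of RAM, a 100 page reserve and user sitting at its 900
page limit, the oom hierarchy can still allocate 100 pages. If 60 pages
bypass the user limit and are accounted to root, only 40 remain.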

I can imagine that the killed task might be in the middle of an
allocation loop and rather far away from returning to userspace (e.g.
readahead comes to mind - although that one shouldn't cause the global
OOM).
I would argue that we shouldn't reclaim in such a case and rather fail
the charge. Reclaiming will not help us much. In an extreme case we
would end up in OOM and the killed task would get TIF_MEMDIE and so it
would be allowed to bypass charges and break the isolation anyway.
Can we fail charges for killed tasks in general? I am very skeptical
because this might be a regular allocation needed to make progress on
the way out.

So this doesn't solve the isolation problem, it just postpones it and
makes life worse for other tasks in the same memcg, because their
memory gets reclaimed, which can lead to various performance issues.
And all of that for temporary charges which will go away shortly.
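
The two positions in this thread can be sketched as a small userspace
model. This is not the memcg charge path: every name below is
hypothetical, the three flags only stand in for TIF_MEMDIE,
fatal_signal_pending() and PF_EXITING, and the model assumes that
usage never exceeds the limit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the task state discussed in the thread. */
struct task_model {
	bool memdie;       /* models TIF_MEMDIE */
	bool fatal_signal; /* models fatal_signal_pending() */
	bool exiting;      /* models PF_EXITING */
};

struct memcg_model {
	size_t usage;    /* pages charged so far, assumed <= limit */
	size_t limit;    /* hard limit, in pages */
	size_t bypassed; /* pages let through without being charged */
};

enum bypass_policy {
	BYPASS_MEMDIE_ONLY, /* strict isolation: only oom victims bypass */
	BYPASS_DYING,       /* also bypass killed and exiting tasks */
};

enum charge_result { CHARGE_OK, CHARGE_BYPASS, CHARGE_NEEDS_RECLAIM };

static enum charge_result try_charge_model(struct memcg_model *cg,
					   const struct task_model *t,
					   size_t nr_pages,
					   enum bypass_policy policy)
{
	bool dying = t->fatal_signal || t->exiting;

	assert(cg->usage <= cg->limit);

	/* Bypass before charging, so a dying task never triggers reclaim. */
	if (t->memdie || (policy == BYPASS_DYING && dying)) {
		cg->bypassed += nr_pages;
		return CHARGE_BYPASS;
	}

	if (nr_pages <= cg->limit - cg->usage) {
		cg->usage += nr_pages;
		return CHARGE_OK;
	}

	/* Over the limit: the caller would have to reclaim or fail. */
	return CHARGE_NEEDS_RECLAIM;
}
```

Under BYPASS_DYING a killed task at the limit is let through and
nothing is reclaimed from its neighbours. Under BYPASS_MEMDIE_ONLY the
same task ends up in the reclaim path first, and only bypasses once it
has been selected as an oom victim.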

> The PF_EXITING bypass is much less concerning because we shouldn't be 
> doing significant memory allocation in the exit() path, but it's also true 
> that neither the PF_EXITING nor the fatal_signal_pending() bypass is 
> required. 

Yes, it is not, strictly speaking, required. It is very practical to do,
though. We do not know much about the context which called us, so we
cannot make a well-informed decision, and just doing reclaim to see
what happens sounds like a bad idea to me.

> In Tejun's suggested configuration above, we absolutely do want 
> to reclaim from the user hierarchy before declaring oom and setting 
> TIF_MEMDIE, otherwise the oom hierarchy cannot allocate.
> 
> > So from my POV I would like to see these heuristics as simple as
> > possible and placed at very few places. Doing a bypass before charge
> > - or even after a failed charge before doing reclaim sounds like an easy
> > enough heuristic without a big risk.
> 
> It's a very significant risk of depleting memory that is available for oom 
> handling in the suggested configuration.

-- 
Michal Hocko
SUSE Labs