linux-kernel - Re: [PATCH] memcg: Do not hang on OOM when killed by userspace OOM access to memory reserves

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <alpine.DEB.2.02.1401151304430.10727@chino.kir.corp.google.com>
Date:	Wed, 15 Jan 2014 13:19:21 -0800 (PST)
From:	David Rientjes <rientjes@...gle.com>
To:	Michal Hocko <mhocko@...e.cz>
cc:	Andrew Morton <akpm@...ux-foundation.org>,
	Johannes Weiner <hannes@...xchg.org>,
	KAMEZAWA Hiroyuki <kamezawa.hiroyu@...fujitsu.com>,
	linux-kernel@...r.kernel.org, linux-mm@...ck.org,
	cgroups@...r.kernel.org,
	"Eric W. Biederman" <ebiederm@...ssion.com>
Subject: Re: [PATCH] memcg: Do not hang on OOM when killed by userspace OOM
 access to memory reserves

On Wed, 15 Jan 2014, Michal Hocko wrote:

> > > > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > > > > index b8dfed1b9d87..b86fbb04b7c6 100644
> > > > > --- a/mm/memcontrol.c
> > > > > +++ b/mm/memcontrol.c
> > > > > @@ -2685,7 +2685,8 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
> > > > >  	 * MEMDIE process.
> > > > >  	 */
> > > > >  	if (unlikely(test_thread_flag(TIF_MEMDIE)
> > > > > -		     || fatal_signal_pending(current)))
> > > > > +		     || fatal_signal_pending(current))
> > > > > +		     || current->flags & PF_EXITING)
> > > > >  		goto bypass;
> > > > >  
> > > > >  	if (unlikely(task_in_memcg_oom(current)))
> > > > 
> > > > This would become problematic if significant amount of memory is charged 
> > > > in the exit() path. 
> > > 
> > > But this would hurt also for fatal_signal_pending tasks, wouldn't it?
> > 
> > Yes, and as I've said twice now, that should be removed. 
> 
> And you failed to provide any relevant data to back your suggestions. I
> have told you that we have these heuristics for ages and we need a
> strong justification to drop them. So if you really think that they are
> not appropriate then back your statements with real data.
> 
> E.g. measure how much memory are we talking about.
> 

The heuristic may have existed for ages, but the proposed memcg 
configuration for preserving memory such that userspace oom handlers may 
run such as

			 _____root______
			/		\
		    user		 oom
		   /	\		/   \
		   A	B	 	a   b

where user/memory.limit_in_bytes == [amount of present RAM] + 
oom/memory.limit_in_bytes - [some fudge] causes all bypasses to be 
problematic, including Johannes' buggy bypass for charges in memcgs with 
pending memcgs that has since been fixed after I identified it.  This 
bypass is included.  Processes attached to "a" and "b" are userspace oom 
handlers for processes attached to "A" and "B", respectively.

The amount of memory you're talking about is proportional to the number of 
processes that have pending SIGKILLs (and now those with PF_EXITING set), 
the former of which is obviously more concerning since they could be 
charging memory at any point in the kernel that would succeed.  The latter 
is concerning only if future memory allocation post-PF_EXITING would be 
become significant and nobody is going to think about oom memcg bypasses 
in such a case.

To use the configuration suggested above, we need to prevent as many 
bypasses as possible to the root memcg.  Otherwise, the memory protected 
for the "oom" memcg from processes constrained by the limit of "user" is 
no longer protected.  This isn't only a problem with the bypasses here in 
the charging path, but also unaccounted kernel memory, for example.

For this to be usable, we need to ensure that the limit of the "oom" memcg 
is protected for the userspace oom handlers that are attached.  With a 
charge bypassed to the root memcg greater than or equal to the limit of 
the "oom" memcg OR cumulative charges bypassed to the root memcg greater 
than or equal to the limit of the "oom" memcg by processes with pending 
SIGKILLs, userspace oom handlers cannot respond.  That's particuarly 
dangerous without a memcg oom kill delay, as proposed before, since 
userspace must disable oom killing entirely for both "A" and "B" for 
userspace notification to be meaningful, since all processes are now 
livelocked.

> > These bypasses should be given to one thread and one thread only,
> > which would be the oom killed thread if it needs access to memory
> > reserves to either allocate memory or charge memory.
> 
> There is no way to determine whether a task has been killed due to user
> space OOM killer or by a regular kill.
> 

I'm referring to only granting TIF_MEMDIE to a single process in any memcg 
hierarchy at or below the memcg that has encountered its limit to avoid 
granting it to many processes and bypassing their charges to the root 
memcg; the same variation of the above code, but going through the memcg 
oom killer to get TIF_MEMDIE first.  We must be vigilant and only grant 
TIF_MEMDIE for the process that shall exit.

> > If you are suggesting we use the "user" and "oom" top-level memcg 
> > hierarchy for allowing memory to be available for userspace system oom 
> > handlers, then this has become important when in the past it may have been 
> > a minor point.
> 
> I am not sure it would be _that_ important and if that really becomes to
> be the case then we should deal with it. So far I haven't see any
> evidence there is a lot of memory charged on the exit path.
> 

I'm debating both fatal_signal_pending() and PF_EXITING here since they 
are now both bypasses, we need to remove fatal_signal_pending().  My 
simple question with your patch: how do you guarantee memory to processes 
attached to "a" and "b"?
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/