linux-kernel - Re: [patch 1/2] mm, memcg: avoid oom notification when current needs access to memory reserves

Date:	Wed, 18 Dec 2013 21:04:34 +0100
From:	Michal Hocko <mhocko@...e.cz>
To:	David Rientjes <rientjes@...gle.com>
Cc:	Johannes Weiner <hannes@...xchg.org>,
	Andrew Morton <akpm@...ux-foundation.org>,
	KAMEZAWA Hiroyuki <kamezawa.hiroyu@...fujitsu.com>,
	linux-kernel@...r.kernel.org, linux-mm@...ck.org,
	cgroups@...r.kernel.org
Subject: Re: [patch 1/2] mm, memcg: avoid oom notification when current needs
 access to memory reserves

On Tue 17-12-13 12:50:09, David Rientjes wrote:
> On Tue, 17 Dec 2013, Michal Hocko wrote:
> 
> > > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > > > index c72b03bf9679..fee25c5934d2 100644
> > > > --- a/mm/memcontrol.c
> > > > +++ b/mm/memcontrol.c
> > > > @@ -2692,7 +2693,8 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
> > > >  	 * MEMDIE process.
> > > >  	 */
> > > >  	if (unlikely(test_thread_flag(TIF_MEMDIE)
> > > > -		     || fatal_signal_pending(current)))
> > > > +		     || fatal_signal_pending(current))
> > > > +		     || current->flags & PF_EXITING)
> > > >  		goto bypass;
> > > >  
> > > >  	if (unlikely(task_in_memcg_oom(current)))
> > > > 
> > > > rather than the later checks down the oom_synchronize paths. The comment
> > > > already mentions dying process...
> > > > 
> > > 
> > > This is scary because it doesn't even try to reclaim memcg memory before 
> > > allowing the allocation to succeed.
> > 
> > Why should it reclaim in the first place when it simply is on the way to
> > release memory? In other words why should it increase the memory
> > pressure when it is in fact releasing it?
> > 
> 
> (Answering about removing the fatal_signal_pending() check as well here.)
> 
> For memory isolation, we'd only want to bypass memcg charges when 
> absolutely necessary and it seems like TIF_MEMDIE is the only case where 
> that's required.  We don't give processes with pending SIGKILLs or those 
> in the exit() path access to memory reserves in the page allocator without 
> first determining that reclaim can't make any progress for the same reason 
> and then we only do so by setting TIF_MEMDIE when calling the oom killer.  

While I do understand the arguments about isolation, I would also like
to be practical here. How many charges are we talking about? A dozen
pages? Many more?
Besides that, all of those should be very short-lived because the task
is going to die very soon, so the memory will be freed.

So from my POV I would like to see these heuristics kept as simple as
possible and placed in very few places. Doing a bypass before the charge,
or even after a failed charge but before doing reclaim, sounds like an
easy enough heuristic without a big risk.
I have a really hard time seeing big benefits in forcing reclaim for a
very short-lived charge, because this might lead to different and much
worse side effects than quantum noise.
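
To make the "bypass before charge" heuristic concrete, here is a minimal
userspace sketch in C. It is not kernel code: task_model, memcg_model and
try_charge_model() are hypothetical names that only model the three
conditions discussed above (TIF_MEMDIE, fatal_signal_pending(current) and
PF_EXITING), and the reclaim step is a crude assumption that frees at most
the pages needed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the task state; these are not kernel types. */
struct task_model {
	bool memdie;        /* models test_thread_flag(TIF_MEMDIE) */
	bool fatal_signal;  /* models fatal_signal_pending(current) */
	bool exiting;       /* models current->flags & PF_EXITING */
};

/* Hypothetical model of a memcg, counted in pages. */
struct memcg_model {
	size_t limit;          /* hard limit */
	size_t usage;          /* pages currently charged */
	size_t reclaimable;    /* pages that reclaim could free */
	unsigned reclaim_runs; /* how often reclaim was triggered */
};

enum charge_result { CHARGE_OK, CHARGE_BYPASS, CHARGE_OOM };

static bool task_is_dying(const struct task_model *t)
{
	return t->memdie || t->fatal_signal || t->exiting;
}

static enum charge_result try_charge_model(struct memcg_model *cg,
					   const struct task_model *t,
					   size_t nr_pages)
{
	assert(nr_pages > 0);

	/* Bypass before charge: a dying task is not charged and never
	 * triggers reclaim, so it adds no memory pressure. */
	if (task_is_dying(t))
		return CHARGE_BYPASS;

	if (cg->usage + nr_pages > cg->limit) {
		size_t excess = cg->usage + nr_pages - cg->limit;
		size_t freed = excess < cg->reclaimable ? excess
							: cg->reclaimable;

		if (freed > cg->usage)
			freed = cg->usage;

		/* Assumed reclaim model: free only what is needed. */
		cg->reclaim_runs++;
		cg->reclaimable -= freed;
		cg->usage -= freed;

		if (cg->usage + nr_pages > cg->limit)
			return CHARGE_OOM;
	}

	cg->usage += nr_pages;
	return CHARGE_OK;
}
```

In this model a task with any of the three flags set returns
CHARGE_BYPASS with reclaim_runs unchanged, while the same request from an
ordinary task at the limit goes through reclaim first. Restricting the
bypass to TIF_MEMDIE only, as David argues, would mean testing just
t->memdie in task_is_dying().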

Maybe I am missing something and we can charge a lot during exit but
then I think we should fix the exit path to not allocate that much.

> > I am really puzzled here. On one hand you are strongly arguing for not
> > notifying when we know we can prevent from OOM action and on the other
> > hand you are ok to get vmpressure/thresholds notification when an
> > exiting task triggers reclaim.
> > 
> > So I am really lost in what you are trying to achieve here. It sounds a
> > bit arbitrary.
> > 
> 
> It's not arbitrary to define when memcg bypass is allowed and, in my 
> opinion, it should only be done in situations where it is unavoidable and 
> therefore breaking memory isolation is required.
> 
> (We wouldn't expect a 128MB memcg to be oom [and perhaps with a userspace 
> oom handler attached] when it has 100 children each 1MB in size just 
> because they all happen to be oom at the same time.  We set up the excess 

s/oom/exiting/ ?

> memory in the parent specifically for the memcg with the oom handler 
> attached.)

I am not sure I understand what you meant here.

-- 
Michal Hocko
SUSE Labs