linux-kernel - Re: [PATCH] oom killer: break from infinite loop

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20100402101711.GC12886@csn.ul.ie>
Date:	Fri, 2 Apr 2010 11:17:11 +0100
From:	Mel Gorman <mel@....ul.ie>
To:	David Rientjes <rientjes@...gle.com>
Cc:	Oleg Nesterov <oleg@...hat.com>, anfei <anfei.zhou@...il.com>,
	Andrew Morton <akpm@...ux-foundation.org>,
	KOSAKI Motohiro <kosaki.motohiro@...fujitsu.com>,
	nishimura@....nes.nec.co.jp,
	KAMEZAWA Hiroyuki <kamezawa.hiroyu@...fujitsu.com>,
	linux-mm@...ck.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH] oom killer: break from infinite loop

On Sun, Mar 28, 2010 at 02:21:01PM -0700, David Rientjes wrote:
> On Sun, 28 Mar 2010, Oleg Nesterov wrote:
> 
> > I see. But still I can't understand. To me, the problem is not that
> > B can't exit, the problem is that A doesn't know it should exit. All
> > threads should exit and free ->mm. Even if B could exit, this is not
> > enough. And, to some extent, it doesn't matter if it holds mmap_sem
> > or not.
> > 
> > Don't get me wrong. Even if I don't understand oom_kill.c the patch
> > looks obviously good to me, even from "common sense" pov. I am just
> > curious.
> > 
> > So, my understanding is: we are going to kill the whole thread group
> > but TIF_MEMDIE is per-thread. Mark the whole thread group as TIF_MEMDIE
> > so that any thread can notice this flag and (say, __alloc_pages_slowpath)
> > fail asap.
> > 
> > Is my understanding correct?
> > 
> 
> [Adding Mel Gorman <mel@....ul.ie> to the cc]
> 

Sorry for the delay.

> The problem with this approach is that we could easily deplete all memory 
> reserves if the oom killed task has an extremely large number of threads, 
> there has always been only a single thread with TIF_MEMDIE set per cpuset 
> or memcg; for systems that don't run with cpusets or memory controller,
> this has been limited to one thread with TIF_MEMDIE for the entire system.
> 
> There's risk involved with suddenly allowing 1000 threads to have 
> TIF_MEMDIE set and the chances of fully depleting all allowed zones is 
> much higher if they allocate memory prior to exit, for example.
> 
> An alternative is to fail allocations if they are failable and the 
> allocating task has a pending SIGKILL.  It's better to preempt the oom 
> killer since current is going to be exiting anyway and this avoids a 
> needless kill.
> 
> That's possible if it's guaranteed that __GFP_NOFAIL allocations with a 
> pending SIGKILL are granted ALLOC_NO_WATERMARKS to prevent them from 
> endlessly looping while making no progress.
> 
> Comments?
> ---
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1610,13 +1610,21 @@ try_next_zone:
>  }
>  
>  static inline int
> -should_alloc_retry(gfp_t gfp_mask, unsigned int order,
> +should_alloc_retry(struct task_struct *p, gfp_t gfp_mask, unsigned int order,
>  				unsigned long pages_reclaimed)
>  {
>  	/* Do not loop if specifically requested */
>  	if (gfp_mask & __GFP_NORETRY)
>  		return 0;
>  
> +	/* Loop if specifically requested */
> +	if (gfp_mask & __GFP_NOFAIL)
> +		return 1;
> +

Meh, you could have preserved the comment but no biggie.

> +	/* Task is killed, fail the allocation if possible */
> +	if (fatal_signal_pending(p))
> +		return 0;
> +

Seems reasonable. This will be checked on every major loop in the
allocator slow patch.

>  	/*
>  	 * In this implementation, order <= PAGE_ALLOC_COSTLY_ORDER
>  	 * means __GFP_NOFAIL, but that may not be true in other
> @@ -1635,13 +1643,6 @@ should_alloc_retry(gfp_t gfp_mask, unsigned int order,
>  	if (gfp_mask & __GFP_REPEAT && pages_reclaimed < (1 << order))
>  		return 1;
>  
> -	/*
> -	 * Don't let big-order allocations loop unless the caller
> -	 * explicitly requests that.
> -	 */
> -	if (gfp_mask & __GFP_NOFAIL)
> -		return 1;
> -
>  	return 0;
>  }
>  
> @@ -1798,6 +1799,7 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
>  	if (likely(!(gfp_mask & __GFP_NOMEMALLOC))) {
>  		if (!in_interrupt() &&
>  		    ((p->flags & PF_MEMALLOC) ||
> +		     (fatal_signal_pending(p) && (gfp_mask & __GFP_NOFAIL)) ||

This is a lot less clear. GFP_NOFAIL is rare so this is basically saying
that all threads with a fatal signal pending can ignore watermarks. This
is dangerous because if 1000 threads get killed, there is a possibility
of deadlocking the system.

Why not obey the watermarks and just not retry the loop later and fail
the allocation?

>  		     unlikely(test_thread_flag(TIF_MEMDIE))))
>  			alloc_flags |= ALLOC_NO_WATERMARKS;
>  	}
> @@ -1812,6 +1814,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  	int migratetype)
>  {
>  	const gfp_t wait = gfp_mask & __GFP_WAIT;
> +	const gfp_t nofail = gfp_mask & __GFP_NOFAIL;
>  	struct page *page = NULL;
>  	int alloc_flags;
>  	unsigned long pages_reclaimed = 0;
> @@ -1876,7 +1879,7 @@ rebalance:
>  		goto nopage;
>  
>  	/* Avoid allocations with no watermarks from looping endlessly */
> -	if (test_thread_flag(TIF_MEMDIE) && !(gfp_mask & __GFP_NOFAIL))
> +	if (test_thread_flag(TIF_MEMDIE) && !nofail)
>  		goto nopage;
>  
>  	/* Try direct reclaim and then allocating */
> @@ -1888,6 +1891,10 @@ rebalance:
>  	if (page)
>  		goto got_pg;
>  
> +	/* Task is killed, fail the allocation if possible */
> +	if (fatal_signal_pending(p) && !nofail)
> +		goto nopage;
> +

Again, I would expect this to be caught by should_alloc_retry().

>  	/*
>  	 * If we failed to make any progress reclaiming, then we are
>  	 * running out of options and have to consider going OOM
> @@ -1909,8 +1916,7 @@ rebalance:
>  			 * made, there are no other options and retrying is
>  			 * unlikely to help.
>  			 */
> -			if (order > PAGE_ALLOC_COSTLY_ORDER &&
> -						!(gfp_mask & __GFP_NOFAIL))
> +			if (order > PAGE_ALLOC_COSTLY_ORDER && !nofail)
>  				goto nopage;
>  
>  			goto restart;
> @@ -1919,7 +1925,7 @@ rebalance:
>  
>  	/* Check if we should retry the allocation */
>  	pages_reclaimed += did_some_progress;
> -	if (should_alloc_retry(gfp_mask, order, pages_reclaimed)) {
> +	if (should_alloc_retry(p, gfp_mask, order, pages_reclaimed)) {
>  		/* Wait for some write requests to complete then retry */
>  		congestion_wait(BLK_RW_ASYNC, HZ/50);
>  		goto rebalance;
> 

I'm ok with the should_alloc_retry() change but am a lot less ok with ignoring
watermarks just because a fatal signal is pending and I think the nofail
changes  to __alloc_pages_slowpath() are unnecessary as should_alloc_retry()
should end up failing the allocations.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/