linux-kernel - Re: upcoming kerneloops.org item: get_page_from

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <alpine.DEB.2.00.0906301216180.16312@chino.kir.corp.google.com>
Date:	Tue, 30 Jun 2009 12:35:59 -0700 (PDT)
From:	David Rientjes <rientjes@...gle.com>
To:	Mel Gorman <mel@....ul.ie>
cc:	Andrew Morton <akpm@...ux-foundation.org>,
	Linus Torvalds <torvalds@...ux-foundation.org>,
	Pekka Enberg <penberg@...helsinki.fi>, arjan@...radead.org,
	linux-kernel@...r.kernel.org,
	Christoph Lameter <cl@...ux-foundation.org>,
	Nick Piggin <npiggin@...e.de>
Subject: Re: upcoming kerneloops.org item: get_page_from_freelist

On Tue, 30 Jun 2009, Mel Gorman wrote:

> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 175a67a..5f4656e 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -230,14 +230,21 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
>  		/*
>  		 * This task already has access to memory reserves and is
>  		 * being killed. Don't allow any other task access to the
> -		 * memory reserve.
> +		 * memory reserve unless the current process is the one
> +		 * selected for OOM-killing. If the current process has
> +		 * been OOM-killed and we are OOM again, another process
> +		 * needs to be considered for OOM-kill
>  		 *
>  		 * Note: this may have a chance of deadlock if it gets
>  		 * blocked waiting for another task which itself is waiting
>  		 * for memory. Is there a better alternative?
>  		 */
> -		if (test_tsk_thread_flag(p, TIF_MEMDIE))
> -			return ERR_PTR(-1UL);
> +		if (test_tsk_thread_flag(p, TIF_MEMDIE)) {
> +			if (p == current)
> +				continue;
> +			else
> +				return ERR_PTR(-1UL);
> +		}
>  
>  		/*
>  		 * This is in the process of releasing memory so wait for it

This will panic the machine if current is the only user thread running or 
eligible for oom kill (all others could have mm's with OOM_DISABLE set).  
Currently, such tasks can exit or kthreads can free memory so that the oom 
is recoverable.

The problem with this approach is that it increases the liklihood that 
memory reserves will be totally depleted when several threads are 
competing for them.

> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 5d714f8..5896469 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1539,6 +1539,14 @@ should_alloc_retry(gfp_t gfp_mask, unsigned int order,
>  	if (gfp_mask & __GFP_NORETRY)
>  		return 0;
>  
> +	/* Do not loop if OOM-killed unless __GFP_NOFAIL is specified */
> +	if (test_thread_flag(TIF_MEMDIE)) {
> +		if (gfp_mask & __GFP_NOFAIL)
> +			WARN(1, "Potential infinite loop with __GFP_NOFAIL");
> +		else
> +			return 0;
> +	}
> +

There's a second bug in the refactored page allocator: when the oom killer 
is invoked and it happens to kill current, alloc_flags is never updated 
because it loops back to `restart', which is past gfp_to_alloc_flags().

When that is fixed with the patch below, gfp_mask & __GFP_NOFAIL will 
never be true here in your scenario, where the oom killer kills current, 
because __alloc_pages_high_priority() will infinitely loop.

>  	/*
>  	 * In this implementation, order <= PAGE_ALLOC_COSTLY_ORDER
>  	 * means __GFP_NOFAIL, but that may not be true in other
> 

This is needed for 2.6.31-rc2.


mm: update alloc_flags after oom killer has been called

It is possible for the oom killer to select current as the task to kill.  
When this happens, alloc_flags needs to be updated accordingly to set 
ALLOC_NO_WATERMARKS so the subsequent allocation attempt may use memory 
reserves as the result of its thread having TIF_MEMDIE set if the 
allocation is not __GFP_NOMEMALLOC.

Signed-off-by: David Rientjes <rientjes@...gle.com>
---
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1756,6 +1756,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 
 	wake_all_kswapd(order, zonelist, high_zoneidx);
 
+restart:
 	/*
 	 * OK, we're below the kswapd watermark and have kicked background
 	 * reclaim. Now things get more complex, so set up alloc_flags according
@@ -1763,7 +1764,6 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	 */
 	alloc_flags = gfp_to_alloc_flags(gfp_mask);
 
-restart:
 	/* This is the last chance, in general, before the goto nopage. */
 	page = get_page_from_freelist(gfp_mask, nodemask, order, zonelist,
 			high_zoneidx, alloc_flags & ~ALLOC_NO_WATERMARKS,
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/