linux-kernel - Re: upcoming kerneloops.org item: get_page_from

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20090629153007.GD5065@csn.ul.ie>
Date:	Mon, 29 Jun 2009 16:30:07 +0100
From:	Mel Gorman <mel@....ul.ie>
To:	Andrew Morton <akpm@...ux-foundation.org>
Cc:	Linus Torvalds <torvalds@...ux-foundation.org>,
	penberg@...helsinki.fi, arjan@...radead.org,
	linux-kernel@...r.kernel.org, cl@...ux-foundation.org,
	npiggin@...e.de, David Rientjes <rientjes@...gle.com>
Subject: Re: upcoming kerneloops.org item: get_page_from_freelist

On Wed, Jun 24, 2009 at 02:56:15PM -0700, Andrew Morton wrote:
> On Wed, 24 Jun 2009 13:13:48 -0700 (PDT)
> Linus Torvalds <torvalds@...ux-foundation.org> wrote:
> > 
> > On Wed, 24 Jun 2009, Andrew Morton wrote:
> > > 
> > > If the caller gets oom-killed, the allocation attempt fails.  Callers need
> > > to handle that.
> > 
> > I actually disagree. I think we should just admit that we can always free 
> > up enough space to get a few pages, in order to then oom-kill things.
> 
> I'm unclear on precisely what you're proposing here?
> 

As order <= PAGE_ALLOC_COSTLY_ORDER implies __GFP_NOFAIL, prehaps it
makes sense to change the check to

WARN_ON_ONCE(order > PAGE_ALLOC_COSTLY_ORDER)

? The temptation might be there to remove __GFP_NOFAIL for smaller orders but
it makes sense to have it available in case CONFIG_FAULT_INJECTION_DEBUG_FS
is set and randomly failing allocations that have serious consequences even
if handled.

> > This is not a new concept. oom has never been "immediately kill".
> 
> Well, it has been immediate for a long time.  A couple of reasons which
> I can recall:
> 
> - A page-allocating process will oom-kill another process in the
>   expectation that the killing will free up some memory.  If the
>   oom-killed process remains stuck in the page allocator, that doesn't
>   work.
> 
> - The oom-killed process might be holding locks (typically fs locks).
>   This can cause an arbitrary number of other processes to be blocked.
>   So to get the system unstuck we need the oom-killed process to
>   immediately exit the page allocator, to handle the NULL return and to
>   drop those locks.
> 
> There may be other reasons - it was all a long time ago, and I've never
> personally hacked on the oom-killer much and I never get oom-killed. 
> But given the amount of development work which goes on in there, some
> people must be getting massacred.
> 
> 
> A long time ago, the Suse kernel shipped with a largely (or
> completely?) disabled oom-killer.  It removed the
> retry-small-allocations-for-ever logic and simply returned NULL to the
> caller.  I never really understood what problem/thinking led Andrea to
> do that.
> 
> 
> But it's all a bit moot at present, as we seem to have removed the
> return-NULL-if-TIF_MEMDIE logic in Mel's post-2.6.30 merges.  I think
> that was an accident:
> 
> -	/* This allocation should allow future memory freeing. */
> -
>  rebalance:
> -	if (((p->flags & PF_MEMALLOC) || unlikely(test_thread_flag(TIF_MEMDIE)))
> -			&& !in_interrupt()) {
> -		if (!(gfp_mask & __GFP_NOMEMALLOC)) {
> -nofail_alloc:
> -			/* go through the zonelist yet again, ignoring mins */
> -			page = get_page_from_freelist(gfp_mask, nodemask, order,
> -				zonelist, high_zoneidx, ALLOC_NO_WATERMARKS);
> -			if (page)
> -				goto got_pg;
> -			if (gfp_mask & __GFP_NOFAIL) {
> -				congestion_wait(WRITE, HZ/50);
> -				goto nofail_alloc;
> -			}
> -		}
> -		goto nopage;
> +	/* Allocate without watermarks if the context allows */
> +	if (alloc_flags & ALLOC_NO_WATERMARKS) {
> +		page = __alloc_pages_high_priority(gfp_mask, order,
> +				zonelist, high_zoneidx, nodemask,
> +				preferred_zone, migratetype);
> +		if (page)
> +			goto got_pg;
>  	}
> 
> Offending commit 341ce06 handled the PF_MEMALLOC case but forgot about
> the TIF_MEMDIE case.
> 
> Mel is having a bit of downtime at present.

I'm getting back online now and playing catch-up. You're right in that
TIF_MEMDIE returning NULL has been broken and it's possible in theory for an
OOM-killed process to loop forever. But maybe TIF_MEMDIE looping potentially
forever is expected in the case __GFP_NOFAIL is specified. Fixing this to allow
an OOM-killed process to exit does mean that callers using __GFP_NOFAIL must
still handle NULL being returned which might be very unexpected to the caller.

In 2.6.30, TIF_MEMDIE could cause a request to exit without ever entering
direct reclaim but chances are this didn't happen as it would have been
looping during an OOM-kill. To duplicate this, a check for TIF_MEMDIE would
happen after

	/* Avoid recursion of direct reclaim */
        if (p->flags & PF_MEMALLOC)
                goto nopage;

But as failing __GFP_NOFAIL is potentially serious, even for processes that
have been OOM killed, I think it makes more sense to check for TIF_MEMDIE
after direct reclaim and OOM killing have already been considered as options
with a patch such as the following?

==== CUT HERE ====
page-allocator: Ensure that processes that have been OOM killed exit the page allocator

Processes that have been OOM killed set the thread flag TIF_MEMDIE. A
process such as this is expected to exit the page allocator but in the
event it happens to have set __GFP_NOFAIL, it potentially loops forever.

This patch checks TIF_MEMDIE when deciding whether to loop again in the
page allocator. Such a process will now return NULL after direct reclaim
and OOM killing have both been considered as options. The potential
problem is that a __GFP_NOFAIL allocation can still return failure so
callers must still handle getting returned NULL.

Signed-off-by: Mel Gorman <mel@....ul.ie>
--- 
 mm/page_alloc.c |    8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 5d714f8..8449cf9 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1539,6 +1539,10 @@ should_alloc_retry(gfp_t gfp_mask, unsigned int order,
 	if (gfp_mask & __GFP_NORETRY)
 		return 0;
 
+	/* Do not loop if this process has been OOM-killed */
+	if (test_thread_flag(TIF_MEMDIE))
+		return 0;
+
 	/*
 	 * In this implementation, order <= PAGE_ALLOC_COSTLY_ORDER
 	 * means __GFP_NOFAIL, but that may not be true in other
@@ -1823,6 +1827,10 @@ rebalance:
 						!(gfp_mask & __GFP_NOFAIL))
 				goto nopage;
 
+			/* Do not loop if this process has been OOM-killed */
+			if (test_thread_flag(TIF_MEMDIE))
+				goto nopage;
+
 			goto restart;
 		}
 	}
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/