Message-ID: <alpine.DEB.2.02.1311261634040.21003@chino.kir.corp.google.com>
Date:	Tue, 26 Nov 2013 16:47:56 -0800 (PST)
From:	David Rientjes <rientjes@...gle.com>
To:	Johannes Weiner <hannes@...xchg.org>
cc:	Andrew Morton <akpm@...ux-foundation.org>,
	Mel Gorman <mgorman@...e.de>, Rik van Riel <riel@...hat.com>,
	linux-kernel@...r.kernel.org, linux-mm@...ck.org
Subject: Re: [patch] mm, vmscan: abort futile reclaim if we've been oom
 killed

On Thu, 21 Nov 2013, Johannes Weiner wrote:

> All I'm trying to do is find the broader root cause for the problem
> you are experiencing and find a solution that will leave us with
> maintainable code.  It does not matter how few instructions your fix
> adds, it changes the outcome of the algorithm and makes every
> developer trying to grasp the complexity of page reclaim think about
> yet another special condition.
> 
> The more specific the code is, the harder it will be to understand in
> the future.  Yes, it's a one-liner, but we've had death by a thousand
> cuts before, many times.  A few cycles ago, kswapd was blowing up left
> and right simply because it was trying to meet too many specific
> objectives from facilitating order-0 allocators, maintaining zone
> health, enabling compaction for higher order allocation, writing back
> dirty pages.  Ultimately, it just got stuck in endless loops because
> of conflicting conditionals.  We've had similar problems in the scan
> count calculation etc where all the checks and special cases left us
> with code that was impossible to reason about.  There really is a
> history of "low overhead one-liner fixes" eating us alive in the VM.
> 

Your objection is that the added code is obscure and will require kernel 
hackers to think about why it's there?  I could certainly add a comment:

	/*
	 * The oom killer only kills processes when reclaim has already
	 * failed for its allocation context, continuously trying won't
	 * help.
	 */

to the patch? 
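
For reference, the check that comment would document is the one-liner in
the direct reclaim path.  Roughly, as a sketch only (the exact placement
and return value are illustrative here, assuming TIF_MEMDIE is how the
oom-killed task is identified):

	/* see the comment above: reclaim already failed for this context */
	if (unlikely(test_thread_flag(TIF_MEMDIE)))
		return 1;	/* abort direct reclaim immediately */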

> The solution was always to take a step back and integrate all
> requirements properly.  Not only did this fix the problems, the code
> ended up being much more robust and easier to understand and modify as
> well.
> 
> If shortening the direct reclaim cycle is an adequate solution to your
> problem, it would be much preferable.  Because
> 
>   "checking at a reasonable interval if the work I'm doing is still
>    necessary"
> 
> is a much more approachable, generic, and intuitive concept than
> 
>   "the OOM killer has gone off, direct reclaim is futile, I should
>    exit quickly to release memory so that not more tasks get caught
>    doing direct reclaim".
> 
> and the fix would benefit a much wider audience.
> 

I agree with your point that obscure random fixes do obfuscate the VM in
many different ways, and I'd like to support you in any way that I can to
make sure that my fix doesn't do that for anybody in the future; adding
the comment may be one way of doing that.

I disagree with changing the "reasonable interval" at which we determine
whether reclaim is still necessary: many parallel reclaimers indicate a
higher demand for free memory, and the current behavior prevents
short-circuiting direct reclaim, returning to the page allocator, and
finding that the memory you've reclaimed has been stolen by another
allocator.
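
To make the race concrete, the slow path does roughly the following (a
simplified sketch, not the literal mm/page_alloc.c code, with most
arguments elided):

	retry:
		/* direct reclaim frees pages back to the zone freelists */
		progress = do_try_to_free_pages(zonelist, &sc);

		/*
		 * Window: any concurrent allocator can take the pages we
		 * just freed before we reach the freelists below.
		 */
		page = get_page_from_freelist(gfp_mask, nodemask, order, ...);
		if (!page && progress)
			goto retry;	/* our reclaimed memory was stolen */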

That race to allocate the reclaimed memory will always exist if we check
zone watermarks from direct reclaim to determine whether we should
terminate or not, as part of your suggestion; the alternative would be to
actually do get_page_from_freelist() and allocate on every iteration.
For the vast majority of reclaimers, I think this would terminate
prematurely when we haven't hit the SWAP_CLUSTER_MAX threshold, and given
the removal of lumpy reclaim and our reliance on synchronous memory
compaction following one of these oom events to defragment memory, I
would be tentative about implementing such a solution.
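
To illustrate what I mean, the watermark-based bailout being discussed
would look something like this inside the reclaim loop (again only a
sketch, the placement is mine, but zone_watermark_ok() is the existing
predicate):

	for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
		shrink_zone(zone, &sc);

		/*
		 * Bailing out as soon as the watermark is met can return
		 * with fewer than SWAP_CLUSTER_MAX pages reclaimed, which
		 * is the premature termination described above.
		 */
		if (zone_watermark_ok(zone, sc.order,
				      high_wmark_pages(zone), 0, 0))
			return;
	}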

> Lastly, as far as I know, you are the only reporter that noticed an
> issue with this loooooong-standing behavior, and you don't even run
> upstream kernels.  There really is no excuse to put up with a quick &
> dirty fix.
> 

I certainly do run upstream kernels, and I would agree that I may be one
of the only reporters of such an issue: Google probably hits more out of
memory conditions to enforce memory isolation than anybody else, so
stalls like this are more noticeable.

I would certainly add the comment if you feel it's necessary, though;
just let me know.
