linux-kernel - Re: [patch] mm: memcg: do not declare OOM from __GFP

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <alpine.DEB.2.02.1312021452510.13465@chino.kir.corp.google.com>
Date:	Mon, 2 Dec 2013 15:02:09 -0800 (PST)
From:	David Rientjes <rientjes@...gle.com>
To:	Michal Hocko <mhocko@...e.cz>
cc:	Johannes Weiner <hannes@...xchg.org>,
	Andrew Morton <akpm@...ux-foundation.org>, linux-mm@...ck.org,
	cgroups@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [patch] mm: memcg: do not declare OOM from __GFP_NOFAIL
 allocations

On Mon, 2 Dec 2013, Michal Hocko wrote:

> > > What if the callers simply cannot deal with the allocation failure?
> > > 84235de394d97 (fs: buffer: move allocation failure loop into the
> > > allocator) describes one such case when __getblk_slow tries desperately
> > > to grow buffers relying on the reclaim to free something. As there might
> > > be no reclaim going on we are screwed.
> > > 
> > 
> > My suggestion is to spin, not return NULL. 
> 
> Spin on which level? The whole point of this change was to not spin for
> ever because the caller might sit on top of other locks which might
> prevent somebody else to die although it has been killed.
> 

See my question about the non-memcg page allocator behavior below.

> > Bypassing to the root memcg 
> > can lead to a system oom condition whereas if memcg weren't involved at 
> > all the page allocator would just spin (because of !__GFP_FS).
> 
> I am confused now. The page allocation has already happened at the time
> we are doing the charge. So the global OOM would have happened already.
> 

That's precisely the point, the successful charges can allow additional 
page allocations to occur and cause system oom conditions if you don't 
have memcg isolation.  Some customers, including us, use memcg to ensure 
that a set of processes cannot use more resources than allowed.  Any 
bypass opens up the possibility of additional memory allocations that 
cause the system to be oom and then we end up requiring a userspace oom 
handler because our policy is complex enough that it cannot be effected 
simply by /proc/pid/oom_score_adj.

I'm not quite sure how significant of a point this is, though, because it 
depends on the caller doing the __GFP_NOFAIL allocations that allow the 
bypass.  If you're doing

	for (i = 0; i < 1 << 20; i++)
		page[i] = alloc_page(GFP_NOFS | __GFP_NOFAIL);

it can become significant, but I'm unsure of how much memory all callers 
end up allocating in this context.

> > > That being said, while I do agree with you that we should strive for
> > > isolation as much as possible there are certain cases when this is
> > > impossible to achieve without seeing much worse consequences. For now,
> > > we hope that __GFP_NOFAIL is used very scarcely.
> > 
> > If that's true, why not bypass the per-zone min watermarks in the page 
> > allocator as well to allow these allocations to succeed?
> 
> Allocations are already done. We simply cannot charge that allocation
> because we have reached the hard limit. And the said allocation might
> prevent OOM action to proceed due to held locks.

I'm referring to the generic non-memcg page allocator behavior.  Forget 
memcg for a moment.  What is the behavior in the _page_allocator_ for 
GFP_NOFS | __GFP_NOFAIL?  Do we spin forever if reclaim fails or do we 
bypas the per-zone min watermarks to allow it to allocate because "it 
needs to succeed, it may be holding filesystem locks"?

It's already been acknowledged in this thread that no bypassing is done 
in the page allocator and it just spins.  There's some handwaving saying 
that since the entire system is oom that there is a greater chance that 
memory will be freed by something else, but that's just handwaving and is 
certainly no guaranteed.

So, my question again: why not bypass the per-zone min watermarks in the 
page allocator?
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/