linux-kernel - Re: [patch 00/11] userspace out of memory handling

Message-ID: <alpine.DEB.2.02.1403051831100.30075@chino.kir.corp.google.com>
Date:	Wed, 5 Mar 2014 18:52:22 -0800 (PST)
From:	David Rientjes <rientjes@...gle.com>
To:	Andrew Morton <akpm@...ux-foundation.org>
cc:	Johannes Weiner <hannes@...xchg.org>,
	Michal Hocko <mhocko@...e.cz>,
	KAMEZAWA Hiroyuki <kamezawa.hiroyu@...fujitsu.com>,
	Christoph Lameter <cl@...ux-foundation.org>,
	Pekka Enberg <penberg@...nel.org>, Tejun Heo <tj@...nel.org>,
	Mel Gorman <mgorman@...e.de>, Oleg Nesterov <oleg@...hat.com>,
	Rik van Riel <riel@...hat.com>,
	Jianguo Wu <wujianguo@...wei.com>,
	Tim Hockin <thockin@...gle.com>, linux-kernel@...r.kernel.org,
	linux-mm@...ck.org, cgroups@...r.kernel.org,
	linux-doc@...r.kernel.org
Subject: Re: [patch 00/11] userspace out of memory handling

On Wed, 5 Mar 2014, Andrew Morton wrote:

> > This patchset introduces a standard interface through memcg that allows
> > both of these conditions to be handled in the same clean way: users
> > define memory.oom_reserve_in_bytes to define the reserve and this
> > amount is allowed to be overcharged to the process handling the oom
> > condition's memcg.  If used with the root memcg, this amount is allowed
> > to be allocated below the per-zone watermarks for root processes that
> > are handling such conditions (only root may write to
> > cgroup.event_control for the root memcg).
> 
> If process A is trying to allocate memory, cannot do so and the
> userspace oom-killer is invoked, there must be means via which process
> A waits for the userspace oom-killer's action.

It does so by relooping in the page allocator, waiting for memory to be 
freed, just as it would if the kernel oom killer had been called and 
process A were waiting for the oom kill victim, process B, to exit.  We 
don't have the ability to put it on a waitqueue because we don't touch the 
freeing hotpath.  The userspace oom handler may not necessarily kill 
anything; it may be able to free its own memory and start throttling other 
processes, for example.

> And there must be
> fallbacks which occur if the userspace oom killer fails to clear the
> oom condition, or times out.
> 

I agree completely and proposed this before as memory.oom_delay_millisecs 
at http://lwn.net/Articles/432226, which we use internally when memory 
can't be freed or a memcg's limit cannot be expanded.  I guess it makes 
more sense alongside the rest of this patchset now; I can add it as an 
additional patch next time around.

> Would be interested to see a description of how all this works.
> 

There's an article for LWN also being developed on this topic.  As 
mentioned in that article, I think it would be best to generalize a lot of 
the common functions and the eventfd handling entirely into a library.  
I've attached an example implementation that just invokes a function to 
handle the situation.
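
To be clear, the following is not the attached liboom.c, only a minimal 
sketch of the "wait for the notification, then invoke a function" shape 
described above.  It assumes the notification arrives on an eventfd; the 
names oom_handler_fn, oom_wait_and_handle and oom_count_events are 
hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Hypothetical callback type: called once per wakeup of the eventfd. */
typedef void (*oom_handler_fn)(uint64_t events, void *arg);

/* Example handler: only accumulates the number of notifications seen. */
static void oom_count_events(uint64_t events, void *arg)
{
	*(uint64_t *)arg += events;
}

/*
 * Block until the eventfd is signalled, then invoke the handler once.
 * read() on an eventfd returns its 8-byte counter and resets it to zero.
 * Returns 0 on success, -1 if the read fails.
 */
static int oom_wait_and_handle(int efd, oom_handler_fn handler, void *arg)
{
	uint64_t events;
	ssize_t n;

	assert(handler != NULL);

	do {
		n = read(efd, &events, sizeof(events));
	} while (n < 0 && errno == EINTR);	/* retry if interrupted */

	if (n != (ssize_t)sizeof(events))
		return -1;

	handler(events, arg);
	return 0;
}
```

A real handler would call oom_wait_and_handle() in a loop and would avoid 
allocating more memory than its reserve allows while handling the event.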

For Google's usecase specifically, at the root memcg level (system oom) we 
want to do priority based memcg killing.  We want to kill from within a 
memcg hierarchy that has the lowest priority relative to other memcgs.  
This cannot be implemented with /proc/pid/oom_score_adj today.  Those 
priorities may also change depending on whether a memcg hierarchy is 
"overlimit", i.e. its limit has been increased temporarily because it has 
hit a memcg oom and additional memory is readily available on the system.
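
As a rough illustration of priority based selection, here is a sketch of 
how a userspace handler might pick a victim memcg.  Everything in it is an 
assumption made for the example: the struct, the convention that a lower 
value means "kill first", and the rule that overlimit hierarchies are 
considered before all others.  It is one possible policy, not the one we 
run.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical description of one memcg hierarchy seen by the handler. */
struct memcg_candidate {
	const char *path;	/* cgroup directory (illustrative) */
	int priority;		/* assumption: lower value is killed first */
	bool overlimit;		/* limit was temporarily raised after a memcg oom */
};

/*
 * Pick the memcg to kill from.  Assumed policy: any overlimit memcg is
 * preferred over any that is not; within the same class the lowest
 * priority value wins.  Returns NULL if there are no candidates.
 */
static const struct memcg_candidate *
pick_victim_memcg(const struct memcg_candidate *c, size_t n)
{
	const struct memcg_candidate *best = NULL;

	assert(c != NULL || n == 0);

	for (size_t i = 0; i < n; i++) {
		if (!best) {
			best = &c[i];
			continue;
		}
		if (c[i].overlimit != best->overlimit) {
			/* Different classes: the overlimit one wins. */
			if (c[i].overlimit)
				best = &c[i];
			continue;
		}
		if (c[i].priority < best->priority)
			best = &c[i];
	}
	return best;
}
```

The point is only that such a policy is a few lines of userspace code, 
while expressing it through /proc/pid/oom_score_adj is not possible.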

So why not just introduce a memcg tunable that specifies a priority?  
Well, it's not that simple.  Other users will want to implement different 
policies on system oom (think about things like existing panic_on_oom or 
oom_kill_allocating_task sysctls).  I introduced oom_kill_allocating_task 
originally for SGI because they wanted a fast oom kill rather than an 
expensive tasklist scan: the allocating task itself is rather irrelevant, 
it was just the unlucky task that was allocating at the moment that oom 
was triggered.  What's guaranteed is that killing current in that case 
will always free memory from under oom (it's not a member of some other 
mempolicy or cpuset that would be needlessly killed).  Both sysctls could 
trivially be reimplemented in userspace with this feature.

I have other customers who don't run in a memcg environment at all; they 
simply reattach all processes to root and delete all other memcgs.  These 
customers are only concerned about system oom conditions and want to do 
something "interesting" before a process is killed.  Some want to log the 
VM statistics as an artifact to examine later, some want to examine heap 
profiles, others can start throttling and freeing memory rather than kill 
anything.  All of this is impossible today because the kernel oom killer 
will simply kill something immediately and any stats we collect afterwards 
don't represent the oom condition.  The heap profiles are lost, throttling 
is useless, etc.
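
For the "log the VM statistics" case, the handler's job can be as small as 
copying a file such as /proc/meminfo or /proc/vmstat somewhere persistent 
before anything is killed.  A minimal sketch follows; the helper name 
snapshot_file and the destination path in the usage note are made up for 
the example.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Copy the contents of src to dst so the state at oom time can be examined
 * later.  Returns the number of bytes copied, or -1 on any error.
 */
static long snapshot_file(const char *src, const char *dst)
{
	char buf[4096];		/* fixed stack buffer for the copy loop */
	size_t n;
	long total = 0;
	FILE *in, *out;

	assert(src != NULL && dst != NULL);

	in = fopen(src, "r");
	if (!in)
		return -1;
	out = fopen(dst, "w");
	if (!out) {
		fclose(in);
		return -1;
	}

	while ((n = fread(buf, 1, sizeof(buf), in)) > 0) {
		if (fwrite(buf, 1, n, out) != n) {
			total = -1;
			break;
		}
		total += (long)n;
	}
	if (ferror(in))
		total = -1;

	fclose(in);
	if (fclose(out) != 0)	/* flush errors surface here */
		total = -1;
	return total;
}
```

A handler could call snapshot_file("/proc/vmstat", "/var/log/oom-vmstat") 
on notification and only then decide whether to kill, throttle or free.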

Jianguo (cc'd) may also have usecases not described here.

> It is unfortunate that this feature is memcg-only.  Surely it could
> also be used by non-memcg setups.  Would like to see at least a
> detailed description of how this will all be presented and implemented.
> We should aim to make the memcg and non-memcg userspace interfaces and
> user-visible behaviour as similar as possible.
> 

It's memcg only because it can handle both system and memcg oom conditions 
with the same clean interface.  It would be possible to implement only 
system oom condition handling through procfs (a little sloppy since it 
needs to register the eventfd), but then a userspace oom handler would need 
to determine which interface to use based on whether it was running in a 
memcg or non-memcg environment.  I implemented this feature with userspace 
in mind: I didn't want it to need two different implementations to do the 
same thing depending on memcg.  The way it is written, a userspace oom 
handler does not know (nor does it need to care) whether it is constrained 
by the amount of system RAM or a memcg limit.  It can simply write the 
reserve to its memcg's memory.oom_reserve_in_bytes, attach to 
memory.oom_control and be done.
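
Those two steps can be sketched as follows.  memory.oom_reserve_in_bytes 
only exists with this patchset applied; attaching to memory.oom_control 
uses the existing memcg eventfd registration, i.e. writing 
"<event_fd> <control_fd>" to cgroup.event_control.  The function and struct 
names are hypothetical, and the memcg directory is passed in by the caller 
rather than discovered.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/eventfd.h>
#include <unistd.h>

struct oom_registration {
	int event_fd;		/* becomes readable on oom notification */
	int control_fd;		/* open fd on memory.oom_control, kept open */
};

/* Write a string to <dir>/<file>; returns 0 on success, -1 on error. */
static int write_str(const char *dir, const char *file, const char *val)
{
	char path[4096];
	size_t len = strlen(val);
	int fd, ok;

	if (snprintf(path, sizeof(path), "%s/%s", dir, file) >= (int)sizeof(path))
		return -1;
	fd = open(path, O_WRONLY);
	if (fd < 0)
		return -1;
	ok = write(fd, val, len) == (ssize_t)len;
	close(fd);
	return ok ? 0 : -1;
}

/* Set the reserve, then register an eventfd against memory.oom_control. */
static int oom_register(const char *memcg_dir, unsigned long long reserve,
			struct oom_registration *reg)
{
	char buf[64], path[4096];

	assert(memcg_dir != NULL && reg != NULL);

	snprintf(buf, sizeof(buf), "%llu", reserve);
	if (write_str(memcg_dir, "memory.oom_reserve_in_bytes", buf) < 0)
		return -1;

	if (snprintf(path, sizeof(path), "%s/memory.oom_control", memcg_dir)
	    >= (int)sizeof(path))
		return -1;
	reg->control_fd = open(path, O_RDONLY);
	if (reg->control_fd < 0)
		return -1;

	reg->event_fd = eventfd(0, 0);
	if (reg->event_fd < 0) {
		close(reg->control_fd);
		return -1;
	}

	/* Registration format: "<event_fd> <control_fd>". */
	snprintf(buf, sizeof(buf), "%d %d", reg->event_fd, reg->control_fd);
	if (write_str(memcg_dir, "cgroup.event_control", buf) < 0) {
		close(reg->event_fd);
		close(reg->control_fd);
		return -1;
	}
	return 0;
}
```

The same call works whether memcg_dir is the root memcg (system oom) or a 
child memcg, which is the reason for keeping a single interface.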

This does mean that memcg needs to be enabled for the support, though.  
This is already done on most distributions; the cgroup just needs to be 
mounted.  Would it be better to duplicate the interface in two different 
spots depending on CONFIG_MEMCG?  I didn't think so, and I think the idea 
of a userspace library that takes care of this registration (and mounting, 
perhaps) proposed on LWN would be the best of both worlds.

> Patches 1, 2, 3 and 5 appear to be independent and useful so I think
> I'll cherrypick those, OK?
> 

Ok!  I'm hoping that the PF_MEMPOLICY bit that is removed in those patches 
is at least temporarily reserved for PF_OOM_HANDLER introduced here; I 
removed it purposefully :)
View attachment "liboom.c" of type "TEXT/x-csrc" (2431 bytes)