linux-kernel - Re: user defined OOM policies

Message-ID: <alpine.DEB.2.02.1311201917520.7167@chino.kir.corp.google.com>
Date:	Wed, 20 Nov 2013 19:33:00 -0800 (PST)
From:	David Rientjes <rientjes@...gle.com>
To:	Michal Hocko <mhocko@...e.cz>
cc:	linux-mm@...ck.org, Greg Thelen <gthelen@...gle.com>,
	Glauber Costa <glommer@...il.com>,
	Mel Gorman <mgorman@...e.de>,
	Andrew Morton <akpm@...ux-foundation.org>,
	Johannes Weiner <hannes@...xchg.org>,
	KOSAKI Motohiro <kosaki.motohiro@...fujitsu.com>,
	Rik van Riel <riel@...hat.com>,
	Joern Engel <joern@...fs.org>, Hugh Dickins <hughd@...gle.com>,
	LKML <linux-kernel@...r.kernel.org>
Subject: Re: user defined OOM policies

On Wed, 20 Nov 2013, Michal Hocko wrote:

> > Not sure it's hard if you have per-memcg memory reserves which I've 
> > brought up in the past with true and complete kmem accounting.  Even if 
> > you don't allocate slab, it guarantees that there will be at least a 
> > little excess memory available so that the userspace oom handler isn't oom 
> > itself.
> > This involves treating processes waiting on memory.oom_control as a
> > special class
> 
> How do you identify such a process?
> 

Unless there's a better suggestion, the registration is done in process
context, so we can add a list_head to struct task_struct to identify this
special class.  While memcg->under_oom is set, attempts to move this class
of processes to other memcgs fail with -EBUSY.  I'm thinking the
"precharge" allocation would be done with a separate res_counter but still
accounted for in RES_USAGE: when the first process registers for
memory.oom_control, charge memory.oom_precharge_in_bytes to RES_USAGE, and
then on bypass account toward the new res_counter.  I believe this is the
cleanest interface, since the memcg assumes the cost of the memory
reserves up front.  The reserve would default to 0 and require the owner
to configure memory.oom_precharge_in_bytes for it to be used (I think we'd
use a value of 2MB).

> > Why would there be a hang if the userspace oom handlers aren't actually 
> > oom themselves as described above?
> 
> Because all the reserves might be depleted.
> 

It requires a high enough memory.oom_precharge_in_bytes, and anything that
registers for notification would presumably add in what it requires (we'd
probably only have one such oom handler per memcg).  In the worst case,
where userspace is misconfigured, memory.oom_delay_millisecs eventually
resolves the situation for us.

The root memcg remains under the control of root and a system oom handler 
would need PF_MEMALLOC to allocate into reserves up to a sane limit (and 
we can cap the root memcg's precharge to something like 1/16 of 
reserves?).

> > I'd suggest against the other two suggestions because hierarchical 
> > per-memcg userspace oom handlers are very powerful and can be useful 
> > without actually killing anything at all, and parent oom handlers can 
> > signal child oom handlers to free memory in oom conditions (in other 
> > words, defer a parent oom condition to a child's oom handler upon 
> > notification). 
> 
> OK, but what about those who are not using memcg and need a similar
> functionality? Are there any, btw?
> 

We needed it for cpusets before we migrated to memcg.  Are you concerned
about the overhead of CONFIG_MEMCG?  Otherwise, just enable it and use it
in parallel with cpusets, or only for the entire system if you aren't
otherwise using memcg.

I don't know of anybody else who has these requirements, but Google
requires the callbacks to userspace so that our malloc() implementation
can free unneeded arena memory and so that we can enforce memcg
priority-based scoring selection.

> > I was planning on writing a liboom library that would lay 
> > the foundation for how this was supposed to work and some generic 
> > functions that make use of the per-memcg memory reserves.
> >
> > So my plan for the complete solution was:
> > 
> >  - allow userspace notification from the root memcg on system oom 
> >    conditions,
> > 
> >  - implement a memory.oom_delay_millisecs timeout so that the kernel
> >    eventually intervenes if userspace fails to respond for whatever
> >    reason, including for system oom conditions; it would be set to 0
> >    if no userspace oom handler is registered for the notification, and
> 
> One thing I really dislike about timeout is that there is no easy way to
> find out which value is safe.

We tend to use the high side of what we expect; we've been using 10s for
four or five years now, going back to when we used cpusets.

> It might be easier for well controlled
> environments where you know what the load is and how it behaves. How does
> an ordinary user know which number to put there without risking a race
> where the userspace just doesn't respond in time?
> 

It's always high; we use it only as a last resort.  Userspace oom handlers
that only want to do heap analysis or logging, for example, can set it to
10s, do what they need, then write 0 to immediately defer to the kernel.
10s should certainly be adequate for any sane userspace oom handler.
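
The set-then-defer pattern just described could look roughly like the
sketch below.  memory.oom_delay_millisecs is the file proposed in this
thread, not an existing interface, and handle_oom() and its parameters are
hypothetical names used for illustration.

```python
import os


def handle_oom(memcg_dir, work, delay_ms=10_000):
    """Arm the kernel fallback timeout, run work(), then defer to the kernel.

    memory.oom_delay_millisecs is the file proposed in this thread; it is
    an assumption here, not an interface known to exist in the kernel.
    """
    delay_file = os.path.join(memcg_dir, "memory.oom_delay_millisecs")
    with open(delay_file, "w") as f:
        f.write(str(delay_ms))      # kernel waits this long before acting
    try:
        work()                      # e.g. heap analysis or logging
    finally:
        with open(delay_file, "w") as f:
            f.write("0")            # done: hand the oom back immediately
```

Writing 0 in a finally block means the kernel takes over even if the
handler's own work raises, rather than waiting out the full delay.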

> >  - implement per-memcg reserves as described above so that userspace oom 
> >    handlers have access to memory even in oom conditions as an upfront
> >    charge and have the ability to free memory as necessary.
> 
> This has a similar issue as above. How to estimate the size of the
> reserve? How to make such a reserve stable over different kernel
> versions where the same query might consume more memory?
> 

Again, we tend to go on the high side, and I'd recommend something like
2MB at the most.  A userspace oom handler will only want basic
functionality anyway, like dumping heaps to a file, reading the "tasks"
file, grabbing rss values, etc.  Keep in mind that the oom precharge is
only what is allowed to be allocated at oom time; everything else can be
mlocked into memory and already charged to the memcg before it registers
for memory.oom_control.

> > We already have the ability to do the actual kill from userspace: both
> > the system oom killer and the memcg oom killer grant access to memory
> > reserves for any process needing to allocate memory if it has a pending
> > SIGKILL, which we can send from userspace.
> 
> Yes, the killing part is not a problem; the selection is the hard one.
> 

Agreed, and I think the big downside of doing it with the loadable module
suggestion is that you can't implement such a wide variety of different
policies in modules.  Each of our users who own a memcg tree on our
systems may want to have their own policy, and they can't load a module at
runtime or ship one with the kernel.
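
As an illustration of how small a per-user policy can be when it lives in
userspace, here is one possible selection function.  The policy itself
(lowest priority first, then largest rss) and the priority mapping are
assumptions made up for this sketch, not the scoring Google uses.

```python
import os
import signal


def read_tasks(memcg_dir):
    """Return the pids listed in the memcg's "tasks" file."""
    with open(os.path.join(memcg_dir, "tasks")) as f:
        return [int(line) for line in f if line.strip()]


def rss_pages(pid):
    """Resident set size in pages: second field of /proc/<pid>/statm."""
    with open(f"/proc/{pid}/statm") as f:
        return int(f.read().split()[1])


def select_victim(pids, priority, rss=rss_pages):
    """Hypothetical policy: lowest priority first, then largest rss.

    priority maps pid -> int (higher means more important); pids missing
    from the mapping are treated as priority 0.
    """
    return max(pids, key=lambda p: (-priority.get(p, 0), rss(p)), default=None)


def kill_victim(pid):
    # A pending SIGKILL is what grants the victim access to memory reserves.
    os.kill(pid, signal.SIGKILL)
```

A different memcg owner could swap in a different select_victim() without
touching the kernel, which is the point being made above.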