Message-ID: <alpine.DEB.2.02.1311201917520.7167@chino.kir.corp.google.com>
Date: Wed, 20 Nov 2013 19:33:00 -0800 (PST)
From: David Rientjes <rientjes@...gle.com>
To: Michal Hocko <mhocko@...e.cz>
cc: linux-mm@...ck.org, Greg Thelen <gthelen@...gle.com>,
Glauber Costa <glommer@...il.com>,
Mel Gorman <mgorman@...e.de>,
Andrew Morton <akpm@...ux-foundation.org>,
Johannes Weiner <hannes@...xchg.org>,
KOSAKI Motohiro <kosaki.motohiro@...fujitsu.com>,
Rik van Riel <riel@...hat.com>,
Joern Engel <joern@...fs.org>, Hugh Dickins <hughd@...gle.com>,
LKML <linux-kernel@...r.kernel.org>
Subject: Re: user defined OOM policies
On Wed, 20 Nov 2013, Michal Hocko wrote:
> > Not sure it's hard if you have per-memcg memory reserves which I've
> > brought up in the past with true and complete kmem accounting. Even if
> > you don't allocate slab, it guarantees that there will be at least a
> > little excess memory available so that the userspace oom handler isn't oom
> > itself.
> > This involves treating processes waiting on memory.oom_control as a
> > special class
>
> How do you identify such a process?
>
Unless there's a better suggestion, the registration is done in process
context and we can add a list_head to struct task_struct to identify this
special class. While memcg->under_oom, prevent this class of processes
from moving to other memcgs by failing the move with -EBUSY. I'm thinking
the "precharge" allocation would be done with a separate res_counter but
still accounted for in RES_USAGE, i.e. when the first process registers
for memory.oom_control, charge memory.oom_precharge_in_bytes to RES_USAGE
and then on bypass account toward the new res_counter. I believe this
would be the cleanest interface: the memcg assumes the cost of the memory
reserves up front. The reserve would default to 0, so the owner has to
configure memory.oom_precharge_in_bytes for it to be used (I think we'd
use a value of 2MB).
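To make the accounting concrete, here is a rough userspace model of what I
have in mind. It is only a sketch, not kernel code: struct memcg_model and
both functions are made-up names for illustration, and the rule that only
registered handlers may bypass the limit is my assumption about how the
bypass would be gated.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical userspace model of the proposed accounting; not kernel code. */
struct memcg_model {
	uint64_t limit;           /* memory.limit_in_bytes */
	uint64_t usage;           /* RES_USAGE, includes the precharge once charged */
	uint64_t oom_precharge;   /* proposed memory.oom_precharge_in_bytes */
	uint64_t reserve_used;    /* separate counter for charges that bypass the limit */
	unsigned int nr_handlers; /* tasks registered for memory.oom_control */
};

/* The first registration charges the whole reserve to usage up front. */
static bool model_register_handler(struct memcg_model *m)
{
	if (m->nr_handlers == 0) {
		if (m->usage > m->limit ||
		    m->oom_precharge > m->limit - m->usage)
			return false; /* no room left to set the reserve aside */
		m->usage += m->oom_precharge;
	}
	m->nr_handlers++;
	return true;
}

/* Charge 'bytes'; registered handlers fall back to the reserve on failure. */
static bool model_charge(struct memcg_model *m, uint64_t bytes, bool is_handler)
{
	if (m->usage <= m->limit && bytes <= m->limit - m->usage) {
		m->usage += bytes;
		return true;
	}
	/* Bypass: usage is unchanged because the reserve was already charged. */
	if (is_handler && m->nr_handlers > 0 &&
	    bytes <= m->oom_precharge - m->reserve_used) {
		m->reserve_used += bytes;
		return true;
	}
	return false;
}
```

The point of the model is that RES_USAGE never moves when a handler dips
into the reserve, so the memcg pays for the reserve once, at registration.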
> > Why would there be a hang if the userspace oom handlers aren't actually
> > oom themselves as described above?
>
> Because all the reserves might be depleted.
>
It requires a high enough memory.oom_precharge_in_bytes and anything that
registers for notification would presumably add in what they require (we'd
probably only have one such oom handler per memcg). In the worst case,
where userspace is misconfigured, memory.oom_delay_millisecs eventually
resolves the situation for us.
The root memcg remains under the control of root and a system oom handler
would need PF_MEMALLOC to allocate into reserves up to a sane limit (and
we can cap the root memcg's precharge to something like 1/16 of
reserves?).
> > I'd suggest against the other two suggestions because hierarchical
> > per-memcg userspace oom handlers are very powerful and can be useful
> > without actually killing anything at all, and parent oom handlers can
> > signal child oom handlers to free memory in oom conditions (in other
> > words, defer a parent oom condition to a child's oom handler upon
> > notification).
>
> OK, but what about those who are not using memcg and need a similar
> functionality? Are there any, btw?
>
We needed it for cpusets before we migrated to memcg. Are you concerned
about the overhead of CONFIG_MEMCG? Otherwise, just enable it and use it
in parallel with cpusets, or only for the entire system if you aren't
otherwise using memcg.
I don't know of anybody else who has these requirements, but Google
requires the callbacks to userspace so that our malloc() implementation
can free unneeded arena memory and so that we can enforce memcg
priority-based scoring for victim selection.
> > I was planning on writing a liboom library that would lay
> > the foundation for how this was supposed to work and some generic
> > functions that make use of the per-memcg memory reserves.
> >
> > So my plan for the complete solution was:
> >
> > - allow userspace notification from the root memcg on system oom
> > conditions,
> >
> > - implement a memory.oom_delay_millisecs timeout so that the kernel
> > eventually intervenes if userspace fails to respond for whatever
> > reason, including for system oom conditions; it would be set to 0
> > if no userspace oom handler is registered for the notification, and
>
> One thing I really dislike about timeout is that there is no easy way to
> find out which value is safe.
We tend to use the high side of what we expect; we've been using 10s for
four or five years now, going back to when we used cpusets.
> It might be easier for well controlled
> environments where you know what the load is and how it behaves. How does
> an ordinary user know which number to put there without risking a race
> where the userspace just doesn't respond in time?
>
It's always high; we use it only as a last resort. Userspace oom handlers
that only want to do heap analysis or logging, for example, can set it to
10s, do what they need, then write 0 to immediately defer to the kernel.
10s should certainly be adequate for any sane userspace oom handler.
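As a minimal sketch of that flow from the handler's side: the helper names
below are hypothetical, and memory.oom_delay_millisecs is the file proposed
in this thread, so the caller has to supply its path and it will not exist
on a kernel without the patch.

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Write a millisecond value to the proposed memory.oom_delay_millisecs
 * file at 'path'. Returns 0 on success, -1 on error.
 */
static int set_oom_delay_ms(const char *path, unsigned int ms)
{
	char buf[16];
	int len = snprintf(buf, sizeof(buf), "%u", ms);
	int fd = open(path, O_WRONLY | O_TRUNC);
	ssize_t ret;

	if (fd < 0)
		return -1;
	ret = write(fd, buf, (size_t)len);
	close(fd);
	return ret == len ? 0 : -1;
}

/*
 * Called on oom notification: run the bounded work (heap analysis,
 * logging), then write 0 so the kernel oom killer takes over at once.
 */
static int handle_oom_event(const char *delay_path, void (*work)(void))
{
	if (work)
		work();
	return set_oom_delay_ms(delay_path, 0);
}
```

At setup time the handler would call set_oom_delay_ms(path, 10000) once;
if it then hangs inside work(), the kernel still steps in after 10s.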
> > - implement per-memcg reserves as described above so that userspace oom
> > handlers have access to memory even in oom conditions as an upfront
> > charge and have the ability to free memory as necessary.
>
> This has a similar issue as above. How to estimate the size of the
> reserve? How to make such a reserve stable over different kernel
> versions where the same query might consume more memory?
>
Again, we tend to go on the high side and I'd recommend something like 2MB
at the most. A userspace oom handler will only want to do basic things
anyway, like dumping heaps to a file, reading the "tasks" file, grabbing
rss values, etc. Keep in mind that the oom precharge only covers what is
allowed to be allocated at oom time; everything else can be mlocked into
memory and already charged to the memcg before it registers for
memory.oom_control.
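For illustration, the ordering could look roughly like this in a handler,
using the existing eventfd registration through cgroup.event_control. The
function names are hypothetical and cgroup_dir is whatever path the memcg
is mounted at; treat it as a sketch of the ordering, not a reference
implementation.

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/eventfd.h>
#include <sys/mman.h>
#include <unistd.h>

/* Build the "<event_fd> <fd of memory.oom_control>" registration string. */
static int format_oom_registration(char *buf, size_t len, int efd, int ofd)
{
	int n = snprintf(buf, len, "%d %d", efd, ofd);

	return (n > 0 && (size_t)n < len) ? n : -1;
}

/* Returns an eventfd that becomes readable on memcg oom, or -1 on error. */
static int register_oom_handler(const char *cgroup_dir)
{
	char path[4096], line[32];
	int efd, ofd, cfd, n;

	/* Lock everything first so it is charged before we register. */
	if (mlockall(MCL_CURRENT | MCL_FUTURE) < 0)
		return -1;

	efd = eventfd(0, EFD_CLOEXEC);
	if (efd < 0)
		return -1;

	snprintf(path, sizeof(path), "%s/memory.oom_control", cgroup_dir);
	ofd = open(path, O_RDONLY);
	snprintf(path, sizeof(path), "%s/cgroup.event_control", cgroup_dir);
	cfd = open(path, O_WRONLY);

	n = format_oom_registration(line, sizeof(line), efd, ofd);
	if (ofd < 0 || cfd < 0 || n < 0 || write(cfd, line, (size_t)n) != n) {
		close(efd);
		efd = -1;
	}
	if (cfd >= 0)
		close(cfd);
	/* ofd is deliberately left open for the handler's lifetime here. */
	if (efd < 0 && ofd >= 0)
		close(ofd);
	return efd;
}
```

The handler then blocks in read() on the returned eventfd, and whatever it
allocates after waking up is what the precharge has to cover.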
> > We already have the ability to do the actual kill from userspace: both the
> > system oom killer and the memcg oom killer grant access to memory
> > reserves for any process needing to allocate memory if it has a pending
> > SIGKILL, which we can send from userspace.
>
> Yes, the killing part is not a problem; the selection is the hard one.
>
Agreed, and I think the big downside of doing it with the loadable module
suggestion is that you can't implement such a wide variety of different
policies in modules. Each of our users who owns a memcg tree on our
systems may want to have their own policy, and they can't load a module at
runtime or ship one with the kernel.
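As an example of how small such a per-user policy can be in userspace,
here is one possible selection rule. It is purely illustrative: the
priority field, the "lowest priority first, then largest rss" ordering and
all names are assumptions for the sketch, not a description of the policy
we actually run.

```c
#include <signal.h>
#include <stddef.h>
#include <sys/types.h>

/* One killable task as seen by a hypothetical userspace policy. */
struct candidate {
	pid_t pid;
	int priority;            /* assumed: lower value is sacrificed first */
	unsigned long rss_pages; /* e.g. gathered from the "tasks" file and /proc */
};

/* Pick the lowest-priority task; break ties by largest rss. */
static const struct candidate *select_victim(const struct candidate *c, size_t n)
{
	const struct candidate *victim = NULL;
	size_t i;

	for (i = 0; i < n; i++) {
		if (!victim ||
		    c[i].priority < victim->priority ||
		    (c[i].priority == victim->priority &&
		     c[i].rss_pages > victim->rss_pages))
			victim = &c[i];
	}
	return victim;
}

/* The kill itself is just a SIGKILL sent from userspace. */
static int kill_victim(const struct candidate *victim)
{
	return victim ? kill(victim->pid, SIGKILL) : -1;
}
```

Swapping select_victim() for a different rule needs no kernel change and
no module, which is the flexibility a module-based approach can't give
each memcg owner.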