[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-Id: <20110714115913.cf8d1b9d.kamezawa.hiroyu@jp.fujitsu.com>
Date: Thu, 14 Jul 2011 11:59:13 +0900
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@...fujitsu.com>
To: KAMEZAWA Hiroyuki <kamezawa.hiroyu@...fujitsu.com>
Cc: Michal Hocko <mhocko@...e.cz>, linux-mm@...ck.org,
Balbir Singh <bsingharora@...il.com>,
Daisuke Nishimura <nishimura@....nes.nec.co.jp>,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH 1/2] memcg: make oom_lock 0 and 1 based rather than
coutner
On Thu, 14 Jul 2011 10:02:59 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@...fujitsu.com> wrote:
> On Wed, 13 Jul 2011 13:05:49 +0200
> Michal Hocko <mhocko@...e.cz> wrote:
>
> > 867578cb "memcg: fix oom kill behavior" introduced oom_lock counter
> > which is incremented by mem_cgroup_oom_lock when we are about to handle
> > memcg OOM situation. mem_cgroup_handle_oom falls back to a sleep if
> > oom_lock > 1 to prevent from multiple oom kills at the same time.
> > The counter is then decremented by mem_cgroup_oom_unlock called from the
> > same function.
> >
> > This works correctly but it can lead to serious starvations when we
> > have many processes triggering OOM.
> >
> > Consider a process (call it A) which gets the oom_lock (the first one
> > that got to mem_cgroup_handle_oom and grabbed memcg_oom_mutex). All
> > other processes are blocked on the mutex.
> > While A releases the mutex and calls mem_cgroup_out_of_memory others
> > will wake up (one after another) and increase the counter and fall into
> > sleep (memcg_oom_waitq). Once A finishes mem_cgroup_out_of_memory it
> > takes the mutex again and decreases oom_lock and wakes other tasks (if
> > releasing memory of the killed task hasn't done it yet).
> > The main problem here is that everybody still race for the mutex and
> > there is no guarantee that we will get counter back to 0 for those
> > that got back to mem_cgroup_handle_oom. In the end the whole convoy
> > in/decreases the counter but we do not get to 1 that would enable
> > killing so nothing useful is going on.
> > The time is basically unbounded because it highly depends on scheduling
> > and ordering on mutex.
> >
>
> Hmm, ok, I see the problem.
>
>
> > This patch replaces the counter by a simple {un}lock semantic. We are
> > using only 0 and 1 to distinguish those two states.
> > As mem_cgroup_oom_{un}lock works on the hierarchy we have to make sure
> > that we cannot race with somebody else which is already guaranteed
> > because we call both functions with the mutex held. All other consumers
> > just read the value atomically for a single group which is sufficient
> > because we set the value atomically.
> > The other thing is that only that process which locked the oom will
> > unlock it once the OOM is handled.
> >
> > Signed-off-by: Michal Hocko <mhocko@...e.cz>
> > ---
> > mm/memcontrol.c | 24 +++++++++++++++++-------
> > 1 files changed, 17 insertions(+), 7 deletions(-)
> >
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index e013b8e..f6c9ead 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -1803,22 +1803,31 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
> > /*
> > * Check OOM-Killer is already running under our hierarchy.
> > * If someone is running, return false.
> > + * Has to be called with memcg_oom_mutex
> > */
> > static bool mem_cgroup_oom_lock(struct mem_cgroup *mem)
> > {
> > - int x, lock_count = 0;
> > + int x, lock_count = -1;
> > struct mem_cgroup *iter;
> >
> > for_each_mem_cgroup_tree(iter, mem) {
> > - x = atomic_inc_return(&iter->oom_lock);
> > - lock_count = max(x, lock_count);
> > + x = !!atomic_add_unless(&iter->oom_lock, 1, 1);
> > + if (lock_count == -1)
> > + lock_count = x;
> > +
>
>
> Hmm...Assume following hierarchy.
>
> A
> B C
> D E
>
> The orignal code hanldes the situation
>
> 1. B-D-E is under OOM
> 2. A enters OOM after 1.
>
> In original code, A will not invoke OOM (because B-D-E oom will kill a process.)
> The new code invokes A will invoke new OOM....right ?
>
> I wonder this kind of code
> ==
> bool success = true;
> ...
> for_each_mem_cgroup_tree(iter, mem) {
> success &= !!atomic_add_unless(&iter->oom_lock, 1, 1);
> /* "break" loop is not allowed because of css refcount....*/
> }
> return success.
> ==
> Then, one hierarchy can invoke one OOM kill within it.
> But this will not work because we can't do proper unlock.
>
>
> Hm. how about this ? This has only one lock point and we'll not see the BUG.
> Not tested yet..
>
Here, tested patch + test program. this seems to work well.
==
>From 8c878b3413b4d796844dbcb18fa7cfccf44860d7 Mon Sep 17 00:00:00 2001
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@...fujitsu.com>
Date: Thu, 14 Jul 2011 11:36:50 +0900
Subject: [PATCH] memcg: fix livelock at oom.
867578cb "memcg: fix oom kill behavior" introduced oom_lock counter
which is incremented by mem_cgroup_oom_lock when we are about to handle
memcg OOM situation. mem_cgroup_handle_oom falls back to a sleep if
oom_lock > 1 to prevent from multiple oom kills at the same time.
The counter is then decremented by mem_cgroup_oom_unlock called from the
same function.
This works correctly but it can lead to serious starvations when we
have many processes triggering OOM.
example.
Make a hierarchy of memcg, which has 300MB memory+swap limit.
%cgcreate -g memory:A
%cgset -r memory.use_hierarchy=1 A
%cgset -r memory.limit_in_bytes=300M A
%cgset -r memory.memsw.limit_in_bytes=300M A
%cgcreate -g memory:A/B
%cgcreate -g memory:A/C
%cgcreate -g memory:A/B/X
%cgcreate -g memory:A/B/Y
Then, running folloing program under A/B/X.
%cgexec -g memory:A/B/X ./fork
==
int main(int argc, char *argv[])
{
int i;
int status;
for (i = 0; i < 5000; i++) {
if (fork() == 0) {
char *c;
c = malloc(1024*1024);
memset(c, 0, 1024*1024);
sleep(20);
fprintf(stderr, "[%d]\n", i);
exit(0);
}
printf("%d\n", i);
waitpid(-1, &status, WNOHANG);
}
while (1) {
int ret;
ret = waitpid(-1, &status, WNOHANG);
if (ret == -1)
break;
if (!ret)
sleep(1);
}
return 0;
}
==
This forks a process and the child malloc(1M). Then, after forking 300
childrens, the memcg goes int OOM. Expected behavior is oom-killer
will kill process and make progress. But, 300 children will try to get
oom_lock and oom kill...and the program seems not to make progress.
The reason is that memcg invokes OOM-Kill when the counter oom_lock is 0.
But if many process runs, it never goes down to 0 because it's incremanted
before all processes quits sleep by previous oom-lock, which decremetns
oom_lock.
This patch fixes the behavior. This patch makes the oom-hierarchy should
have only one lock value 1/0. For example, in following hierarchy,
A
/
B
/ \
C D
When C goes into OOM because of limit by B, set B->oom_lock=1
After that, when A goes into OOM because of limit by A,
clear B->oom_lock as 0 and set A->oom_lock=1.
At unlocking, the ancestor which has ()->oom_lock=1 will be cleared.
After this, above program will do fork 5000 procs with 4000+ oom-killer.
Reported-by: Michal Hocko <mhocko@...e.cz>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@...fujitsu.com>
Changelog:
- fixed under_oom counter reset.
---
mm/memcontrol.c | 77 +++++++++++++++++++++++++++++++++++++-----------------
1 files changed, 53 insertions(+), 24 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e013b8e..5f9661b 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -246,7 +246,8 @@ struct mem_cgroup {
* Should the accounting and control be hierarchical, per subtree?
*/
bool use_hierarchy;
- atomic_t oom_lock;
+ int oom_lock;
+ atomic_t under_oom;
atomic_t refcnt;
unsigned int swappiness;
@@ -1801,36 +1802,63 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
}
/*
- * Check OOM-Killer is already running under our hierarchy.
+ * Check whether OOM-Killer is already running under our hierarchy.
* If someone is running, return false.
*/
-static bool mem_cgroup_oom_lock(struct mem_cgroup *mem)
+static bool mem_cgroup_oom_lock(struct mem_cgroup *memcg)
{
- int x, lock_count = 0;
struct mem_cgroup *iter;
+ bool ret;
- for_each_mem_cgroup_tree(iter, mem) {
- x = atomic_inc_return(&iter->oom_lock);
- lock_count = max(x, lock_count);
+ /*
+ * If an ancestor (including this memcg) is the owner of OOM Lock.
+ * return false;
+ */
+ for (iter = memcg; iter != NULL; iter = parent_mem_cgroup(iter)) {
+ if (iter->oom_lock)
+ break;
+ if (!iter->use_hierarchy) {
+ iter = NULL;
+ break;
+ }
}
- if (lock_count == 1)
- return true;
- return false;
+ if (iter)
+ return false;
+ /*
+ * Make the owner of OOM lock to be the highest ancestor of hierarchy
+ * under OOM. IOW, move children's OOM owner information to this memcg
+ * if a child is owner. In this case, an OOM killer is running and
+ * we return false. But make this memcg as owner of oom-lock.
+ */
+ ret = true;
+ for_each_mem_cgroup_tree(iter, memcg) {
+ if (iter->oom_lock) {
+ iter->oom_lock = 0;
+ ret = false;
+ }
+ atomic_set(&iter->under_oom, 1);
+ }
+ /* Make this memcg as the owner of OOM lock. */
+ memcg->oom_lock = 1;
+ return ret;
}
-static int mem_cgroup_oom_unlock(struct mem_cgroup *mem)
+static void mem_cgroup_oom_unlock(struct mem_cgroup *memcg)
{
- struct mem_cgroup *iter;
+ struct mem_cgroup *iter, *iter2;
- /*
- * When a new child is created while the hierarchy is under oom,
- * mem_cgroup_oom_lock() may not be called. We have to use
- * atomic_add_unless() here.
- */
- for_each_mem_cgroup_tree(iter, mem)
- atomic_add_unless(&iter->oom_lock, -1, 0);
- return 0;
+ for (iter = memcg; iter != NULL; iter = parent_mem_cgroup(iter)) {
+ if (iter->oom_lock) {
+ iter->oom_lock = 0;
+ break;
+ }
+ }
+ BUG_ON(!iter);
+
+ for_each_mem_cgroup_tree(iter2, iter)
+ atomic_set(&iter2->under_oom, 0);
+ return;
}
@@ -1875,7 +1903,7 @@ static void memcg_wakeup_oom(struct mem_cgroup *mem)
static void memcg_oom_recover(struct mem_cgroup *mem)
{
- if (mem && atomic_read(&mem->oom_lock))
+ if (mem && atomic_read(&mem->under_oom))
memcg_wakeup_oom(mem);
}
@@ -1916,7 +1944,8 @@ bool mem_cgroup_handle_oom(struct mem_cgroup *mem, gfp_t mask)
finish_wait(&memcg_oom_waitq, &owait.wait);
}
mutex_lock(&memcg_oom_mutex);
- mem_cgroup_oom_unlock(mem);
+ if (locked)
+ mem_cgroup_oom_unlock(mem);
memcg_wakeup_oom(mem);
mutex_unlock(&memcg_oom_mutex);
@@ -4584,7 +4613,7 @@ static int mem_cgroup_oom_register_event(struct cgroup *cgrp,
list_add(&event->list, &memcg->oom_notify);
/* already in OOM ? */
- if (atomic_read(&memcg->oom_lock))
+ if (atomic_read(&memcg->under_oom))
eventfd_signal(eventfd, 1);
mutex_unlock(&memcg_oom_mutex);
@@ -4619,7 +4648,7 @@ static int mem_cgroup_oom_control_read(struct cgroup *cgrp,
cb->fill(cb, "oom_kill_disable", mem->oom_kill_disable);
- if (atomic_read(&mem->oom_lock))
+ if (atomic_read(&mem->under_oom))
cb->fill(cb, "under_oom", 1);
else
cb->fill(cb, "under_oom", 0);
--
1.7.4.1
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Powered by blists - more mailing lists