Message-ID: <alpine.DEB.2.00.1009011436390.29305@chino.kir.corp.google.com>
Date: Wed, 1 Sep 2010 15:06:20 -0700 (PDT)
From: David Rientjes <rientjes@...gle.com>
To: KOSAKI Motohiro <kosaki.motohiro@...fujitsu.com>
cc: LKML <linux-kernel@...r.kernel.org>, linux-mm <linux-mm@...ck.org>,
Andrew Morton <akpm@...ux-foundation.org>,
Oleg Nesterov <oleg@...hat.com>,
KAMEZAWA Hiroyuki <kamezawa.hiroyu@...fujitsu.com>,
Minchan Kim <minchan.kim@...il.com>
Subject: Re: [PATCH 1/2][BUGFIX] oom: remove totalpage normalization from
oom_badness()
On Mon, 30 Aug 2010, KOSAKI Motohiro wrote:
> > > The current oom_score_adj is completely broken because it is strongly bound
> > > to the Google use case and ignores everything else.
> > >
> >
> > That's wrong, we don't even use this heuristic yet and there is nothing,
> > in any way, that is specific to Google.
>
> Please show us evidence. Talking big is no way to persuade us.
Evidence that Google isn't using this currently?
> I asked you to "COMMUNICATE WITH REAL WORLD USERS"; did you really take that in?
>
We are certainly looking forward to using this when 2.6.36 is released
since we work with both cpusets and memcg.
> > > 1) Priority inversion
> > > As kamezawa-san pointed out, this breaks cgroup and lxc environments.
> > > He said,
> > > > Assume 2 processes, A and B, which have oom_score_adj of 300 and 0,
> > > > and A uses 200M, B uses 1G of memory, on a 4G system.
> > > >
> > > > At the system level:
> > > > A's score = (200M * 1000)/4G + 300 = 350
> > > > B's score = (1G * 1000)/4G = 250
> > > >
> > > > In the cpuset, which has 2G of memory:
> > > > A's score = (200M * 1000)/2G + 300 = 400
> > > > B's score = (1G * 1000)/2G = 500
> > > >
> > > > This priority inversion doesn't happen in the current system.
> > >
> >
> > You continually bring this up, and I've answered it three times, but
> > you've never responded to it before and completely ignore it.
>
> Yes, I ignored it. Don't talk about your dream; I want to see a concrete use case.
> As I repeatedly said, I don't care about you while you ignore real world end users.
> DEVELOPERS WHO DON'T CARE ABOUT STABILIZATION AND AREN'T KIND TO END USERS ARE
> HARMFUL. WE HAVE NO MERCY WHILE YOU CONTINUE THIS IMMORAL DEVELOPMENT.
>
I'm not ignoring any user with this change; oom_score_adj is an extremely
powerful interface for users who want to use it. I'm sorry that it's not
as simple to use as you might like.
Basically, it comes down to this: few users actually tune their oom
killing priority, period. That's partly because they accept the oom
killer's heuristic of killing a memory-hogging task (or they use
panic_on_oom), and partly because the old interface, /proc/pid/oom_adj,
had no unit and no logical way of being used other than polarizing it
(either +15 or -17).
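To illustrate the difference in units, here's a minimal userspace sketch
(the procfs paths are the standard knobs; the helper and the specific
values are only examples):

#include <stdio.h>

/* Illustrative helper: write an adjustment value to a procfs knob. */
static void set_adj(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");

	if (!f) {
		perror(path);
		return;
	}
	fputs(val, f);
	fclose(f);
}

int main(void)
{
	/* old knob: unitless, -16..+15 plus -17 (OOM_DISABLE); in
	 * practice only the extremes meant anything */
	set_adj("/proc/self/oom_adj", "-17");

	/* new knob: -1000..+1000, a proportion of available memory in
	 * thousandths; -300 discounts 30% of the memory available to
	 * this task */
	set_adj("/proc/self/oom_score_adj", "-300");
	return 0;
}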
Of those users who do change their oom killing priority, few use
cpusets or memcg. Yes, the priority changes depending on the context of
the oom, but for users who don't use these cgroups the oom_score_adj unit
is static, since the amount of system memory (the only oom constraint) is
static.
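To make that unit concrete, here's a rough sketch of the proportional
scoring being discussed (example_badness() is only an illustration
following the arithmetic quoted above, not the actual oom_badness() code):

#include <stdio.h>

/* Simplified model of the proportional heuristic, NOT the kernel
 * implementation.  "total_mem" is whatever memory is available to the
 * task: the whole system, or the cpuset/memcg limit when one applies. */
static long example_badness(unsigned long task_mem, unsigned long total_mem,
			    long oom_score_adj)
{
	long points = task_mem * 1000 / total_mem;	/* 0..1000 */

	return points + oom_score_adj;	/* the adjustment shares that unit */
}

int main(void)
{
	/* On a plain 4G system (values in MB) the denominator never
	 * changes, so "300" always means the same thing: it ranks an
	 * idle task the same as one using ~1.2G, i.e. 30% of RAM. */
	printf("%ld %ld\n",
	       example_badness(0, 4096, 300),		/* 300 */
	       example_badness(1229, 4096, 0));		/* 300 */
	return 0;
}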
Now, users of both oom_score_adj and cpusets or memcg (which will include
Google in the future) are interested in oom killing priority relative to
other tasks attached to the same set of resources. For our particular use
case, we attach an aggregate of tasks to a cgroup and have a preference on
the order in which those tasks are killed whenever that cgroup's limit is
exhausted. We also care about protecting vital system tasks, such as job
schedulers, so that they aren't targeted before others are killed.
I think the key point you're missing in our use case is that we don't
necessarily care about the system-wide oom condition when we're running
with cpusets or memcg. We can protect tasks with a negative oom_score_adj,
but we don't care about close tiebreakers on which cpuset or memcg is
penalized when the entire system is out of memory. If that's the case,
each cpuset and memcg is also, by definition, out of memory, so they are
all subject to the oom killer. This is equivalent to having several tasks
with an oom_score_adj of +1000 (or an oom_adj of +15) and only one getting
killed based on the order of the tasklist.
So there is actually no "bug" or "regression" in this behavior (especially
since the old oom killer had inversions as well, because it factored cpuset
placement into the heuristic score), and describing it as such is
misleading. It's actually a very powerful interface for those who choose
to use it, and it accurately reflects the way the oom killer chooses tasks:
relative to other eligible tasks competing for the same set of resources.
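Running kamezawa-san's numbers through that same proportional formula
shows what "relative to the same set of resources" means in practice
(again only a sketch; values in MB, rounded as in the quoted mail):

#include <stdio.h>

int main(void)
{
	/* whole 4G system: A uses 200M with adj 300, B uses 1G with adj 0 */
	printf("system: A=%lu B=%lu\n",
	       200UL * 1000 / 4096 + 300,	/* ~350 */
	       1024UL * 1000 / 4096);		/* 250  */

	/* the same two tasks inside a 2G cpuset: B is now the larger
	 * consumer of the constrained resource, so it ranks higher */
	printf("cpuset: A=%lu B=%lu\n",
	       200UL * 1000 / 2048 + 300,	/* ~400 */
	       1024UL * 1000 / 2048);		/* 500  */
	return 0;
}

Each ordering is only meaningful against the other tasks competing for
that particular constraint, which is exactly how the oom killer uses it.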
> > I don't know what this means, and this was your criticism before I changed
> > the denominator during the revision of the patchset, so it's probably
> > obsoleted. oom_score_adj always operates based on the proportion of
> > memory available to the application which is how the new oom killer
> > determines which tasks to kill: relative to the importance (if defined by
> > userspace) and memory usage compared to other tasks competing for it.
>
> I already explained the asymmetric NUMA issue in the past. Again, don't assume
> your policy and your machine if you want to change kernel core code.
>
Please explain it again; I don't see what the asymmetry in NUMA node sizes
has to do with either mempolicy or cpuset ooms. The administrator is
fully aware of the sizes of these nodes at the time he or she attaches
them.