[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-Id: <479F1118.BA47.005A.0@novell.com>
Date: Tue, 29 Jan 2008 09:42:16 -0700
From: "Gregory Haskins" <ghaskins@...ell.com>
To: "Paul Jackson" <pj@....com>
Cc: <a.p.zijlstra@...llo.nl>, <mingo@...e.hu>,
<dmitry.adamushko@...il.com>, <rostedt@...dmis.org>,
<menage@...gle.com>, <rientjes@...gle.com>, <tong.n.li@...el.com>,
<tglx@...utronix.de>, <akpm@...ux-foundation.org>,
<dhaval@...ux.vnet.ibm.com>, <vatsa@...ux.vnet.ibm.com>,
<sgrubb@...hat.com>, <linux-kernel@...r.kernel.org>,
<ebiederm@...ssion.com>, <nickpiggin@...oo.com.au>
Subject: Re: scheduler scalability - cgroups, cpusets and
load-balancing
>>> On Tue, Jan 29, 2008 at 11:28 AM, in message
<20080129102836.be614579.pj@....com>, Paul Jackson <pj@....com> wrote:
> Gregory wrote:
>> I am a bit confused as to why you disable load-balancing in the
>> RT cpuset? It shouldn't be strictly necessary in order for the
>> RT scheduler to do its job (unless I am misunderstanding what you
>> are trying to accomplish?). Do you do this because you *have*
>> to in order to make real-time deadlines, or because its just a
>> further optimization?
>
> My primary motivation for cpusets originally, and for the
> sched_load_balance flag now, was not realtime, but "soft partitioning"
> of big NUMA systems, especially for batch schedulers. They sometimes
> have large cpusets which are only being used to hold smaller, per-job,
> cpusets. It is a waste of time (CPU cycles in the kernel sched code)
> to load balance those large cpusets. Load balancing doesn't scale
> easily to high CPU counts, and it's nice to avoid doing that where
> not needed.
Understood, and that makes tons of sense.
>
> See the following lkml message for a fuller explanation:
>
> http://lkml.org/lkml/2008/1/29/85
>
> As a secondary motivation, I thought that disabling load balancing on
> the RT cpuset was the right thing to do for RT needs, but I make no
> claim to knowing much about RT.
Well, I make no claim to understand the large batch systems you work on either ;) Everything you said made a ton of sense other than the RT/load-balance thing, but I think we are on the same page now.
>
> I just now realized that you added a 'root_domain' in a patch in
> late Nov and early Dec. I was on the road then, moving from
> California to Texas, and not paying much attention to Linux.
np (though I was wondering why you had no comment before ;)
>
> A couple of questions on that patch, both involving a comment it adds
> to kernel/sched.c:
>
> /*
> * We add the notion of a root-domain which will be used to define per-domain
> * variables. Each exclusive cpuset essentially defines an island domain by
> * fully partitioning the member cpus from any other cpuset. Whenever a new
> * exclusive cpuset is created, we also create and attach a new root-domain
> * object.
> */
>
> 1) What are 'per-domain' variables?
s/per-domain/per-root-domain
>
> 2) The mention of 'exclusive cpuset' is no longer correct.
>
> With the patch 'remove sched domain hooks from cpusets' cpusets
> no longer defines sched domains using the cpu_exclusive flag.
>
> With the subsequent sched_load_balance patch (see
> http://lkml.org/lkml/2007/10/6/19) cpusets uses a new per-cpuset
> flag 'sched_load_balance' to define sched domains.
Doh! Thanks for the heads up.
>
> The following revised comment might be more accurate:
>
> /*
> * We add the notion of a root-domain which will be used to define per-domain
> * variables. Each non-overlapping sched domain defines an island domain by
> * fully partitioning the member cpus from any other cpuset. Whenever a new
> * such a sched domain is created, we also create and attach a new
> root-domain
> * object. These non-overlapping sched domains are determined by the cpuset
> * configuration, via a call to partition_sched_domains().
> */
>
> It sounds like you (Gregory, others) want your RT CPUs to be in a sched
> domain, unlike the current way things are, where my cpuset code
> carefully avoids setting up a sched domain for those CPUs. However I
> still have need, in the batch scheduler case explained above, to have
> some CPUs not in any sched domain.
>
> If you require these RT sched domains to be setup differently somehow,
> in some way that is visible to partition_sched_domains, then that
> apparently means we need a per-cpuset flag to mark those RT cpusets.
I think we only need a plain-vanilla partition, so no flags should be necessary.
-Greg
>
> If you just want an ordinary sched domain setup (just so long as it
> contains only the intended RT CPUs, not others) then I guess we don't
> technically need any more per-cpuset flags, but I'm worried, because
> the API we're presenting to users for this has just gone from subtle to
> bizarre. I suspect I'll want to add a flag anyway, if by doing so, I
> can make the kernel-user API, via cpusets, easier to understand.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Powered by blists - more mailing lists