Message-ID: <20061020223553.GA14357@in.ibm.com>
Date: Sat, 21 Oct 2006 04:05:53 +0530
From: Dinakar Guniguntala <dino@...ibm.com>
To: Paul Jackson <pj@....com>
Cc: nickpiggin@...oo.com.au, mbligh@...gle.com, akpm@...l.org,
menage@...gle.com, Simon.Derr@...l.net,
linux-kernel@...r.kernel.org, rohitseth@...gle.com, holt@....com,
dipankar@...ibm.com, suresh.b.siddha@...el.com, clameter@....com
Subject: Re: [RFC] cpuset: remove sched domain hooks from cpusets
On Fri, Oct 20, 2006 at 02:41:53PM -0700, Paul Jackson wrote:
>
> > 2. The main change is that we don't allow tasks to be added to a cpuset
> > if it has child cpusets that also have the sched_domain flag turned on
> > (maybe return -EINVAL if the user tries to do that)
>
> This I would not like. It's ok to have tasks in cpusets that are
> cut by sched domain partitions (which is what I think you were getting
> at), just so long as one doesn't mind that they don't load balance
> across the partition boundaries.
>
> For example, we -always- have several tasks per-cpu in the top cpuset.
> These are the per-cpu kernel threads. They have zero interest in
> load balancing, because they are pinned on a cpu, for their life.
I cannot think of any reason why this change would affect per-cpu tasks.
>
> Or, for a slightly more interesting example, one might have a sleeping
> job (batch scheduler sent SIGPAUSE to all its threads) that is in a
> cpuset cut by the current sched domain partitioning. Since that job is
> not running, we don't care whether it gets good load balancing services
> or not.
OK, here's when I think a system administrator would want to partition
sched domains. If there is an application that is very sensitive to
performance and latencies, with very low tolerance for interference
from any other code running on its cpus, then the admin would partition
the sched domain and separate this application from the rest of the
system. (Per-cpu threads will obviously continue to run in the same
domain as the app.)

So in this example there is clearly no sense in letting a batch job run
in the same sched domain as our application. Now let's say our
performance- and latency-sensitive application only runs during the
day; the admin can then turn off the sched_domain flag and tear down
the sched domain for the night. This lets the batch job running in the
parent cpuset get a chance to run on all the cpus.
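To make that workflow concrete, something like the following (a rough
userspace sketch, not working code: the per-cpuset sched_domain file is
only the proposal under discussion, and I am assuming cpusets mounted
at /dev/cpuset with the app's cpuset named rt_app) could run from cron
each evening:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Write a one-character flag value into a cpuset control file. */
int set_flag(const char *path, char val)
{
	int fd = open(path, O_WRONLY);

	if (fd < 0) {
		perror(path);
		return -1;
	}
	if (write(fd, &val, 1) != 1) {
		close(fd);
		return -1;
	}
	return close(fd);
}

int main(void)
{
	/*
	 * Evening: clear the (proposed, hypothetical) sched_domain
	 * flag to tear the partition down, so the batch job in the
	 * parent cpuset can balance across all cpus overnight. A
	 * morning job would write '1' back to restore isolation.
	 */
	return set_flag("/dev/cpuset/rt_app/sched_domain", '0');
}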
Returning -EINVAL when trying to attach a job to the top cpuset while
it has one or more child cpusets with the sched_domain flag turned on
would make it explicit to the administrator that not all of the cpus
in that cpuset are available for the job's use. Attaching jobs (such as
the batch job in your example) to the top cpuset *before* doing any
sched domain partitioning would still let them make the best use of
resources (sort of a backdoor). However, if you feel this puts too much
of a restriction on the admin when creating tasks such as the batch
job, then we can do without it and just document the sched_domain flag
and its effects.
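Roughly, the attach-time check I am proposing would look something like
this (a toy userspace model for illustration only; the real struct
cpuset in kernel/cpuset.c is quite different, and the sched_domain
field here is just the proposed flag):

#include <errno.h>
#include <stddef.h>

/* Toy model of a cpuset; the real structure is much richer. */
struct cpuset {
	int sched_domain;		/* the proposed per-cpuset flag */
	struct cpuset **children;	/* NULL-terminated child list */
};

/*
 * Refuse to add a task to a cpuset that has a child with the
 * sched_domain flag set: such a task could never be load
 * balanced across the partition boundary.
 */
int validate_attach(const struct cpuset *cs)
{
	struct cpuset **c;

	for (c = cs->children; c && *c; c++)
		if ((*c)->sched_domain)
			return -EINVAL;
	return 0;
}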
>
> I still suspect we will just have to let the admin partition their
> system as they will, and if they screw up their load balancing,
> the best we can do is to make all this as transparent and simple
> and obvious as we can, and wish them well.
>
> One thing I'm sure of. The current (ab)use of the 'cpu_exclusive' flag
> to define sched domain partitions is flunking the "transparent, simple
> and obvious" test ;).
I think this is a case of one set of folks talking about <32 cpu
systems and another set talking about >512 cpu systems.
>
> > I think the main issue here is that most users don't have to do
> > more than one level of partitioning (having to partition a system
> > with not more than 16-32 cpus, mostly less)
>
> Could you (or some sched domain wizard) explain to me why we would even
> want sched domain partitions on such 'small' systems? I've been operating
> under the (mis?)conception that these sched domain partitions were just
> a performance band-aid for the humongous systems, where load balancing
> across say 1024 CPUs was difficult to do efficiently.
Well, it makes a difference for applications that have an
RT/performance-sensitive component that needs a sched domain of its
own.
-Dinakar