[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <4626B9BD.8020806@bigpond.net.au>
Date: Thu, 19 Apr 2007 10:37:17 +1000
From: Peter Williams <pwil3058@...pond.net.au>
To: Linus Torvalds <torvalds@...ux-foundation.org>
CC: Matt Mackall <mpm@...enic.com>, Nick Piggin <npiggin@...e.de>,
William Lee Irwin III <wli@...omorphy.com>,
Mike Galbraith <efault@....de>,
Con Kolivas <kernel@...ivas.org>, Ingo Molnar <mingo@...e.hu>,
ck list <ck@....kolivas.org>,
Bill Huey <billh@...ppy.monkey.org>,
linux-kernel@...r.kernel.org,
Andrew Morton <akpm@...ux-foundation.org>,
Arjan van de Ven <arjan@...radead.org>,
Thomas Gleixner <tglx@...utronix.de>
Subject: Re: [Announce] [patch] Modular Scheduler Core and Completely Fair
Scheduler [CFS]
Linus Torvalds wrote:
>
> On Wed, 18 Apr 2007, Matt Mackall wrote:
>> On Wed, Apr 18, 2007 at 07:48:21AM -0700, Linus Torvalds wrote:
>>> And "fairness by euid" is probably a hell of a lot easier to do than
>>> trying to figure out the wakeup matrix.
>> For the record, you actually don't need to track a whole NxN matrix
>> (or do the implied O(n**3) matrix inversion!) to get to the same
>> result.
>
> I'm sure you can do things differently, but the reason I think "fairness
> by euid" is actually worth looking at is that it's pretty much the
> *identical* issue that we'll have with "fairness by virtual machine" and a
> number of other "container" issues.
>
> The fact is:
>
> - "fairness" is *not* about giving everybody the same amount of CPU time
> (scaled by some niceness level or not). Anybody who thinks that is
> "fair" is just being silly and hasn't thought it through.
>
> - "fairness" is multi-level. You want to be fair to threads within a
> thread group (where "process" may be one good approximation of what a
> "thread group" is, but not necessarily the only one).
>
> But you *also* want to be fair in between those "thread groups", and
> then you want to be fair across "containers" (where "user" may be one
> such container).
>
> So I claim that anything that cannot be fair by user ID is actually really
> REALLY unfair. I think it's absolutely humongously STUPID to call
> something the "Completely Fair Scheduler", and then just be fair on a
> thread level. That's not fair AT ALL! It's the anti-thesis of being fair!
>
> So if you have 2 users on a machine running CPU hogs, you should *first*
> try to be fair among users. If one user then runs 5 programs, and the
> other one runs just 1, then the *one* program should get 50% of the CPU
> time (the users fair share), and the five programs should get 10% of CPU
> time each. And if one of them uses two threads, each thread should get 5%.
>
> So you should see one thread get 50& CPU (single thread of one user), 4
> threads get 10% CPU (their fair share of that users time), and 2 threads
> get 5% CPU (the fair share within that thread group!).
>
> Any scheduling argument that just considers the above to be "7 threads
> total" and gives each thread 14% of CPU time "fairly" is *anything* but
> fair. It's a joke if that kind of scheduler then calls itself CFS!
>
> And yes, that's largely what the current scheduler will do, but at least
> the current scheduler doesn't claim to be fair! So the current scheduler
> is a lot *better* if only in the sense that it doesn't make ridiculous
> claims that aren't true!
>
> Linus
Sounds a lot like the PLFS (process level fair sharing) scheduler in
Aurema's ARMTech (for whom I used to work). The "fair" in the title is
a bit misleading as it's all about unfair scheduling in order to meet
specific policies. But it's based on the principle that if you can
allocate CPU band width "fairly" (which really means in proportion to
the entitlement each process is allocated) then you can allocate CPU
band width "fairly" between higher level entities such as process
groups, users groups and so on by subdividing the entitlements downwards.
The tricky part of implementing this was the fact that not all entities
at the various levels have sufficient demand for CPU band width to use
their entitlements and this in turn means that the entities above them
will have difficulty using their entitlements even if other of their
subordinates have sufficient demand (because their entitlements will be
too small). The trick is to have a measure of each entity's demand for
CPU bandwidth and use that to modify the way entitlement is divided
among subordinates.
As a first guess, an entity's CPU band width usage is an indicator of
demand but doesn't take into account unmet demand due to tasks waiting
on a run queue waiting for access to the CPU. On the other hand, usage
plus time waiting on the queue isn't a good measure of demand either
(although it's probably a good upper bound) as it's unlikely that the
task would have used the same amount of CPU as the waiting time if it
had gone straight to the CPU.
But my main point is that it is possible to build schedulers that can
achieve higher level scheduling policies. Versions of PLFS work on
Windows from user space by twiddling process priorities. Part of my
more recent work at Aurema had been involved in patching Linux's
scheduler so that nice worked more predictably so that we could release
a user space version of PLFS for Linux. The other part was to add hard
CPU band width caps for processes so that ARMTech could enforce hard CPU
bandwidth caps on higher level entities (as this can't be done without
the kernel being able to do it at that level.
Peter
--
Peter Williams pwil3058@...pond.net.au
"Learning, n. The kind of ignorance distinguishing the studious."
-- Ambrose Bierce
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Powered by blists - more mailing lists