Message-ID: <20070415152643.GH8915@holomorphy.com>
Date: Sun, 15 Apr 2007 08:26:43 -0700
From: William Lee Irwin III <wli@...omorphy.com>
To: Willy Tarreau <w@....eu>
Cc: Pekka Enberg <penberg@...helsinki.fi>,
hui Bill Huey <billh@...ppy.monkey.org>,
Ingo Molnar <mingo@...e.hu>, Con Kolivas <kernel@...ivas.org>,
ck list <ck@....kolivas.org>,
Peter Williams <pwil3058@...pond.net.au>,
linux-kernel@...r.kernel.org,
Linus Torvalds <torvalds@...ux-foundation.org>,
Andrew Morton <akpm@...ux-foundation.org>,
Nick Piggin <npiggin@...e.de>, Mike Galbraith <efault@....de>,
Arjan van de Ven <arjan@...radead.org>,
Thomas Gleixner <tglx@...utronix.de>
Subject: Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
On Sun, Apr 15, 2007 at 02:45:27PM +0200, Willy Tarreau wrote:
> Now I hope he and Bill will get over this and accept to work on improving
> this scheduler, because I really find it smarter than a dumb O(1). I even
> agree with Mike that we now have a solid basis for future work. But for
> this, maybe a good starting point would be to remove the selfish printk
> at boot, revert useless changes (SCHED_NORMAL->SCHED_FAIR comes to mind)
> and improve the documentation a bit so that people can work together on
> the new design, without feeling like their work will only serve to
> promote X or Y.
While I appreciate people coming to my defense, or at least the good
intentions behind such, my only actual interest in pointing out
4-year-old work is getting some acknowledgment of having done something
relevant at all. Sometimes it has "I told you so" value. At other times
it's merely a matter of clarifying what went on when people refer to it,
since in a number of cases the patches are no longer extant, so they
can't actually look at them to get an idea of what was or wasn't done.
And at other times I'm miffed about not being credited, whether or not
I should've been, or whether dead and buried code has an implementation
of the same idea resurfacing without the author(s) having any knowledge
of my prior work.
One should note that in this case, the first work of mine this trips
over (scheduling classes) was never publicly posted, as it was only a
part of the original plugsched (specifically, part of an alternate
scheduler implementation included to demonstrate plugsched's
flexibility with respect to scheduling policies), and a part that was
dropped by subsequent maintainers. The second work of mine this trips
over, a virtual deadline
scheduler named "vdls," was also never publicly posted. Both are from
around the same time period, which makes them approximately 4 years dead.
Neither of the codebases is extant, both having been lost in a transition
between employers, though various people recall having been sent them
privately, and plugsched survives in a mutated form as maintained by
Peter Williams, who's been very good about acknowledging my original
contribution.
If I care to become a direct participant in scheduler work, I can do so
easily enough.
I'm not entirely sure what all this talk of a basis for future work is
about. By and large one should alter the APIs and data structures to
fit the policy being implemented. While the array swapping was nice for
algorithmically improving 2.4.x-style epoch expiry, most algorithms
not based on the 2.4.x scheduler (in however mutated a form) should use
a different queue structure, in fact one designed around their
policy's specific algorithmic needs. IOW, when one alters the scheduler,
one should also alter the queue data structure appropriately. I'd not
expect the priority queue implementation in cfs to continue to be used
unaltered as it matures, nor would I expect any significant modification
of the scheduler to necessarily use a similar one.
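To make that concrete: a policy built around something like earliest
virtual deadline first naturally wants an ordered structure keyed on
the deadline, with the next task to run at one end, rather than the
O(1) priority arrays. A toy sketch follows; the names (vd_rq,
vd_enqueue, and so on) are invented here for illustration and are not
taken from cfs or from any other posted patch.

/*
 * Toy sketch only: a runqueue keyed by a per-task virtual deadline,
 * built on the kernel's rbtree.  All names are hypothetical.
 */
#include <linux/types.h>
#include <linux/rbtree.h>

struct vd_task {
	struct rb_node	run_node;
	u64		vdeadline;	/* policy-specific ordering key */
};

struct vd_rq {
	struct rb_root	timeline;	/* tasks ordered by vdeadline */
};

static void vd_enqueue(struct vd_rq *rq, struct vd_task *p)
{
	struct rb_node **link = &rq->timeline.rb_node, *parent = NULL;

	while (*link) {
		struct vd_task *entry;

		parent = *link;
		entry = rb_entry(parent, struct vd_task, run_node);
		link = p->vdeadline < entry->vdeadline ?
			&parent->rb_left : &parent->rb_right;
	}
	rb_link_node(&p->run_node, parent, link);
	rb_insert_color(&p->run_node, &rq->timeline);
}

/* The next task to run is simply the leftmost node in the tree. */
static struct vd_task *vd_pick_next(struct vd_rq *rq)
{
	struct rb_node *left = rb_first(&rq->timeline);

	return left ? rb_entry(left, struct vd_task, run_node) : NULL;
}

The point is not this particular structure; it's that the structure
falls out of the policy. An epoch-expiry policy wants the arrays, a
deadline policy wants an ordered queue, and so on.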
By and large I've been mystified as to why there is such a penchant for
preserving the existing queue structures in the various scheduler
patches floating around. I am now every bit as mystified by the
emerging point of view that a change of queue structure is
particularly significant. These are all largely internal changes to
sched.c, and as such, rather small changes in and of themselves. While
they do tend to have user-visible effects, from this point of view
even changing out every line of sched.c is effectively a micropatch.
Something more significant might be altering the schedule() API to
take a mandatory description of the intention of the call to it, or
breaking up schedule() into several different functions to distinguish
between different sorts of uses of it to which one would then respond
differently. Also more significant would be adding a new state beyond
TASK_INTERRUPTIBLE, TASK_UNINTERRUPTIBLE, and TASK_RUNNING for some
tasks to respond only to fatal signals, then sweeping TASK_UNINTERRUPTIBLE
users to use the new state and handle those fatal signals. While not
quite as ostentatious in their user-visible effects as SCHED_OTHER
policy affairs, they are tremendously more work than switching out the
implementation of a single C file, and so somewhat more respectable.
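For concreteness, a rough sketch of what I mean by the latter; the
state name TASK_FATALSIGNAL and the helper fatal_signal_is_pending()
are invented here purely for illustration, and the scheduler- and
signal-delivery-side plumbing (waking such sleepers only on SIGKILL)
is omitted, since that plumbing is exactly where the bulk of the work
would be:

/* Hypothetical sketch only; nothing here is from a posted patch. */
#include <linux/sched.h>
#include <linux/wait.h>
#include <linux/errno.h>

#define TASK_FATALSIGNAL	128	/* hypothetical new state bit; value illustrative only */

static inline int fatal_signal_is_pending(struct task_struct *p)
{
	/* roughly: SIGKILL queued against the task or its group */
	return sigismember(&p->pending.signal, SIGKILL) ||
	       sigismember(&p->signal->shared_pending.signal, SIGKILL);
}

/* A swept TASK_UNINTERRUPTIBLE sleep site might then look like: */
static int wait_for_thing(wait_queue_head_t *wq, int *done)
{
	DEFINE_WAIT(wait);
	int ret = 0;

	for (;;) {
		prepare_to_wait(wq, &wait, TASK_FATALSIGNAL);
		if (*done)
			break;
		if (fatal_signal_is_pending(current)) {
			ret = -EINTR;
			break;
		}
		schedule();
	}
	finish_wait(wq, &wait);
	return ret;
}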
Even as scheduling semantics go, the scheduler patches being floated
are micropatches. So SCHED_OTHER changes a little. Where are the gang
schedulers? Where are the batch
schedulers (SCHED_BATCH is not truly such)? Where are the isochronous
(frame) schedulers? I suppose there is some CKRM work that actually has
a semantic impact despite being largely devoted to SCHED_OTHER, and
there's some spufs gang scheduling going on, though not all that much.
And to reiterate a point from other threads, even as SCHED_OTHER
patches go, I see precious little verification that things like the
semantics of nice numbers or other sorts of CPU bandwidth allocation
between competing tasks of various natures are staying the same while
other things are changed, or at least being consciously modified in
such a fashion as to improve them. I've seen literally only one or two
tests run on all this with any sort of quantification of how CPU
bandwidth is distributed, and rather inflexible ones with respect to
mixtures of sleep and running time at that.
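By way of illustration, the sort of quantification I have in mind
needn't be elaborate. A minimal userspace sketch (not one of the
existing tests; on SMP both hogs would additionally need to be pinned
to a single CPU for the numbers to reflect contention):

/*
 * Two CPU hogs at different nice levels run for a fixed interval;
 * report how the CPU time was split between them.
 */
#include <stdio.h>
#include <unistd.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <sys/time.h>
#include <sys/resource.h>

static pid_t spawn_hog(int niceval)
{
	pid_t pid = fork();

	if (pid == 0) {
		nice(niceval);
		for (;;)
			;	/* pure CPU burn */
	}
	return pid;
}

static void reap_and_report(pid_t pid, const char *tag)
{
	struct rusage ru;
	int status;

	kill(pid, SIGKILL);
	wait4(pid, &status, 0, &ru);
	printf("%s: %ld.%06ld s of CPU\n", tag,
	       (long)ru.ru_utime.tv_sec, (long)ru.ru_utime.tv_usec);
}

int main(void)
{
	pid_t a = spawn_hog(0), b = spawn_hog(10);

	sleep(30);		/* measurement interval */
	reap_and_report(a, "nice  0");
	reap_and_report(b, "nice 10");
	return 0;
}

The interesting part is then sweeping nice levels and sleep/run
mixtures and checking that the resulting shares match whatever the
implementation claims to intend, rather than merely eyeballing
interactivity.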
So from my point of view, there's a lot of churn and craziness going on
in one tiny corner of the kernel and people don't seem to have a very
solid grip on what effects their changes have or how they might
potentially break userspace. So I've developed a sudden interest in
regression testing of the scheduler in order to ensure that various sorts
of semantics on which userspace relies are not broken, and am trying to
spark more interest in general in nailing down scheduling semantics and
verifying that those semantics are honored and remain honored by whatever
future scheduler implementations might be merged.
Thus far, the laundry list of semantics I'd like to have nailed down
is specifically:
(1) CPU bandwidth allocation according to nice numbers
(2) CPU bandwidth allocation among mixtures of tasks with varying
sleep/wakeup behavior, e.g. tasks that consume some percentage of CPU
in isolation, perhaps also varying the granularity of their
sleep/wakeup patterns (a rough sketch of such a task follows this list)
(3) sched_yield(), so multitier userspace locking doesn't go haywire
(4) How these work with SMP; most people agree it should be mostly the
same as it works on UP, but it's not being verified, as most
testcases are barely SMP-aware if at all, and corner cases
where proportionality breaks down aren't considered
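The mixtures item (2) refers to can be generated with a simple
duty-cycle loop; a rough sketch of one such task follows (the
fork/measurement harness can mirror the nice test above, and again
everything needs to be pinned to one CPU on SMP for the
isolated-percentage arithmetic to hold):

/*
 * Burn busy_ms, sleep sleep_ms, repeat.  In isolation this consumes
 * roughly busy_ms / (busy_ms + sleep_ms) of a CPU; the test is to run
 * several such tasks together and measure what each actually receives.
 */
#include <time.h>
#include <unistd.h>

static double now_ms(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return ts.tv_sec * 1000.0 + ts.tv_nsec / 1e6;
}

static void duty_cycle_hog(int busy_ms, int sleep_ms)
{
	for (;;) {
		double start = now_ms();

		while (now_ms() - start < busy_ms)
			;		/* burn */
		usleep(sleep_ms * 1000);
	}
}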
The sorts of explicit decisions I'd like to see made for these are:
(1) In a mixture of tasks with varying nice numbers, a given nice number
corresponds to some share of CPU bandwidth. Implementations
should not have the freedom to change this arbitrarily; any change
should be a consciously stated intention.
(2) A given scheduler _implementation_ intends to distribute CPU
bandwidth among mixtures of tasks that would each consume some
percentage of the CPU in isolation, varying across tasks in some
particular pattern. For example, maybe some scheduler
implementation assigns a weight of 1/%cpu to a task that would
consume %cpu in isolation, for a CPU bandwidth allocation of
(1/%cpu)/(sum 1/%cpu(t)) as t ranges over all competing tasks,
so that tasks which would use 50%, 25%, and 25% in isolation
end up with 20%, 40%, and 40% of the CPU respectively (this is
not to say that such a policy makes sense).
(3) sched_yield() is intended to result in some particular scheduling
pattern in a given scheduler implementation. For instance, an
implementation may intend that a set of CPU hogs calling
sched_yield() between repeatedly printf()'ing their pids will
see their output come out in an approximately consistent
order as the scheduler cycles between them (sketched after this list).
(4) What an implementation intends to do with respect to SMP CPU
bandwidth allocation when precise emulation of UP behavior is
impossible, considering sched_yield() scheduling patterns when
possible as well. For instance, perhaps an implementation
intends to ensure equal CPU bandwidth among competing CPU-bound
tasks of equal priority at all costs, and so triggers migration
and/or load balancing to make it so. Or perhaps an implementation
intends to ensure precise sched_yield() ordering at all costs
even on SMP. What I'm after is some sort of specification of the
intention, followed by a testcase verifying that the intention is
carried out.
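A sketch of the sched_yield() ordering test from (3), again purely
illustrative (pin everything to a single CPU for the ordering to be
meaningful on SMP, and a real version would burn some CPU between
yields):

/*
 * Several children print their pid and call sched_yield() in a loop;
 * under a strict round-robin yield semantic the pids should appear in
 * a repeating order once all children are running.
 */
#include <sched.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/wait.h>

#define NTASKS	4
#define ROUNDS	20

int main(void)
{
	int i;

	for (i = 0; i < NTASKS; i++) {
		if (fork() == 0) {
			int r;

			for (r = 0; r < ROUNDS; r++) {
				printf("%d\n", getpid());
				fflush(stdout);
				sched_yield();
			}
			_exit(0);
		}
	}
	for (i = 0; i < NTASKS; i++)
		wait(NULL);
	return 0;
}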
Also, if there's a semantic issue to be resolved, I want it to have
something describing it and verifying it. For instance, characterizing
whatever sort of scheduling artifacts queue-swapping causes in the
mainline scheduler, along with a testcase demonstrating the artifact and
its resolution in a given scheduler rewrite, would be a good design
statement and verification. Then, if someone wants to go back
to queue-swapping or other epoch expiry semantics, it would make them
(and hopefully everyone else) conscious of the semantic issue the
change raises, or possibly serve as a demonstration that the artifacts
can be mitigated in some implementation retaining epoch expiry semantics.
As I become aware of more potential issues I'll add more to my laundry
list, and I'll hammer out testcases as I go. My concern with the
scheduler is that this sort of basic functionality may be significantly
disturbed with no one noticing at all until a distro issues a prerelease
and benchmarks go haywire, and furthermore that changes to this kind of
basic behavior may be signs of things going awry, particularly as more
churn happens.
So now that I've clarified my role in all this to date and my point of
view on it, it should be clear why accepting this scheduler and working
on some particular scheduler implementation don't make sense as
suggestions to me.
-- wli