Message-ID: <Pine.LNX.4.64.0708010439050.1817@scrub.home>
Date: Wed, 1 Aug 2007 05:41:10 +0200 (CEST)
From: Roman Zippel <zippel@...ux-m68k.org>
To: Mike Galbraith <efault@....de>
cc: Linus Torvalds <torvalds@...ux-foundation.org>,
Ingo Molnar <mingo@...e.hu>,
Andrew Morton <akpm@...ux-foundation.org>,
linux-kernel@...r.kernel.org
Subject: CFS review
Hi,
On Sat, 14 Jul 2007, Mike Galbraith wrote:
> > On Fri, 13 Jul 2007, Mike Galbraith wrote:
> >
> > > > The new scheduler does _a_lot_ of heavy 64 bit calculations without any
> > > > attempt to scale that down a little...
> > >
> > > See prio_to_weight[], prio_to_wmult[] and sysctl_sched_stat_granularity.
> > > Perhaps more can be done, but "without any attempt..." isn't accurate.
> >
> > Calculating these values at runtime would have been completely insane, the
> > alternative would be a crummy approximation, so using a lookup table is
> > actually a good thing. That's not the problem.
>
> I meant see usage.
I more meant serious attempts. At this point I'm not that much interested
in a few localized optimizations; what I'm interested in is how this can be
optimized at the design level (e.g. how arch information can be used to
simplify things). So I spent quite a bit of time looking through cfs and
experimenting with some ideas. I want to put the main focus on the
performance aspect, but there are a few other issues as well.
But first something else (especially for Ingo): I tried to be very careful
with any claims made in this mail, but this of course doesn't exclude the
possibility of errors, in which case I'd appreciate any corrections. Any
explanations done in this mail don't imply that anyone needs any such
explanations; they're done to keep things in context, so that interested
readers have a chance to follow even if they don't have the complete
background information. Any suggestions made don't imply that they have to
be implemented like this; they are more an incentive for further
discussion and I'm always interested in better solutions.
A first indication that something may not be quite right is the increase
in code size:
2.6.22:
   text    data     bss     dec     hex filename
  10150      24    3344   13518    34ce kernel/sched.o
recent git:
   text    data     bss     dec     hex filename
  14724     228    2020   16972    424c kernel/sched.o
That's i386 without stats/debug. A lot of the new code is in regularly
executed regions and it's often not exactly trivial code as cfs added
lots of heavy 64bit calculations. With the increased text comes
increased runtime memory usage, e.g. task_struct grew so that only
5 of them instead of 6 now fit into 8KB.
Since sched-design-CFS.txt doesn't really go into any serious detail,
the EEVDF paper was more helpful, and after playing with the ideas a
little I noticed that the whole idea of fair scheduling can be explained
somewhat more simply; I'm a little surprised not to find it mentioned
anywhere.
So a different view on this is that the runtime of a task is simply
normalized and the virtual time (or fair_clock) is the weighted average of
these normalized runtimes. The advantage of normalization is that it
makes things comparable: once the normalized time values are equal, each
task got its fair share. It's more obvious in the EEVDF paper; cfs makes
it a bit more complicated, as it uses the virtual time to calculate the
eligible runtime, but it doesn't maintain a per-process virtual time
(fair_key is not quite the same).
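To illustrate what I mean by normalization, here is a minimal sketch (my
own illustration, not code taken from cfs, and all names are made up):

#include <stdint.h>

#define WMULT_SHIFT	16

struct task {
	uint64_t norm_time;	/* normalized runtime */
	uint32_t inv_weight;	/* (1 << WMULT_SHIFT) / weight */
};

/* account 'delta' ns of real runtime: heavier tasks advance slower in
 * normalized time, so they get a proportionally larger share of real time */
static void account(struct task *t, uint32_t delta)
{
	t->norm_time += (uint64_t)delta * t->inv_weight >> WMULT_SHIFT;
}

/* the task that lags furthest behind (smallest normalized time) is the
 * one that hasn't got its fair share yet */
static struct task *pick_next(struct task *t, unsigned int n)
{
	struct task *best = &t[0];
	unsigned int i;

	for (i = 1; i < n; i++)
		if (t[i].norm_time < best->norm_time)
			best = &t[i];
	return best;
}

Once all the norm_time values are equal, every task has received real time
in proportion to its weight, which is exactly the fairness condition above.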
Here we get to the first problem: cfs is not overly accurate at
maintaining a precise balance. First, there are a lot of rounding errors due
to the constant conversion between normalized and non-normalized values, and
the higher the update frequency, the bigger the error. The effect of
this can be seen by running:
while (1)
sched_yield();
and watching the sched_debug output: the underrun counter goes
crazy. cfs thus needs the limiting to keep this misbehaviour under
control. The problem here is that it's not that difficult to hit one of
the many limits, which may change the behaviour and makes it hard to
predict how cfs will behave in different situations.
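(For reference, a complete version of that test - the loop is all that
matters, the rest is just the obvious boilerplate:)

#include <sched.h>

int main(void)
{
	while (1)
		sched_yield();
}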
The next issue is scheduler granularity: here I don't quite understand
why the actual running time has no influence at all, which makes it
difficult to predict how much cpu time a process will get at a time
(even the comments only refer to the vmstat output). What is basically
used instead is the normalized time since the task was enqueued; in
practice it's a bit more complicated, as fair_key is not entirely a
normalized time value. If the wait_runtime value is positive, higher
prioritized tasks are given even more priority than they already get
from their larger wait_runtime value. The problem here is that this
triggers underruns and lower priority tasks get even less time.
Another issue is the sleep bonus given to sleeping tasks. A problem here
is that this can be exploited: if a job is spread over a few threads,
they can get more time relative to other tasks. E.g. in this example
there are three tasks that run only for about 1ms every 3ms, but they
get far more time than they should have gotten fairly:
4544 roman 20 0 1796 520 432 S 32.1 0.4 0:21.08 lt
4545 roman 20 0 1796 344 256 R 32.1 0.3 0:21.07 lt
4546 roman 20 0 1796 344 256 R 31.7 0.3 0:21.07 lt
4547 roman 20 0 1532 272 216 R 3.3 0.2 0:01.94 l
The debug output for this is also interesting:
task PID tree-key delta waiting switches prio sum-exec sum-wait sum-sleep wait-overrun wait-underrun
------------------------------------------------------------------------------------------------------------------------------------------------------------------
lt 4544 42958653977764 -2800118 2797765 11753 120 11449444657 201615584 23750292211 9600 0
lt 4545 42958653767067 -3010815 2759284 11747 120 11442604925 202910051 23746575676 9613 0
l 4547 42958653330276 -3447606 1332892 32333 120 1035284848 -47628857 0 0 14247
Practically this means a few interactive tasks can steal quite a lot of
time from other tasks which are trying to get some work done. This may
be fine in desktop environments, but I'm not sure it can be that easily
generalized. It can make cfs quite unfair if waiting outside the
runqueue has more advantages than waiting in the runqueue.
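The 'lt' tasks above do roughly the following (a sketch reconstructed from
the description, not the actual test source; the exact numbers don't
matter, only that each task sleeps for most of its period):

#include <time.h>

#define NSEC_PER_MSEC	1000000L

/* busy-wait for about 'ns' nanoseconds */
static void burn(long ns)
{
	struct timespec start, now;

	clock_gettime(CLOCK_MONOTONIC, &start);
	do {
		clock_gettime(CLOCK_MONOTONIC, &now);
	} while ((now.tv_sec - start.tv_sec) * 1000000000L +
		 (now.tv_nsec - start.tv_nsec) < ns);
}

int main(void)
{
	struct timespec t = { 0, 2 * NSEC_PER_MSEC };

	while (1) {
		burn(NSEC_PER_MSEC);	/* ~1ms of work */
		nanosleep(&t, NULL);	/* ~2ms of sleep */
	}
}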
Finally there is still the issue with the nice levels, where I still
think the massive increase in their effect overcompensates for the lack
of adequate scheduling classes, which could e.g. give audio apps a fixed
amount of time instead of a relative portion.
Overall, while cfs has a number of good ideas, I don't think it was
quite ready for a stable release. Maybe it's possible to fix this before
the release, but I certainly would have preferred less time pressure.
It's not just the small inaccuracies, it's also the significantly
increased complexity. In this regard I find the claim that cfs "has no
heuristics whatsoever" interesting; for that to be true I would expect a
little more accuracy. That wouldn't be a big problem if cfs were a
really fast and compact scheduler, but it's quite hard to argue that
cfs has been an improvement in this particular area. Anyway, before Ingo
starts accusing me of "negativism", let's look at some possibilities
for how this could be improved.
To reduce the inaccuracies it's better to avoid conversion between
normalized and real time values, so the first example program pretty much
shows just that and demonstrates the very core of a scheduler. It maintains
per-task normalized times and uses only those to make the scheduling
decision (it doesn't even need an explicit wait_runtime value). The nice
thing about this is how consistently it gives out time shares - unless there
is a higher priority task, that task will get the maximum share, which
makes the scheduling behaviour quite a bit more predictable.
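To give the basic idea without opening the attachment (this is only a
condensed sketch with made-up names, the attached fs.c is the real
example): the only per-task scheduling state is the normalized time, and
the slice length follows directly from it, with a single conversion back
to real time when the slice is handed out:

#include <stdint.h>

#define NORM_SHIFT	16

struct task {
	uint64_t norm_time;	/* normalized runtime */
	uint32_t weight;	/* load weight */
};

/* account real runtime in normalized units */
static void account(struct task *t, uint64_t real_delta)
{
	t->norm_time += (real_delta << NORM_SHIFT) / t->weight;
}

/* how long 'curr' may run before 'next' (the task with the second
 * smallest normalized time) should take over; assumes 'curr' was
 * picked as the task furthest behind */
static uint64_t slice(const struct task *curr, const struct task *next,
		      uint64_t gran_norm)
{
	uint64_t ahead = next->norm_time + gran_norm - curr->norm_time;

	/* single conversion back to real time via the task's weight */
	return ahead * curr->weight >> NORM_SHIFT;
}

Unless a higher priority (heavier) task shows up, curr simply gets that
full slice, which is what makes the behaviour predictable.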
The second example program is more complete (e.g. it demonstrates
adding/removing tasks) but it's based on the same basic idea. First the
floating point values are converted to fixed point values. To maintain
accuracy one has to take overflows into account; cfs currently avoids
overflows by throwing away (possibly important) information, but that
adds checks all over the place instead of dealing with them within the
design, so I consider this overflow avoidance a failed strategy - it
doesn't make anything simpler and creates problems elsewhere. Of course
the example program has its own limits, but in this case I can define
them, so that within them the scheduler will work correctly. The most
important limits are:
max normalized task time delta = max inverse weight * max cpu share
max normalized average time delta = max active tasks * max inverse weight * cpu share base
The first limit is used for comparing individual tasks and the second one
is used for maintaining the average. With these limits it's possible to
choose the appropriate data types that can hold these maximum values;
then I don't really have to worry about overflows and I know the
scheduler is accurate without the need for a lot of extra checks.
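To make that concrete (the names and numbers below are invented for
illustration, fs2.c defines its own limits): once the bounds are fixed,
the worst-case values can be computed and checked against the chosen
types at compile time, so no runtime overflow checks are needed:

#include <stdint.h>

#define MAX_INV_WEIGHT		(1ULL << 16)		/* largest 1/weight, fixed point */
#define MAX_CPU_SHARE		(100ULL * 1000000)	/* 100ms in ns */
#define MAX_ACTIVE_TASKS	1024ULL

/* max normalized task time delta = max inverse weight * max cpu share */
#define MAX_TASK_DELTA		(MAX_INV_WEIGHT * MAX_CPU_SHARE)

/* max normalized average time delta =
 *	max active tasks * max inverse weight * cpu share base */
#define MAX_AVG_DELTA		(MAX_ACTIVE_TASKS * MAX_TASK_DELTA)

/* all comparisons are done on signed 64bit values, so the worst case
 * must stay below 2^63 - this refuses to compile if it doesn't */
typedef char limits_fit_in_s64[(MAX_AVG_DELTA < (1ULL << 63)) ? 1 : -1];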
The second example also adds a normalized average, but contrary to
current cfs it's not central to the design. An average is not needed to
give every task its fair share, but it can be used to make better
scheduling decisions to control _when_ a task gets its share. In this
example I provided two possibilities where it's used to schedule new
tasks, the first divides time into slices and sets a new task to the
start of that slice, the second gives the task a full time share
relative to the current average, but approximating the average (by
looking just at the min/max time) should work as well.
The average here is not a weighted average; a weighted average is a
little more complex to maintain accurately and has issues regarding
overflows, so I'm using a simple average, which is sufficient especially
since it's not a primary part of the scheduler anymore.
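As a sketch of what this looks like (names invented, the attached fs2.c
differs in detail): the sum of the normalized times is maintained
incrementally, a plain average follows from it, and a new task is simply
started at that average (roughly the second possibility above):

#include <stdint.h>

struct rq {
	uint64_t norm_sum;		/* sum of all normalized times */
	unsigned int nr_tasks;
};

struct task {
	uint64_t norm_time;
};

static uint64_t avg_norm_time(const struct rq *rq)
{
	return rq->nr_tasks ? rq->norm_sum / rq->nr_tasks : 0;
}

/* every normalized time update also updates the sum */
static void account(struct rq *rq, struct task *t, uint64_t norm_delta)
{
	t->norm_time += norm_delta;
	rq->norm_sum += norm_delta;
}

/* a new task starts at the plain average and thus gets a full share
 * relative to the tasks that are already running */
static void enqueue_new(struct rq *rq, struct task *t)
{
	t->norm_time = avg_norm_time(rq);
	rq->norm_sum += t->norm_time;
	rq->nr_tasks++;
}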
BTW, the above unfairness caused by sleeping tasks can be easily avoided in
this model by simply making sure that normalized time never goes backward.
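Reusing the structures from the sketch above, one way to do this is to
clamp the task's normalized time on wakeup:

/* when a sleeper comes back, its normalized time is never allowed to
 * lie behind the average, so a long sleep can't turn into a large credit */
static void enqueue_sleeper(struct rq *rq, struct task *t)
{
	uint64_t avg = avg_norm_time(rq);

	if (t->norm_time < avg)
		t->norm_time = avg;
	rq->norm_sum += t->norm_time;
	rq->nr_tasks++;
}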
The accuracy of this model makes it possible to further optimize the
code (it's really a key element, which is why I'm stressing it so much;
OTOH it's rather difficult to further optimize current cfs without
risking making it worse).
For example the regular updates aren't really necessary, they can also
be done only when needed (i.e. when scheduling, where an update is
needed anyway for precise accounting); the next schedule time can
easily be precalculated instead. OTOH the regular updates allow for very
cheap incremental updates: especially if one already knows that the
scheduler clock has only limited resolution (because it's based on
jiffies), it becomes possible to use mostly 32bit values.
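A sketch of what I mean (invented names again, similar to the first
sketch above, but now the delta is known to fit into 32 bits):

#include <stdint.h>

#define WMULT_SHIFT	16

struct task {
	uint64_t norm_time;
	uint32_t inv_weight;	/* fixed point 1/weight */
};

/* with a low resolution (jiffies based) scheduler clock the delta per
 * update is at most a few milliseconds, so the per-tick work is a
 * single 32x32->64 multiply plus a 64bit add */
static void update_curr(struct task *curr, uint32_t delta_ns)
{
	curr->norm_time += (uint64_t)delta_ns * curr->inv_weight >> WMULT_SHIFT;
}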
I hope the code examples help to further improve the scheduler. I'm quite
aware that they don't implement everything, but this just means some of
the cfs design decisions need more explanation. I'm not really that
much interested in the scheduler itself, I only want a small and fast
scheduler, and those are areas where cfs is no real improvement right now.
cfs practically obliterated the effort I put into the ntp code to keep the
regular updates both cheap and highly precise...
bye, Roman
View attachment "fs.c" of type "TEXT/x-csrc" (1247 bytes)
View attachment "fs2.c" of type "TEXT/x-csrc" (4966 bytes)