Date:	Thu, 2 Aug 2007 19:36:51 +0200 (CEST)
From:	Roman Zippel <zippel@...ux-m68k.org>
To:	Peter Zijlstra <peterz@...radead.org>
cc:	Mike Galbraith <efault@....de>,
	Linus Torvalds <torvalds@...ux-foundation.org>,
	Ingo Molnar <mingo@...e.hu>,
	Andrew Morton <akpm@...ux-foundation.org>,
	linux-kernel@...r.kernel.org
Subject: Re: CFS review

Hi,

On Wed, 1 Aug 2007, Peter Zijlstra wrote:

> Took me most of today trying to figure out WTH you did in fs2.c, more
> math and fundamental explanations would have been good. So please bear
> with me as I try to recap this thing. (No, your code was very much _not_
> obvious, a few comments and broken-out functions would have made a world
> of difference)

Thanks for the effort though. :)
I know I'm not the best at explaining these things, so I really appreciate 
the questions; they tell me what to concentrate on.

> So, for each task we keep normalised time
> 
>  normalised time := time/weight
> 
> using Bresenham's algorithm we can do this perfectly (up until a renice
> - where you'd get errors)
> 
>         avg_frac += weight_inv
>         
>         weight_inv = X / weight
>         
>         avg = avg_frac / weight0_inv
>         
>         weight0_inv = X / weight0
>         
>         avg = avg_frac / (X / weight0) 
>             = (X / weight) / (X / weight0) 
>             = X / weight * weight0 / X 
>             = weight0 / weight
>         
> 
> So avg ends up being in units of [weight0/weight].
> 
> Then, in order to allow sleeping, we need to have a global clock to sync
> with. It's this global clock that gave me headaches to reconstruct.
> 
> We're looking for a time like this:
> 
>   rq_time := sum(time)/sum(weight)
> 
> And you commented that the /sum(weight) part is where CFS obtained its
> accumulating rounding error? (I'm inclined to believe the error will
> statistically be 0, but I'll readily accept otherwise if you can show a
> practical 'exploit')
> 
> It's not obvious how to do this using modulo logic like Bresenham because
> that would involve using a gcm of all possible weights.

I think I've sent you off in the wrong direction somehow. Sorry. :)

Let's ignore the average for a second; normalized time is maintained as:

	normalized time := time * (2^16 / weight)

The important point is that I keep the value in the full resolution of 
2^-16 vsec units (vsec for virtual second, i.e. sec/weight, where every 
task gets weight seconds for every virtual second; to keep things simpler 
I also omit the nano prefix from the units for the moment). Compared to 
that, CFS maintains its global normalized value in 1 vsec units.
Since I don't round the value down, I avoid the accumulating error; this 
means that

	time_norm += time_delta1 * (2^16 / weight)
	time_norm += time_delta2 * (2^16 / weight)

is the same as

	time_norm += (time_delta1 + time_delta2) * (2^16 / weight)
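
As a standalone illustration (this is not fs2.c; the weight and the deltas 
are just made up), accumulating in the full 2^-16 resolution makes the 
result independent of how the deltas are split:

	#include <stdint.h>
	#include <stdio.h>

	#define FSHIFT 16				/* the 2^16 factor from above */

	int main(void)
	{
		uint64_t weight = 820;			/* example task weight */
		uint64_t weight_inv = (1ULL << FSHIFT) / weight;
		uint64_t d1 = 1000000, d2 = 3000000;	/* two example deltas, nsec */

		/* accumulate in 2^-16 vsec units without shifting back down ... */
		uint64_t split = d1 * weight_inv + d2 * weight_inv;
		/* ... which is exactly the same as scaling the combined delta once */
		uint64_t whole = (d1 + d2) * weight_inv;

		printf("%llu == %llu\n", (unsigned long long)split,
		       (unsigned long long)whole);
		return 0;
	}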

CFS for example does this

	delta_mine = calc_delta_mine(delta_exec, curr->load.weight, lw);

in the above terms this means

	time = time_delta * weight * (2^16 / weight_sum) / 2^16

The last shift now rounds the value down, and if one does that 1000 times 
per second, the resolution of the value that is finally accounted to 
wait_runtime is reduced accordingly.
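
To illustrate the effect (this is not the real calc_delta_mine(); the 
shift and the numbers are simplified for the example), compare shifting 
down on every tick with shifting down once per second:

	#include <stdint.h>
	#include <stdio.h>

	#define WMULT_SHIFT 16			/* simplified, illustrative only */

	int main(void)
	{
		uint64_t weight = 1024, weight_sum = 3000;
		uint64_t inv_sum = (1ULL << WMULT_SHIFT) / weight_sum;
		uint64_t delta = 999983;	/* ~1ms tick in nsec, example */
		uint64_t per_tick = 0, one_shot;
		int ticks = 1000;		/* one second at 1000 Hz */

		for (int i = 0; i < ticks; i++)	/* round down on every tick ... */
			per_tick += (delta * weight * inv_sum) >> WMULT_SHIFT;

		/* ... versus rounding down once for the whole second */
		one_shot = ((uint64_t)ticks * delta * weight * inv_sum) >> WMULT_SHIFT;

		printf("per-tick %llu, one-shot %llu, lost %llu\n",
		       (unsigned long long)per_tick, (unsigned long long)one_shot,
		       (unsigned long long)(one_shot - per_tick));
		return 0;
	}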

The other rounding problem stems from the fact that this term

	x * prio_to_weight[i] * prio_to_wmult[i] / 2^32

doesn't produce x for most values in those tables (the same applies to the 
weight sum), so in chains where values are converted from one scale to the 
other, a rounding error is produced. In CFS this happens now because 
wait_runtime is maintained in nanoseconds while fair_clock is a normalized 
value.
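
A quick way to see this (the weight/wmult pairs below merely follow the 
wmult ~ 2^32/weight pattern of those tables; they are not necessarily the 
exact table values):

	#include <stdint.h>
	#include <stdio.h>

	int main(void)
	{
		/* pairs chosen so that weight * wmult is only close to 2^32 */
		struct { uint64_t weight, wmult; } tbl[] = {
			{  1024,  4194304 },	/* product == 2^32 exactly */
			{   335, 12820798 },	/* product == 2^32 + 34    */
			{ 88761,    48388 },	/* product == 2^32 - 28    */
		};
		uint64_t x = 1000000000;	/* 1 sec of runtime, in nsec */

		for (unsigned i = 0; i < sizeof(tbl) / sizeof(tbl[0]); i++) {
			uint64_t back = (x * tbl[i].weight * tbl[i].wmult) >> 32;
			printf("weight %5llu: error %lld\n",
			       (unsigned long long)tbl[i].weight,
			       (long long)back - (long long)x);
		}
		return 0;
	}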

The problem here isn't that these errors are statistically relevant; they 
are usually completely overshadowed by measurement errors anyway. The 
problem is that these errors exist at all, which means they have to be 
compensated for somehow, so that they don't accumulate over time and 
become significant. This also has to be seen in the context of the 
overflow checks. All this adds a number of variables to the system, which 
considerably increases complexity and makes a thorough analysis quite 
challenging.

So to get back to the average: if you look for this

	rq_time := sum(time)/sum(weight)

you won't find it in this form. This is basically a weighted average, and 
I agree it can't really be maintained via the modulo logic (at least 
AFAICT), so I'm using a simple average instead. So if we have:

	time_norm = time/weight

we can write your rq_time like this:

	weighted_avg = sum_{i}^{N}(time_norm_{i}*weight_{i})/sum_{i}^{N}(weight_{i})

this is the formula for a weighted average, so we can approximate the value 
using a simple average instead:

	avg = sum_{i}^{N}(time_norm_{i})/N

This sum is now what I maintain at runtime incrementally:

	time_{i} = sum_{j}^{S}(time_{j})

	time_norm_{i} = time_{i}/weight_{i}
	              = sum_{j}^{S}(time_{j})/weight_{i}
	              = sum_{j}^{S}(time_{j}/weight_{i})

If I add this up and include weight0 I get:

	avg*N*weight0 = sum_{i}^{N}(time_norm_{i})*weight0

and now I also have the needed modulo factors.
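
To put a number on the difference between the two averages, here is a toy 
comparison (run times and weights are made up; note that if every task has 
exactly its fair share, i.e. time_i = k*weight_i, the two averages 
coincide):

	#include <stdio.h>

	int main(void)
	{
		/* three tasks: run time (nsec) and weight, made-up numbers */
		double t[] = { 10000000.0, 5100000.0, 19800000.0 };
		double w[] = { 1024.0, 512.0, 2048.0 };
		double sum_t = 0, sum_w = 0, sum_norm = 0;
		int i, n = 3;

		for (i = 0; i < n; i++) {
			sum_t += t[i];
			sum_w += w[i];
			sum_norm += t[i] / w[i];	/* time_norm_i */
		}

		/* rq_time := sum(time)/sum(weight), the weighted average */
		printf("weighted avg: %.1f\n", sum_t / sum_w);
		/* the simple average of time_norm_i used as approximation */
		printf("simple   avg: %.1f\n", sum_norm / n);
		return 0;
	}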

The average could probably be simplified further by using a different 
approximation. The question is how perfect this average has to be. The 
average is only used to control when a task gets its share; currently 
higher-priority tasks are already given a bit more preference, or a sleep 
bonus is added.

In CFS the average already isn't perfect due to the above rounding 
problems; otherwise the sum of all updated wait_runtime values would have 
to be 0, and if a task with a wait_runtime value different from 0 is 
added, fair_clock would have to change too to keep the balance.

So unless someone has a brilliant idea, I guess we have to settle for an 
approximation of a perfect scheduler. :) My approach differs insofar as I 
at least maintain an accurate fair share and base the approximate 
scheduling decision on it. This has IMO the advantage that the scheduler 
function can easily be exchanged: one can do it the quick and dirty way, 
or one can put in the extra effort to get it closer to perfection. Either 
way every task will get its fair share.

The scheduling function I used is rather simple:

	if (j >= 0 &&
	    ((int)(task[l].time_norm - (task[j].time_norm + SLICE(j))) >= 0 ||
	     ((int)(task[l].time_norm - task[j].time_norm) >= 0 &&
	      (int)(task[l].time_norm - (s + SLICE(l))) >= 0))) {

So a new task is selected if there is a higher-priority task or if the 
task has used up its share (unless the other task has lower priority and 
already got its share). It would be interesting to use a dynamic 
(per-task) time slice here, which should make it possible to control the 
burstiness that has been mentioned.

> I'm not sure all this is less complex than CFS, I'd be inclined to say
> it is more so.

The complexity is different. IMO the basic complexity of maintaining the 
fair share is lower, and I can add arbitrary complexity on top of it to 
improve scheduling. Most of it now comes from maintaining the accurate 
average; which other optimizations are possible depends on how the 
scheduling is finally done.

> Also, I think you have an accumulating error on wakeup where you sync
> with the global clock but fully discard the fraction.

I hope not, otherwise the checks at the end should have triggered. :)
The per-task requirement is

	time_norm = time_avg * weight0 + avg_fract

I don't simply discard the fraction; it's accounted to time_norm and then 
synced to the global average.
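
As a simplified sketch of that invariant (the names and the exact sync 
policy are illustrative, not the actual fs2.c code), the fractional part 
simply stays part of time_norm across the sync:

	#include <stdint.h>

	#define WEIGHT0	1024		/* nice-0 weight, "weight0" above */

	struct fair_task {
		uint64_t time_norm;	/* normalized time, 2^-16 vsec units */
		uint64_t time_avg;	/* whole weight0 units of time_norm */
		uint64_t avg_fract;	/* remainder, kept instead of dropped */
	};

	/* wake-up sync: bring a sleeper that fell behind up to the global
	 * average, but keep its fractional part accounted in time_norm */
	void sync_to_avg(struct fair_task *t, uint64_t global_avg)
	{
		if (t->time_avg < global_avg)
			t->time_avg = global_avg;
		t->time_norm = t->time_avg * WEIGHT0 + t->avg_fract;
	}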

bye, Roman
