linux-kernel - Re: [patch] CFS scheduler, v3

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <4629A349.8060207@bigpond.net.au>
Date:	Sat, 21 Apr 2007 15:38:17 +1000
From:	Peter Williams <pwil3058@...pond.net.au>
To:	William Lee Irwin III <wli@...omorphy.com>
CC:	Ingo Molnar <mingo@...e.hu>, linux-kernel@...r.kernel.org,
	Linus Torvalds <torvalds@...ux-foundation.org>,
	Andrew Morton <akpm@...ux-foundation.org>,
	Con Kolivas <kernel@...ivas.org>,
	Nick Piggin <npiggin@...e.de>, Mike Galbraith <efault@....de>,
	Arjan van de Ven <arjan@...radead.org>,
	Thomas Gleixner <tglx@...utronix.de>, caglar@...dus.org.tr,
	Willy Tarreau <w@....eu>, Gene Heskett <gene.heskett@...il.com>
Subject: Re: [patch] CFS scheduler, v3

William Lee Irwin III wrote:
> William Lee Irwin III wrote:
>>> This essentially doesn't look correct because while you want to enforce
>>> the CPU bandwidth allocation, this doesn't have much to do with that
>>> apart from the CPU bandwidth appearing as a term. It's more properly
>>> a rate of service as opposed to a time at which anything should happen
>>> or a number useful for predicting such. When service should begin more
>>> properly depends on the other tasks in the system and a number of other
>>> decisions that are part of the scheduling policy.
> 
> On Sat, Apr 21, 2007 at 10:23:07AM +1000, Peter Williams wrote:
>> This model takes all of those into consideration.  The idea is not just 
>> to predict but to use the calculated time to decide when to boot the 
>> current process of the CPU (if it doesn't leave voluntarily) and put 
>> this one on.  This more or less removes the need to give each task a 
>> predetermined chunk of CPU when they go on to the CPU.  This should, in 
>> general, reduce the number context switches as tasks get to run until 
>> they've finished what they're doing or another task becomes higher 
>> priority rather than being booted off after an arbitrary time interval. 
>>  (If this ever gets tried it will be interesting to see if this 
>> prediction comes true.)
>> BTW Even if Ingo doesn't choose to try this model, I'll probably make a 
>> patch (way in the future after Ingo's changes are settled) to try it out 
>> myself.
> 
> I think I smoked out what you were doing.
> 
> 
> William Lee Irwin III wrote:
>>> If you want to choose a "quasi-inter-arrival time" to achieve the
>>> specified CPU bandwidth allocation, this would be it, but to use that
>>> to actually enforce the CPU bandwidth allocation, you would need to
>>> take into account the genuine inter-arrival time to choose an actual
>>> time for service to begin. In other words, this should be a quota for
>>> the task to have waited. If it's not waited long enough, then it should
>>> be delayed by the difference to achieve the inter-arrival time you're
>>> trying to enforce. If it's waited longer, it should be executed
>>> sooner modulo other constraints, and perhaps even credited for future
>>> scheduling cycles.
> 
> On Sat, Apr 21, 2007 at 10:23:07AM +1000, Peter Williams wrote:
>> The idea isn't to enforce the bandwidth entitlement to the extent of 
>> throttling tasks if they exceed their entitlement and there's no other 
>> tasks ready to use the CPU.  This is mainly because the bandwidth 
>> entitlement isn't fixed -- it's changing constantly as the number and 
>> type of runnable tasks changes.
> 
> Well, a little hysteresis will end up throttling in such a manner
> anyway as a side-effect,

Think of this as a calming influence :-)

> or you'll get anomalies. Say two tasks with
> equal entitlements compete, where one sleeps for 1/3 of the time and
> the other is fully CPU-bound. If only the times when they're in direct
> competition are split 50/50, then the CPU-bound task gets 2/3 and the
> sleeper 1/3, which is not the intended effect. I don't believe this
> model will be very vulnerable to it, though.

Nor me.

> 
> 
> William Lee Irwin III wrote:
>>> In order to partially avoid underallocating CPU bandwidth to p, one
>>> should track the time last spent sleeping and do the following:
> 
> On Sat, Apr 21, 2007 at 10:23:07AM +1000, Peter Williams wrote:
>> Yes I made a mistake in omitting to take into account sleep interval. 
>> See another e-mail to Ingo correcting this problem.
> 
> I took it to be less trivial of an error than it was. No big deal.

No, you were right it was definitely a NON trivial error.

> 
> 
> William Lee Irwin III wrote:
>>> In order to do better, longer-term history is required,
> 
> On Sat, Apr 21, 2007 at 10:23:07AM +1000, Peter Williams wrote:
>> The half life of the Kalman filter (roughly equivalent to a running 
>> average) used to calculate the averages determines how much history is 
>> taken into account.  It could be made configurable (at least, until 
>> enough real life experience was available to decide on the best value to 
>> use).
> 
> A Kalman filter would do better than a running average. I'm all for it.

As a long time user of Kalman filters I tend to think of them as the 
same thing.  I use the term running average when talking about the idea 
behind a scheduler because I think that more people will understand what 
the general idea is.  When it comes to implementation I always replace 
the idea of "running average" with a roughly equivalent Kalman filter.

> 
> 
> William Lee Irwin III wrote:
>>> To attempt to maintain an infinite history of
>>> bandwidth underutilization to be credited too far in the future would
>>> enable potentially long-term overutilization when such credits are
>>> cashed en masse for a sustained period of time. At some point you have
>>> to say "use it or lose it;" over a shorter period of time some smoothing
>>> is still admissible and even desirable.
> 
> On Sat, Apr 21, 2007 at 10:23:07AM +1000, Peter Williams wrote:
>> Yes, that's why I suggest a running average over the last few scheduling 
>> cycles for the task.  But thinking about it some more I'm now not so 
>> sure.  The lack of apparent "smoothness" when I've done this sort of 
>> thing with raw rather than smooth data (in modifications to the current 
>> dynamic priority based scheduler model) is usually noticed by running 
>> top and seeing wildly fluctuating dynamic priorities.  I'm not sure that 
>> the actual responsiveness of the system reflects this.  So I'm now 
>> willing to reserve my judgement on this issue.
> 
> I'm thinking smoothing should be over a relatively short period of time,
> probably shorter than human reflexes can discern.
> 
> 
> William Lee Irwin III wrote:
>>> One might also consider debits owing to non-preemptibility or other
>>> sorts of cpu bandwidth overutilization with similar caveats. A half-
>>> life to an otherwise infinite history of credits and/or debits
>>> specified in absolute time may be appropriate, though it's not quite as
>>> computationally cheap as the above. The accounting for these credits
>>> and debits would take the place of ->last_taken_off_the_cpu above.
> 
> On Sat, Apr 21, 2007 at 10:23:07AM +1000, Peter Williams wrote:
>> Credits shouldn't be necessary with this model if, instead of using the 
>> average "on CPU" time per cycle for a pre-empted task when requeuing it, 
>> you use the time period that it was "an CPU" before it was booted off as 
>> this should compensate it correctly.
> 
> That was all about trying to do more in the vein of CPU bandwidth
> allocation enforcement. It's not the only attack on the problem, and
> may not even be a desirable one.
> 
> 
> William Lee Irwin III wrote:
>>> Another attack on the problem of CPU bandwidth misallocation would be
>>> to further credit or debit the task according to the ratio of actual
>>> CPU bandwidth to allocated CPU bandwidth in order to compensate
>>> for prior failures to enforce the allocation.
> 
> On Sat, Apr 21, 2007 at 10:23:07AM +1000, Peter Williams wrote:
>> The fact that the allocated CPU bandwidth is changing continuously makes 
>> this not a good idea.  On a system where allocations were made directly 
>> rather being a side effect of "nice" and the number of runnable 
>> processes it might be the way to go.
> 
> Where it changes continuously the intended effect is some hysteresis.
> I'm thinking of the CPU-bound task vs. the task that sleeps 1/3 of the
> time and similar such cases. Interestingly, cfs seems to get the
> intended allocation by maintaining an effectively infinite history.
> This is another reason I think it's important to strive for the CPU
> bandwidth allocation "correctness;" it's already been accomplished in
> some respects, albeit without relative prioritization.
> 
> Maybe a good test for whether an extension for relative prioritization
> works would be to make the load weights for each nice level configurable
> via /proc/ and then see if the algorithm successfully allocates shares
> of CPU bandwidth in various cases accordingly.

Yes.  I think that we can fiddle fairly freely with these (subject to 
the fact that they currently cache part of the load balancer's average 
load calculation's need to multiply by SCHED_LOAD_SCALE in their value). 
  From the POV of the smpnice of the load balancer it just wants the 
values to reflect the system's interpretation of nice.  This used to be 
done by tying the values to the time slice allocations associated with 
nice but that is now history so we're free to redesign it.  As you say, 
having the ability to easily try different mappings would make it easier 
to experiment and find the best mapping.

> 
> 
> William Lee Irwin III wrote:
>>> The rest and/or stronger
>>> attempts to factor in sleeping time or bandwidth misallocation I don't
>>> consider so significant, at least not without some demonstration there
>>> is CPU bandwidth misallocation happening.
> 
> Right here I sort of downplayed trying too hard to enforce things, at
> least computationally.

I agree with that approach.  There's no sense spending too much effort 
being extremely precise when the input you use is probably invalid by 
the time you apply the output.

> 
> 
> William Lee Irwin III wrote:
>>> I don't know. It sounds rather standard to me.
> 
> On Sat, Apr 21, 2007 at 10:23:07AM +1000, Peter Williams wrote:
>> That's good to hear.
>> I also think that (as Ingo would probably prefer) that this mechanism 
>> might work without using averages i.e. just use the last "on CPU" time. 
>>  If it does work it would have the advantage of being impervious to the 
>> possible bad effects of CPU speed changes on those systems where this is 
>> a feature.  When using averages, I think that special consideration 
>> would need to be made for variable cpu speeds -- probably not that 
>> difficult but additional overhead anyway.
> 
> Timekeeping needs adjustments for all that, so the scheduler should
> just be able to call the timekeeping functions that need to be written
> for it anyway.

Unfortunately not.  Things like sleep and wait time need to be real time 
and only "on CPU" time needs to be "equivalent full speed on CPU time" 
in metrics. With a conversion back the other way when used to work out 
expected "on CPU" time.

> 
> 
> On Sat, Apr 21, 2007 at 10:23:07AM +1000, Peter Williams wrote:
>> I've been giving some thought into a mechanism for pre-empting the 
>> current task when another task's expected "on CPU" time arrives.  I 
>> think that it would be sufficient for scheduler_tick() to compare the 
>> current time (via sched_clock() or whatever) with the "on CPU" for the 
>> next task on the queue and initiation pre-emption if it's later than 
>> that time.  This increases the granularity a little but is compensated 
>> for by the fact it will help with the situation where more than one task 
>> on the queue has the same expected "on CPU" time in that each of them 
>> would get at least a tick (if they needed it).
> 
> Another nicety is that you can pretty much predict when you're going to
> need to preempt, so you can nuke the periodic scheduling interrupt and
> use a one-shot timer for whenever the next preemption is meant to occur.
> This nets power savings and less interrupt overhead for userspace.

I think the one shot timer is dangerous as it can possibly lead to 
infinite pre-emption loops if several tasks end up with the same "on 
CPU" and the periodic interrupt is therefore better.  Like you I feel 
sad about this as I hoped that scheduler_tick() would disappear.

> 
> 
> On Sat, Apr 21, 2007 at 10:23:07AM +1000, Peter Williams wrote:
>> If some form of precise timer was used (instead) to trigger pre-emption 
>> then, where there is more than one task with the same expected "on CPU" 
>> time, only the last would get any CPU (and that's only the case if some 
>> infinite loop doesn't get created).
> 
> I think you can check for this and adjust their on-cpu times in various
> ways and choose what order to run them in. I think it'd end up something
> of a missed deadline policy.

Arguably the latest one should go first as its estimate is more likely 
to be correct being based on more recent info i.e. the other's estimates 
would be based on less runnable tasks and be optimistic. This would 
involve pushing them down the queue and pushing items already on the 
queue downwards is likely to be expensive.  Which is why I prefer the 
less elegant solution of the periodical interrupt.

Of course, one solution might be to just add the adjustment to the key 
time on all tasks on the queue?

Peter
-- 
Peter Williams                                   pwil3058@...pond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/