Date:	Wed, 18 Apr 2007 12:39:07 +1000
From:	Peter Williams <pwil3058@...pond.net.au>
To:	"Michael K. Edwards" <medwards.linux@...il.com>
CC:	Ingo Molnar <mingo@...e.hu>, Nick Piggin <npiggin@...e.de>,
	Mike Galbraith <efault@....de>,
	Con Kolivas <kernel@...ivas.org>, ck list <ck@....kolivas.org>,
	Bill Huey <billh@...ppy.monkey.org>,
	linux-kernel@...r.kernel.org,
	Linus Torvalds <torvalds@...ux-foundation.org>,
	Andrew Morton <akpm@...ux-foundation.org>,
	Arjan van de Ven <arjan@...radead.org>,
	Thomas Gleixner <tglx@...utronix.de>
Subject: Re: [Announce] [patch] Modular Scheduler Core and Completely Fair
 Scheduler [CFS]

Michael K. Edwards wrote:
> On 4/17/07, Peter Williams <pwil3058@...pond.net.au> wrote:
>> The other way in which the code deviates from the original is that (for
>> a few years now) I no longer calculate CPU bandwidth usage directly.
>> I've found that the overhead is less if I keep a running average of the
>> size of a task's CPU bursts and the length of its scheduling cycle (i.e.
>> from going on the CPU one time to going on the CPU the next time) and
>> use the ratio of these values as a measure of bandwidth usage.
>>
>> Anyway it works and gives very predictable allocations of CPU bandwidth
>> based on nice.
> 
> Works, that is, right up until you add nonlinear interactions with CPU
> speed scaling.  From my perspective as an embedded platform
> integrator, clock/voltage scaling is the elephant in the scheduler's
> living room.  Patch in DPM (now OpPoint?) to scale the clock based on
> what task is being scheduled, and suddenly the dynamic priority
> calculations go wild.  Nip this in the bud by putting an RT priority
> on the relevant threads (which you have to do anyway if you need
> remotely audio-grade latency), and the lock affinity heuristics break,
> so you have to hand-tune all the thread priorities.  Blecch.
> 
> Not to mention the likelihood that the task whose clock speed you're
> trying to crank up (say, a WiFi soft MAC) needs to be _lower_ priority
> than the application.  (You want to crank the CPU for this task
> because it runs with the RF hot, which may cost you as much power as
> the rest of the platform.)  You'd better hope you can remove it from
> the dynamic priority heuristics with SCHED_BATCH.  Otherwise
> everything _else_ has to be RT priority (or it'll be starved by the
> soft MAC) and you've basically tossed SCHED_NORMAL in the bin.  Double
> blecch!
> 
> Is it too much to ask for someone with actual engineering training
> (not me, unfortunately) to sit down and build a negative-feedback
> control system that handles soft-real-time _and_ dynamic-priority
> _and_ batch loads, CPU _and_ I/O scheduling, preemption _and_ clock
> scaling?  And actually separates the accounting and control mechanisms
> from the heuristics, so the latter can be tuned (within a well
> documented stable range) to reflect the expected system usage
> patterns?
> 
> It's not like there isn't a vast literature in this area over the past
> decade, including some dealing specifically with clock scaling
> consistent with low-latency applications.  It's a pity that people
> doing academic work in this area rarely wade into LKML, even when
> they're hacking on a Linux fork.  But then, there's not much economic
> incentive for them to do so, and they can usually get their fill of
> citation politics and dominance games without leaving their home
> department.  :-P
> 
> Seriously, though.  If you're really going to put the mainline
> scheduler through this kind of churn, please please pretty please knit
> in per-task clock scaling (possibly even rejigged during the slice;
> see e.g. Yuan and Nahrstedt's GRACE-OS papers) and some sort of
> linger mechanism to keep from taking context switch hits when you're
> confident that an I/O will complete quickly.
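
For reference, the burst/cycle statistic I described above works roughly
like this (an illustrative sketch only, with made-up names, not the
actual spa_ebs code):

struct task_stats {
	unsigned long long avg_burst_ns;	/* average length of a CPU burst */
	unsigned long long avg_cycle_ns;	/* average on-CPU to next on-CPU interval */
};

/* Simple exponentially weighted moving average; the 3/4 decay is arbitrary. */
static unsigned long long ewma(unsigned long long avg, unsigned long long sample)
{
	return (3 * avg + sample) / 4;
}

/*
 * Called when a task comes off the CPU: burst_ns is the time it just
 * spent on the CPU, cycle_ns is the time since it last went onto the CPU.
 */
static void update_usage_stats(struct task_stats *ts,
			       unsigned long long burst_ns,
			       unsigned long long cycle_ns)
{
	ts->avg_burst_ns = ewma(ts->avg_burst_ns, burst_ns);
	ts->avg_cycle_ns = ewma(ts->avg_cycle_ns, cycle_ns);
}

/* Bandwidth usage as a fixed point fraction scaled to 0..1024. */
static unsigned long long bandwidth_usage(const struct task_stats *ts)
{
	if (!ts->avg_cycle_ns)
		return 0;
	return (ts->avg_burst_ns << 10) / ts->avg_cycle_ns;
}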

I think that this doesn't affect the basic design principles of spa_ebs 
but just means that the statistics that it uses need to be rethought. 
E.g. instead of measuring average CPU usage per burst in terms of wall 
clock time spent on the CPU, measure it in terms of CPU capacity (for 
want of a better word) used per burst.

I don't have suitable hardware for investigating this line of attack 
further, unfortunately, and have no idea what would be the best way to 
calculate this new statistic.
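
A naive first guess (purely illustrative, and the way the current and
maximum clock frequencies would be obtained here is invented) might be to
weight each burst by the ratio of the current clock to the maximum clock,
so that a burst run at half speed counts as half the capacity used:

/*
 * Hypothetical: convert the wall clock time of a burst into "capacity
 * used" by scaling it with the CPU frequency it ran at, relative to the
 * maximum frequency.  Where cur_khz and max_khz come from (cpufreq,
 * presumably) is left open, and if the clock changes mid-burst this
 * would have to be accumulated piecewise.
 */
static unsigned long long capacity_used_ns(unsigned long long burst_ns,
					   unsigned long cur_khz,
					   unsigned long max_khz)
{
	if (!max_khz)
		return burst_ns;
	return burst_ns * cur_khz / max_khz;
}

Feeding that into the same running averages in place of the raw burst
time would leave the rest of spa_ebs untouched, but whether a simple
frequency ratio is a good proxy for capacity on real hardware is exactly
the sort of thing I can't test.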

Peter
-- 
Peter Williams                                   pwil3058@...pond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce
