Message-ID: <f2b55d220704171600w2de57cafhc1b2d8d7df67cc6b@mail.gmail.com>
Date: Tue, 17 Apr 2007 16:00:53 -0700
From: "Michael K. Edwards" <medwards.linux@...il.com>
To: "Peter Williams" <pwil3058@...pond.net.au>
Cc: "Ingo Molnar" <mingo@...e.hu>, "Nick Piggin" <npiggin@...e.de>,
"Mike Galbraith" <efault@....de>,
"Con Kolivas" <kernel@...ivas.org>, "ck list" <ck@....kolivas.org>,
"Bill Huey" <billh@...ppy.monkey.org>,
linux-kernel@...r.kernel.org,
"Linus Torvalds" <torvalds@...ux-foundation.org>,
"Andrew Morton" <akpm@...ux-foundation.org>,
"Arjan van de Ven" <arjan@...radead.org>,
"Thomas Gleixner" <tglx@...utronix.de>
Subject: Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
On 4/17/07, Peter Williams <pwil3058@...pond.net.au> wrote:
> The other way in which the code deviates from the original is that (for
> a few years now) I no longer calculate CPU bandwidth usage directly.
> I've found that the overhead is less if I keep a running average of the
> size of a task's CPU bursts and the length of its scheduling cycle (i.e.
> from going on CPU one time to going on CPU the next time) and use the
> ratio of these values as a measure of bandwidth usage.
>
> Anyway it works and gives very predictable allocations of CPU bandwidth
> based on nice.
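Just so we're talking about the same thing, the estimator I read that as
is roughly the sketch below; the field names and the averaging weight are
my guesses, not your actual code:

#define EST_SHIFT       3       /* EWMA weight: newest sample counts 1/8 */

struct bw_est {
        unsigned long avg_burst;        /* average CPU burst length (ns) */
        unsigned long avg_cycle;        /* average on-CPU-to-on-CPU interval (ns) */
};

/* Called when a task comes off the CPU, with this burst's length and the
 * length of the cycle since it last went on the CPU, in nanoseconds. */
static void bw_update(struct bw_est *est, unsigned long burst,
                      unsigned long cycle)
{
        est->avg_burst = est->avg_burst - (est->avg_burst >> EST_SHIFT)
                         + (burst >> EST_SHIFT);
        est->avg_cycle = est->avg_cycle - (est->avg_cycle >> EST_SHIFT)
                         + (cycle >> EST_SHIFT);
}

/* Bandwidth usage as a fixed-point fraction scaled by 1024:
 * ~1024 means the task eats its whole cycle, ~102 means ~10%. */
static unsigned long bw_usage(const struct bw_est *est)
{
        if (!est->avg_cycle)
                return 0;
        return (unsigned long)(((unsigned long long)est->avg_burst << 10)
                               / est->avg_cycle);
}
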
Works, that is, right up until you add nonlinear interactions with CPU
speed scaling. From my perspective as an embedded platform
integrator, clock/voltage scaling is the elephant in the scheduler's
living room. Patch in DPM (now OpPoint?) to scale the clock based on
what task is being scheduled, and suddenly the dynamic priority
calculations go wild. Nip this in the bud by putting an RT priority
on the relevant threads (which you have to do anyway if you need
remotely audio-grade latency), and the lock affinity heuristics break,
so you have to hand-tune all the thread priorities. Blecch.
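To make that concrete: the moment the clock is scaled per task, a
wall-clock burst length stops meaning anything unless it's normalized to
some reference frequency, roughly like this (the frequency arguments are
hypothetical, I'm not pointing at any real cpufreq interface):

/* Illustration only: convert a wall-clock burst measured at cur_khz into
 * the burst it would have been at max_khz, so the burst/cycle ratio above
 * doesn't swing with every frequency change. */
static unsigned long burst_at_ref_freq(unsigned long burst_ns,
                                       unsigned long cur_khz,
                                       unsigned long max_khz)
{
        /* e.g. 10ms of wall clock at 200MHz is ~2ms of work at 1GHz */
        return (unsigned long)((unsigned long long)burst_ns * cur_khz
                               / max_khz);
}
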
Not to mention the likelihood that the task whose clock speed you're
trying to crank up (say, a WiFi soft MAC) needs to be _lower_ priority
than the application. (You want to crank the CPU for this task
because it runs with the RF hot, which may cost you as much power as
the rest of the platform.) You'd better hope you can remove it from
the dynamic priority heuristics with SCHED_BATCH. Otherwise
everything _else_ has to be RT priority (or it'll be starved by the
soft MAC) and you've basically tossed SCHED_NORMAL in the bin. Double
blecch!
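For what it's worth, by "remove it from the dynamic priority heuristics
with SCHED_BATCH" I mean nothing more exotic than the userspace knob that
already exists, something like this from the soft-MAC thread itself
(error handling elided; whether it actually saves you is the question):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
        struct sched_param sp = { .sched_priority = 0 };

        /* SCHED_BATCH tells the scheduler not to treat this thread as
         * interactive, so its sleeps don't buy it a priority bonus. */
        if (sched_setscheduler(0, SCHED_BATCH, &sp) < 0)
                perror("sched_setscheduler");
        return 0;
}
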
Is it too much to ask for someone with actual engineering training
(not me, unfortunately) to sit down and build a negative-feedback
control system that handles soft-real-time _and_ dynamic-priority
_and_ batch loads, CPU _and_ I/O scheduling, preemption _and_ clock
scaling? And actually separates the accounting and control mechanisms
from the heuristics, so the latter can be tuned (within a well
documented stable range) to reflect the expected system usage
patterns?
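By "separates the accounting and control mechanisms from the heuristics"
I mean something shaped like the following, pure caricature, every name
here invented: measurements on one side, knobs on the other, and the
tunable feedback law as the only thing in between.

struct task_struct;     /* whatever entity is being scheduled */

/* Accounting layer: measurements only, no policy. */
struct sched_measurement {
        unsigned long cpu_share;        /* observed share of CPU, per-mille */
        unsigned long wakeup_latency_ns;
        unsigned long cur_khz;          /* current clock on this task's CPU */
};

/* Actuation layer: knobs only, no policy. */
struct sched_actuator {
        void (*set_weight)(struct task_struct *p, unsigned long weight);
        void (*set_cpu_khz)(int cpu, unsigned long khz);
};

/* The heuristic proper: a feedback law from measurements to knobs, with
 * its tunables documented and its stable range characterized. */
struct sched_policy {
        void (*control)(struct task_struct *p,
                        const struct sched_measurement *m,
                        const struct sched_actuator *act);
};
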
It's not like there isn't a vast literature in this area over the past
decade, including some dealing specifically with clock scaling
consistent with low-latency applications. It's a pity that people
doing academic work in this area rarely wade into LKML, even when
they're hacking on a Linux fork. But then, there's not much economic
incentive for them to do so, and they can usually get their fill of
citation politics and dominance games without leaving their home
department. :-P
Seriously, though. If you're really going to put the mainline
scheduler through this kind of churn, please please pretty please knit
in per-task clock scaling (possibly even rejigged during the slice;
see e.g. Yuan and Nahrstedt's GRACE-OS papers) and some sort of
linger mechanism to keep from taking context switch hits when you're
confident that an I/O will complete quickly.
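And by "linger mechanism" I mean nothing fancier than a test like this at
the point where we'd otherwise schedule away; the expected-wait estimate
and the switch cost are of course the hard part, and both numbers below
are made up:

#define CTX_SWITCH_COST_NS      4000    /* assumed; measure per platform */

/* If the I/O we just issued is expected to complete in less time than a
 * context-switch round trip costs, poll for it instead of sleeping. */
static int should_linger(unsigned long expected_wait_ns)
{
        return expected_wait_ns < CTX_SWITCH_COST_NS;
}
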
Cheers,
- Michael