Message-Id: <1247412708.6704.105.camel@laptop>
Date:	Sun, 12 Jul 2009 17:31:48 +0200
From:	Peter Zijlstra <a.p.zijlstra@...llo.nl>
To:	Douglas Niehaus <niehaus@...c.ku.edu>
Cc:	Henrik Austad <henrik@...tad.us>,
	LKML <linux-kernel@...r.kernel.org>, Ingo Molnar <mingo@...e.hu>,
	Bill Huey <billh@...ppy.monkey.org>,
	Linux RT <linux-rt-users@...r.kernel.org>,
	Fabio Checconi <fabio@...dalf.sssup.it>,
	"James H. Anderson" <anderson@...unc.edu>,
	Thomas Gleixner <tglx@...utronix.de>,
	Ted Baker <baker@...fsu.edu>,
	Dhaval Giani <dhaval.giani@...il.com>,
	Noah Watkins <jayhawk@....ucsc.edu>,
	KUSP Google Group <kusp@...glegroups.com>
Subject: Re: RFC for a new Scheduling policy/class in the Linux-kernel

On Sat, 2009-07-11 at 21:40 -0500, Douglas Niehaus wrote:
> Peter:
>     Perhaps you could expand on what you meant when you said:
> 
> 	Thing is, both BWI and PEP seems to work brilliantly on Uni-Processor
> 	but SMP leaves things to be desired. Dhaval is currently working on a
> 	PEP implementation that will migrate all the blocked tasks to the
> 	owner's cpu, basically reducing it to the UP problem.
> 
> What is left to be desired with PEP on SMP? I am not saying it is 
> perfect, as I can think of a few things I would like to improve or 
> understand better, but I am curious what you have in mind.

Right, please don't take this as a criticism of PEP; any scheme I
know of has enormous complications on SMP ;-)

But the thing that concerns me most is that there seem to be a few O(n)
consequences. Suppose that for each resource (or lock) R_i there is a
block graph G_i, which consists of n nodes and is m deep.

Functionally, (generalized) PIP and PEP are identical; the big
difference is that PIP uses waitqueues to encode the block graph G,
whereas PEP leaves everybody on the runqueue and encodes G in the proxy
field.
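
Very roughly, and purely as a userspace doodle (all struct and field
names here are made up, not the actual kernel data structures), the two
encodings differ like so:

struct task {
	int          prio;
	struct task *proxy;      /* PEP: NULL while runnable, else the task we wait on */
	struct task *wait_next;  /* PIP: linkage on a resource's waitqueue */
};

struct resource {
	struct task *owner;
	struct task *waiters;    /* PIP: head of the waitqueue encoding G_i */
};

/* PIP: blocking takes the task off the runqueue and onto R's waitqueue. */
static void pip_block(struct resource *r, struct task *t)
{
	t->wait_next = r->waiters;
	r->waiters = t;
}

/* PEP: the task stays on the runqueue; the proxy pointer records the
 * edge in G_i instead. */
static void pep_block(struct resource *r, struct task *t)
{
	t->proxy = r->owner;
}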

The downside of PIP is that the waitqueue needs to re-implement the full
schedule function in order to evaluate the highest-prio task on the
waitqueue. Traditionally this was rather easy, since you'd only
consider the limited SCHED_FIFO static prio range, leaving you with an
O(1) evaluation; once you add more complex scheduling functions, things
get considerably more involved. Let's call this cost S.

So for PIP you get O(m*S) evaluations whenever the block graph changes,
since the change has to propagate through all m levels.
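
To make S a bit more concrete, a doodle of the waitqueue pick (again
with made-up types): with a static prio range you keep the queue
sorted, the pick is the head, O(1); with a richer scheduling function
the pick has to mirror that function, and every change to G re-runs it
at each of the m levels.

struct waiter {
	int            prio;
	struct waiter *next;   /* kept prio-sorted on insertion, plist-style */
};

/* Static-prio case: the head is the top waiter, O(1). */
static struct waiter *top_waiter_static(struct waiter *head)
{
	return head;
}

/* General case, cost S: evaluate the scheduling function over the queue;
 * better() stands in for whatever the scheduling class would compare
 * (deadlines, server budgets, ...). */
static struct waiter *top_waiter_general(struct waiter *head,
		int (*better)(const struct waiter *, const struct waiter *))
{
	struct waiter *best = head, *w;

	for (w = head; w; w = w->next)
		if (better(w, best))
			best = w;
	return best;
}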

Now for PEP, you get an increased O(m) cost in schedule(), which can be
compared to the PIP cost.
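
That O(m) is just the proxy-chain walk; reusing the proxy field from
the doodle above:

/* PEP's schedule-time step: pick a task as usual, then follow the proxy
 * chain, at most m hops, and run the owner at the end of it on the
 * picked task's behalf. */
static struct task *pep_pick(struct task *picked)
{
	while (picked->proxy)
		picked = picked->proxy;
	return picked;
}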

However PEP on SMP needs to ensure all n tasks in G_i are on the same
cpu, because otherwise we can end up wanting to execute the resource
owner on multiple cpus at the same time, which is bad.

This can of course be amortized, but you end up having to migrate the
task (or an avatar thereof) to the owner's cpu (if you were to migrate
the owner to the blocker's cpu instead, you'd quickly run into trouble
when there are multiple blockers), and any way around this ends up
being O(n).
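
In doodle form, with migrate_to() a hypothetical stand-in for the real
migration machinery and the wait_next links only used to enumerate G_i:

/* Pull every blocked task in G_i over to the owner's cpu, so the proxy
 * walk never has to cross runqueues; n iterations, hence O(n). */
static void pep_pull_waiters(struct resource *r, int owner_cpu,
			     void (*migrate_to)(struct task *, int))
{
	struct task *t;

	for (t = r->waiters; t; t = t->wait_next)
		migrate_to(t, owner_cpu);
}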

Also, when the owner gets blocked on something that doesn't have an
owner (io completion, or a traditional semaphore), you have to take all
n tasks from the runqueue (and back again when they do become runnable).
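
Again as a doodle, with dequeue() a made-up stand-in for the real
runqueue op; there's simply nobody left to proxy for:

/* Owner blocks on an ownerless resource: all n tasks in G_i come off
 * the runqueue, and go back on when the owner wakes; another O(n) walk. */
static void pep_owner_sleeps(struct resource *r,
			     void (*dequeue)(struct task *))
{
	struct task *t;

	dequeue(r->owner);
	for (t = r->waiters; t; t = t->wait_next)
		dequeue(t);
}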

PIP doesn't suffer this, but does suffer the pain of having to
reimplement the full schedule function on the waitqueues, which, when
you have hierarchical scheduling, means you have to replicate the full
hierarchy per waitqueue.

Furthermore, we cannot assume locked sections are short; we must indeed
assume that it can be any resource in the kernel, associated with any
service, used by any thread. Worse, it can be any odd userspace
resource/thread too, since we expose the block graph to userspace
processes through PI-futexes.
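
For completeness, the userspace end of that exposure, this time real
API rather than a doodle: a pthread mutex with PTHREAD_PRIO_INHERIT
set, which glibc maps onto FUTEX_LOCK_PI and thereby hooks the caller
into the in-kernel block graph.

#include <pthread.h>

/* Create a mutex whose waiters enter the kernel's PI machinery. */
static int make_pi_mutex(pthread_mutex_t *m)
{
	pthread_mutexattr_t attr;
	int err;

	if ((err = pthread_mutexattr_init(&attr)))
		return err;
	if ((err = pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT))) {
		pthread_mutexattr_destroy(&attr);
		return err;
	}
	err = pthread_mutex_init(m, &attr);
	pthread_mutexattr_destroy(&attr);
	return err;
}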

