[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20070131084331.GA3404@elte.hu>
Date: Wed, 31 Jan 2007 09:43:31 +0100
From: Ingo Molnar <mingo@...e.hu>
To: Linus Torvalds <torvalds@...ux-foundation.org>
Cc: Nick Piggin <nickpiggin@...oo.com.au>,
Benjamin Herrenschmidt <benh@...nel.crashing.org>,
Zach Brown <zach.brown@...cle.com>,
Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
linux-aio@...ck.org, Suparna Bhattacharya <suparna@...ibm.com>,
Benjamin LaHaise <bcrl@...ck.org>,
Ulrich Drepper <drepper@...hat.com>
Subject: Re: [PATCH 0 of 4] Generic AIO by scheduling stacks
* Linus Torvalds <torvalds@...ux-foundation.org> wrote:
> [ Of course, that used to also be the claim by all the people who
> thought we couldn't do native kernel threads for "normal" threading
> either, and should go with the n*m setup. Shows how much they knew
> ;^]
>
> But I've certainly _personally_ always wanted to do AIO with threads.
> I wanted to do it with regular threads (ie the "clone()" kind). It
> didn't fly. But I think we can possibly lower both the setup costs and
> the memory costs with the cooperative approach, to the point where
> maybe this one is more palatable and workable.
as you know i've been involved in all the affected IO and scheduling
disciplines in question: i hacked on scheduling, 1:1 threading, on Tux,
i even optimized kaio under TPC workloads, and various other things. So
i believe to have seen most of these things first-hand and thus i
(pretend to) have no particular prejudice or other subjective
preference. In the following, rather long (and boring) analysis i touch
upon both 1:1 threading, light threads, state machines and AIO.
The first and most important thing is that i think there are only /two/
basic, fundamental IO approaches that matter to kernel design:
1) the most easy to program one.
2) the fastest one.
1), the most easily programmed model: this is tasks and synchronous IO.
Samba (and Squid, and even Apache most of the time ...) is still using
plain old /processes/ and is doing fine! Most of the old LWP arguments
are not really true anymore, TLB switches are very fast meanwhile (per
hardware improvements) and our scheduler is still pretty light.
Furthermore, for certain specialized environments that are able isolate
the programmer from the risks and complexities of thread programming,
threading can be somewhat faster (such as Java or C#). But for the
general C/C++ based server environment even threading is pure hell. (We
know it from the kernel that programming a threaded design is /hard/,
needs alot of discipline and takes alot more resources than any of the
saner models.)
2) the fastest model: is a pure, minimal state-machine tailored to the
task/workload we want to do. Tux does that (and i'm keeping it uptodate
for current 2.6 kernels too), and it still beats the pants off anything
user-space.
But reality is that most people care more about programmability: 300
MB/sec or 500 MB/sec web throughput limit (or a serving limit 30K
requests in parallel or 60K requests in parallel - whatever the
performance metric you care about is) isnt really making any difference
in all but the most specialized environments - /because/ the price to
pay for it is much worse programmability (and hence much worse
flexibility). [ And this is all about basic economy: if it's 10 times
harder to program something and what takes 1 month to program in one
model takes 10 months to program in the other model - but in 10 months
the hardware already got another 50% faster, likely offsetting the
advantage of that 'other' model to begin with ... ]
Note that often the speed difference between task based and state based
designs is alot smaller. But God is the programming complexity
difference huge. Not even huge but /HUGE/. All our programming languages
are procedural and stack based. (Not only is there a basic lack of
state-based programming languages, it's a product of our brain
structure: state machines are not really what the human mind is
structured for, but i digress.)
Now where do all these LWP, fibre, firbril, micro-thread or N:M concepts
fit? Most of the time they are just a /weakening/ of the #1 concept. And
that's why they will lose out, because #1 is all about programmability
and they dont offer anything new: because they cannot. Either you go for
programmability or you go for performance. There is /no/ middle ground
for us in the kernel! (User-space can do the middle ground thing, but
the kernel must only be structured around the two most fundamental
concepts that exist. [NOTE: more about KAIO later.])
Having a 1:1 relationship between user-space and kernel-space context is
a strong concept, and on modern CPUs we perform /process/ context
switches in 650 nsecs, we do user thread context switches in 500 nsecs,
and kernel<->kernel thread context switches in 350 nsecs. That's pretty
damn fast.
The big problem is, the M:N concepts (and i consider micro-threads or
'schedulable stacks' to be of that type) tend to operate the following
way: they build up hype by being "faster" - but the performance
advantage only comes from weakening #1! (for example by not making the
contexts generally schedulable, bindable, load-balancable, suspendable,
migratable, etc.) But then a few years later they grow (by virtue of
basic pressure from programmers, who /want/ a clean easily programmable
concept #1) the /very same/ features that they "optimized away" in the
beginning. With the difference that now all that infrastructure they
left out initially is replicated in user-space, in a poorer and
inconsistent way, blowing up cache-size, and slowing the /sane/ things
down in the end. (and then i havent even begun about the nightmare of
extending the debug infrastructure to light threads, ABI dependencies,
etc, etc...)
One often repeated (because pretty much only) performance advantage of
'light threads' is context-switch performance between user-space
threads. But reality is, nobody /cares/ about being able to
context-switch between "light user-space threads"! Why? Because there
are only two reasons why such a high-performance context-switch would
occur:
1) there's contention between those two tasks. Wonderful: now two
artificial threads are running on the /same/ CPU and they are even
contending each other. Why not run a single context on a single CPU
instead and only get contended if /another/ CPU runs a conflicting
context?? While this makes for nice "pthread locking benchmarks",
it is not really useful for anything real.
2) there has been an IO event. The thing is, for IO events we enter the
kernel no matter what - and we'll do so for the next 10 years at
minimum. We want to abstract away the hardware, we want to do
reliable resource accounting, we want to share hardware resources,
we want to rate-limit, etc., etc. While in /theory/ you could handle
IO purely from user-space, in practice you dont want to do that. And
if we accept the premise that we'll enter the kernel anyway, there's
zero performance difference between scheduling right there in the
kernel, or returning back to user-space to schedule there. (in fact
i submit that the former is faster). Or if we accept the theoretical
possibility of 'perfect IO hardware' that implements /all/ the
features that the kernel wants (in a secure and generic way, and
mind you, such IO hardware does not exist yet), then /at most/ the
performance advantage of user-space doing the scheduling is the
overhead of a null syscall entry. Which is a whopping 100 nsecs on
modern CPUs! That's roughly the latency of a /single/ DRAM access!
Furthermore, 'light thread' concepts can no way approach the performance
of #2 state-machines: if you /know/ what the structure of your context
is, and you can program it in a specialized state-machine way, there's
just so many shortcuts possible that it's not even funny.
> And maybe it also solves some of the scalability worries (threads have
> ID space and scheduling setup things that essentially go away by just
> not doing them - which is what the fibrils simply wouldn't have).
our PID management is very fast - i've never seen it show up in
profiles. (We could make it even more scalable if the need arises, for
example right now we still maintain the luxory of having a globally
increasing and tightly compressed PID/TID space.)
until a few days ago we even had a /global lock/ in the thread/task
creation and destruction hotpath since 2.5.43 (when oprofile support was
added, more than 4 years ago, the oprofile exit notifier lock) and
nobody really noticed or cared!
so i believe there are only two directions that make sense in the long
run:
1) improve our basic #1 design gradually. If something is a bottleneck,
if the scheduler has grown too fat, cut some slack. If micro-threads
or fibrils offer anything nice for our basic thread model: integrate
it into the kernel. (Dont worry about having the enter the kernel for
that, we /want/ that isolation! Resource control /needs/ a secure and
trustable context. Good debuggability /needs/ a secure and trustable
context. Hardware makers will make damn sure that such hardware-based
isolation is fast as lightning. SYSENTER will continue to be very
fast.) In any case, unless someone can answer the issues i raised
above, please dont mess with the basic 1:1 task model we have. It's
really what we want to have.
2) enable crazy people to do IO in the #2 way. For this i think the most
programmable interface is KAIO - because the 'context' /only/
involves the IO entity itself, which is simple enough for our brain
to wrap itself around. (and the hardware itself has the concept of
'in flight IO' anyway, so people are forced to learn about it and to
deal with it anyway, even in the synchronous IO model. So there's
quite some mindshare to build upon.) This also happens to realize
/most/ of the performance advantages that a state-machine like Tux
tends to have. (In that sense i dont see epoll to be conceptually
different from KAIO, although internally within the kernel it's quite
a bit different.)
now, KAIO 'behind the hood' (within the kernel) is like /very/ hard to
program in a fully state-driven manner - but we are crazy folks who
/might/ be able to pull it off. Initially we should just simulate the
really hard bits via a pool of in-kernel synchronous threads. Some IO
disciplines are already state-driven internally (networking most
notably, but also timers), so there KAIO can be implemented 'natively'.
And it will be /faster/ than micro-threads based AIO!
but frankly, KAIO is the most i can see a normal, sane human programmer
to be able to deal with. Any more exposure to state-machines will just
drive them crazy, and they'll lynch us (or more realistically: ignore
us) if we try anything more. And with KAIO they still have the /option/
to make all their user-space functionality state-driven. Also, abstract,
fully managed programming environments like Java can hide KAIO
complexities by implementing their synchronous primitives using KAIO.
Ingo
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Powered by blists - more mailing lists