Message-ID: <20070216122806.GA27455@elte.hu>
Date: Fri, 16 Feb 2007 13:28:06 +0100
From: Ingo Molnar <mingo@...e.hu>
To: Linus Torvalds <torvalds@...ux-foundation.org>
Cc: Evgeniy Polyakov <johnpol@....mipt.ru>,
Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
Arjan van de Ven <arjan@...radead.org>,
Christoph Hellwig <hch@...radead.org>,
Andrew Morton <akpm@....com.au>,
Alan Cox <alan@...rguk.ukuu.org.uk>,
Ulrich Drepper <drepper@...hat.com>,
Zach Brown <zach.brown@...cle.com>,
"David S. Miller" <davem@...emloft.net>,
Benjamin LaHaise <bcrl@...ck.org>,
Suparna Bhattacharya <suparna@...ibm.com>,
Davide Libenzi <davidel@...ilserver.org>,
Thomas Gleixner <tglx@...utronix.de>
Subject: Re: [patch 05/11] syslets: core code
* Linus Torvalds <torvalds@...ux-foundation.org> wrote:
> On Thu, 15 Feb 2007, Linus Torvalds wrote:
> >
> > So I think that a good implementation just does everything up-front,
> > and doesn't _need_ a user buffer that is live over longer periods,
> > except for the actual results. Exactly because the whole
> > alloc/teardown is nasty.
>
> Btw, this doesn't necessarily mean "not supporting multiple atoms at
> all".
>
> I think the batching of async things is potentially a great idea. I
> think it's quite workable for "open+fstat" kind of things, and I agree
> that it can solve other things too (the
> "socket+bind+connect+sendmsg+rcv" kind of complex setup things).
>
> But I suspect that if we just said:
> - we limit these atom sequences to just linear sequences of max "n" ops
> - we read them all in, in a single go, at startup
>
> we actually avoid several nasty issues. Not just the memory allocation
> issue in user space (now it's perfectly ok to build up a sequence of
> ops in temporary memory and throw it away once it's been submitted),
> but also issues like the 32-bit vs 64-bit compatibility stuff (the
> compat handlers would just convert it when they do the initial
> copying, and then the actual run-time wouldn't care about user-level
> pointers having different sizes etc).
>
> Would it make the interface less cool? Yeah. Would it limit it to just
> a few linked system calls (to avoid memory allocation issues in the
> kernel)? Yes again. But it would simplify a lot of the interface
> issues.
>
> It would _also_ allow the "sys_aio_read()" function to build up its
> *own* set of atoms in kernel space to actually do the read, and there
> would be no impact of the actual run-time wanting to read stuff from
> user space. Again - it's actually the same issue as with the compat
> system call: by making the interfaces do things up-front rather than
> dynamically, it becomes more static, but also easier to do interface
> translations. You can translate into any arbitrary internal format
> _once_, and be done with it.
>
> I dunno.
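To make the above concrete, here is how I read your up-front model - a
minimal sketch, with every name invented (this is not a proposed ABI):

/*
 * "Linear sequence of max n ops, read in a single go": a bounded
 * vector of ops that the kernel copies in completely at submission
 * time. All names are made up for illustration.
 */
#define SEQ_MAX_OPS	8			/* "max n ops" */

struct seq_op {
	unsigned long	nr;			/* syscall number */
	unsigned long	args[6];		/* plain-value arguments */
};

struct seq_submit {
	unsigned long	nr_ops;			/* <= SEQ_MAX_OPS */
	struct seq_op	ops[SEQ_MAX_OPS];	/* copied in as one block */
	long		results[SEQ_MAX_OPS];	/* the only long-lived part */
};

Because the kernel would copy the whole thing in one go, user-space can
build the sequence in temporary memory and throw it away right after
submission, and a compat handler can convert the block once, up front -
the run-time never has to look at user-level pointers again.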
[ hm. I again wrote a pretty long email for you to read. Darn! ]
Regarding the API - I share most of your concerns, and it's all a
function of how widely we want to push this into user-space.
My initial thought was for syslets to be used by glibc mainly as small,
secure kernel-side 'syscall plugins' - so that it can do things like
'POSIX AIO signal notifications' (which are madness in terms of
performance, but which applications rely on) /without/ having to burden
the kernel-side AIO with such requirements: glibc just appends an
enclosing sys_kill() to the syslet and it will do the proper signal
notification, asynchronously. (And of course syslets can be used for
the Tux type of performance sillinesses as well ;-)
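To make that concrete, here is roughly what glibc could build for such
a notification - a sketch against the uatom layout posted in this
series; the variable names (and the pread64+kill choice) are just for
illustration:

#include <sys/syscall.h>	/* __NR_pread64, __NR_kill */

/* uatom layout as in this series (user-space view, __user dropped): */
struct syslet_uatom {
	unsigned long		flags;
	unsigned long		nr;
	long			*ret_ptr;
	struct syslet_uatom	*next;
	unsigned long		*arg_ptr[6];
	void			*private;
};

/* all arguments are passed via pointers to longs: */
static unsigned long a_fd, a_buf, a_count, a_pos, a_pid, a_sig;
static long read_ret, kill_ret;

static struct syslet_uatom kill_atom = {
	.nr	 = __NR_kill,
	.ret_ptr = &kill_ret,
	.arg_ptr = { &a_pid, &a_sig },
	.next	 = NULL,			/* end of the syslet */
};

static struct syslet_uatom read_atom = {
	.nr	 = __NR_pread64,
	.ret_ptr = &read_ret,
	.arg_ptr = { &a_fd, &a_buf, &a_count, &a_pos },
	.next	 = &kill_atom,			/* then: signal the issuer */
};

/* glibc fills in a_* and submits read_atom via sys_async_exec() */

The point being that the notification policy lives entirely in the
glibc-built atom chain - the kernel-side AIO code never learns about
it.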
So a sane user API (all used at the glibc level, not at the application
level) would use simple syslets, while more broken ones would have to
use longer ones - but nobody would have the burden of having to
synchronize back to the issuer context. Natural selection would pull
application use towards the APIs with the shorter syslets. (At least so
I hope.)
In this model syslets aren't really user-programmable entities but
rather small plugins available to glibc to build up more complex, more
innovative (or just more broken) APIs than what the kernel wants to
provide - without putting any true new ABI dependency on the kernel,
other than the already existing syscall ABIs.
But if we'd like glibc to provide this to applications in some sort of
standardized, /programmable/ manner, with a wide range of atom
selections (not directly coded syscall numbers, but rather function
pointers to actual glibc functions, which glibc could translate to
syscall numbers, argument encodings, etc.), then I agree that doing the
compat things and making it 32/64-bit agnostic (and much more) is
pretty much a must. If 90% of this current job is finished, then
sorting those out will be at least another 90% of the work ;-)
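For the compat side, the conversion could look something like this - a
kernel-side sketch with invented names, against the same uatom layout
as above:

#include <linux/compat.h>	/* compat_uptr_t, compat_ptr() */
#include <linux/uaccess.h>	/* copy_from_user() */

/* invented 32-bit view of a uatom: */
struct compat_syslet_uatom {
	compat_ulong_t	flags;
	compat_ulong_t	nr;
	compat_uptr_t	ret_ptr;
	compat_uptr_t	next;
	compat_uptr_t	arg_ptr[6];
	compat_uptr_t	private;
};

/*
 * Convert one 32-bit atom into the native format at submission time;
 * after this the run-time only ever sees the native layout. ('next'
 * would be fixed up by the caller while it walks and bounds-checks
 * the chain.)
 */
static int compat_get_uatom(struct syslet_uatom *dst,
			    struct compat_syslet_uatom __user *src)
{
	struct compat_syslet_uatom tmp;
	int i;

	if (copy_from_user(&tmp, src, sizeof(tmp)))
		return -EFAULT;

	dst->flags   = tmp.flags;
	dst->nr      = tmp.nr;
	dst->ret_ptr = compat_ptr(tmp.ret_ptr);
	dst->next    = NULL;
	for (i = 0; i < 6; i++)
		dst->arg_ptr[i] = compat_ptr(tmp.arg_ptr[i]);
	dst->private = compat_ptr(tmp.private);

	return 0;
}

This is the "translate into an arbitrary internal format _once_" part:
it happens at submission, and the 32-bit-ness of the caller never leaks
into the run-time engine.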
And actually, this latter model scares me - and I think it scared the
hell out of you as well.
But I really have no strong opinion yet about which one we want,
without having walked the path. Somewhere inside me I'd of course like
syslets to become a widely available interface - but my fear is that it
might just not be 'human' enough to make sense - and we would not want
to tie ourselves down with an ABI that goes unused. I don't want this
to become another sys_sendfile - much talked about and _almost_ useful,
but in practice seldom used due to its programmability and utility
limitations.
OTOH, the syslet concept right now already looks quite universal, and
the main problem with AIO use in applications wasn't even just its
broken API or its broken performance, but the fundamental lack of all
Linux IO disciplines supporting AIO, and the lack of significantly
parallel hardware. We have kaio that is centered around block drivers -
then we have epoll that works best with networking, and inotify that
deals with some (but not all) VFS events - but none of them supports
every IO and event discipline well, at once. My feeling is that /this/
is the main fundamental problem with AIO in general, not just its
programmability limitations.
Right now I'm concentrating on trying to build up something on the
scheduling side that shows the issues in practice, shows the
limitations and shows the possibilities. For example, the easy ability
to turn a cachemiss thread back into a user thread (and then back into
a cachemiss thread) was a true surprise to me, and it increased utility
quite a bit. I couldn't have designed it into the concept, because it
just didn't occur to me in the early stages. The notification-ring
limitations you noticed are another important thing to fix - and these
issues go to the core scheduling model of the concept and affect
everything.
Thirdly, while Tux does not matter much to us, at least to me it is
pretty clear what it takes to get performance up to the levels of Tux -
and I don't see any big fundamental compromise possible on that front.
Syslets are partly Tux repackaged into something generic - they are
probably a bit slower than Tux's straight kernel code, but not by much,
and they don't behave fundamentally differently. And if we don't offer
at least something close to those possibilities, then people will
restart trying to add those special-purpose state-machine APIs again,
and the whole "we need /true/ async IO" game starts over.
So if we accept "make parallelism easier to program" and "get somewhat
close to Tux's performance and scalability" as a premise (which you
might not agree with in that form), then I don't think we have much
choice: either we use kernel threads, synchronous system calls and the
scheduler intelligently (and the scheduling/threading bits of syslets
are pretty much the most intelligent kernel-thread-based approach I can
imagine at the moment =B-), or we use a special-purpose KAIO
state-machine subsystem, avoiding most of the existing synchronous
infrastructure, painfully coding it into every IO discipline - and that
will certainly haunt us until the end of times.
So that's why I'm not /that/ worried about the final form of the API at
the moment - even though I agree that it is /the/ most important
decision factor in the end: I see various unavoidable externalities
constraining us very much, and in the end we either like the result and
make it available to programmers, or we don't, and limit it to
system-glue glibc use - or we throw it away altogether. I'm curious
about the end result even if it gets limited or gets thrown away
(joining 4:4 on the way to the bit bucket ;), and while I'm cautiously
optimistic that something useful can come out of this, I cannot know it
for sure at the moment.
Ingo