Message-ID: <20070216122806.GA27455@elte.hu>
Date: Fri, 16 Feb 2007 13:28:06 +0100
From: Ingo Molnar <mingo@...e.hu>
To: Linus Torvalds <torvalds@...ux-foundation.org>
Cc: Evgeniy Polyakov <johnpol@....mipt.ru>,
Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
Arjan van de Ven <arjan@...radead.org>,
Christoph Hellwig <hch@...radead.org>,
Andrew Morton <akpm@....com.au>,
Alan Cox <alan@...rguk.ukuu.org.uk>,
Ulrich Drepper <drepper@...hat.com>,
Zach Brown <zach.brown@...cle.com>,
"David S. Miller" <davem@...emloft.net>,
Benjamin LaHaise <bcrl@...ck.org>,
Suparna Bhattacharya <suparna@...ibm.com>,
Davide Libenzi <davidel@...ilserver.org>,
Thomas Gleixner <tglx@...utronix.de>
Subject: Re: [patch 05/11] syslets: core code
* Linus Torvalds <torvalds@...ux-foundation.org> wrote:
> On Thu, 15 Feb 2007, Linus Torvalds wrote:
> >
> > So I think that a good implementation just does everything up-front,
> > and doesn't _need_ a user buffer that is live over longer periods,
> > except for the actual results. Exactly because the whole
> > alloc/teardown is nasty.
>
> Btw, this doesn't necessarily mean "not supporting multiple atoms at
> all".
>
> I think the batching of async things is potentially a great idea. I
> think it's quite workable for "open+fstat" kind of things, and I agree
> that it can solve other things too (the
> "socket+bind+connect+sendmsg+rcv" kind of complex setup things).
>
> But I suspect that if we just said:
> - we limit these atom sequences to just linear sequences of max "n" ops
> - we read them all in, in a single go, at startup
>
> we actually avoid several nasty issues. Not just the memory allocation
> issue in user space (now it's perfectly ok to build up a sequence of
> ops in temporary memory and throw it away once it's been submitted),
> but also issues like the 32-bit vs 64-bit compatibility stuff (the
> compat handlers would just convert it when they do the initial
> copying, and then the actual run-time wouldn't care about user-level
> pointers having different sizes etc).
>
> Would it make the interface less cool? Yeah. Would it limit it to just
> a few linked system calls (to avoid memory allocation issues in the
> kernel)? Yes again. But it would simplify a lot of the interface
> issues.
>
> It would _also_ allow the "sys_aio_read()" function to build up its
> *own* set of atoms in kernel space to actually do the read, and there
> would be no impact of the actual run-time wanting to read stuff from
> user space. Again - it's actually the same issue as with the compat
> system call: by making the interfaces do things up-front rather than
> dynamically, it becomes more static, but also easier to do interface
> translations. You can translate into any arbitrary internal format
> _once_, and be done with it.
>
> I dunno.
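To make the above concrete, here is how I read your up-front model - a
minimal sketch, with every name invented (this is not a proposed ABI):

/*
 * "Linear sequence of max n ops, read in a single go": a bounded
 * vector of ops that the kernel copies in completely at submission
 * time. All names are made up for illustration.
 */
#define SEQ_MAX_OPS	8			/* "max n ops" */

struct seq_op {
	unsigned long	nr;			/* syscall number */
	unsigned long	args[6];		/* plain-value arguments */
};

struct seq_submit {
	unsigned long	nr_ops;			/* <= SEQ_MAX_OPS */
	struct seq_op	ops[SEQ_MAX_OPS];	/* copied in as one block */
	long		results[SEQ_MAX_OPS];	/* the only long-lived part */
};

Because the kernel would copy the whole thing in one go, user-space can
build the sequence in temporary memory and throw it away right after
submission, and a compat handler can convert the block once, up front -
the run-time never has to look at user-level pointers again.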
[ hm. I again wrote a pretty long email for you to read. Darn! ]
Regarding the API - I share most of your concerns, and it's all a
function of how widely we want to push this into user-space.
My initial thought was for syslets to be used by glibc mainly as small,
secure kernel-side 'syscall plugins' - so that it can do things like
'POSIX AIO signal notifications' (which are madness in terms of
performance, but which applications rely on) /without/ having to burden
the kernel-side AIO with such requirements: glibc just appends an
enclosing sys_kill() to the syslet and it will do the proper signal
notification, asynchronously. (And of course syslets can be used for
the Tux type of performance sillinesses as well ;-)
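To make that concrete, here is roughly what glibc could build for such
a notification - a sketch against the uatom layout posted in this
series; the variable names (and the pread64+kill choice) are just for
illustration:

#include <sys/syscall.h>	/* __NR_pread64, __NR_kill */

/* uatom layout as in this series (user-space view, __user dropped): */
struct syslet_uatom {
	unsigned long		flags;
	unsigned long		nr;
	long			*ret_ptr;
	struct syslet_uatom	*next;
	unsigned long		*arg_ptr[6];
	void			*private;
};

/* all arguments are passed via pointers to longs: */
static unsigned long a_fd, a_buf, a_count, a_pos, a_pid, a_sig;
static long read_ret, kill_ret;

static struct syslet_uatom kill_atom = {
	.nr	 = __NR_kill,
	.ret_ptr = &kill_ret,
	.arg_ptr = { &a_pid, &a_sig },
	.next	 = NULL,			/* end of the syslet */
};

static struct syslet_uatom read_atom = {
	.nr	 = __NR_pread64,
	.ret_ptr = &read_ret,
	.arg_ptr = { &a_fd, &a_buf, &a_count, &a_pos },
	.next	 = &kill_atom,			/* then: signal the issuer */
};

/* glibc fills in a_* and submits read_atom via sys_async_exec() */

The point being that the notification policy lives entirely in the
glibc-built atom chain - the kernel-side AIO code never learns about
it.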
So a sane user API (all used at the glibc level, not at the application
level) would use simple syslets, while more broken ones would have to
use longer ones - but nobody would have the burden of having to
synchronize back to the issuer context. Natural selection would pull
application use towards the APIs with the shorter syslets. (At least so
I hope.)
In this model syslets aren't really user-programmable entities but
rather small plugins available to glibc to build up more complex, more
innovative (or just more broken) APIs than what the kernel wants to
provide - without putting any true new ABI dependency on the kernel,
other than the already existing syscall ABIs.
But if we'd like glibc to provide this to applications in some sort of
standardized, /programmable/ manner, with a wide range of atom
selections (not directly coded syscall numbers, but rather function
pointers to actual glibc functions, which glibc could translate to
syscall numbers, argument encodings, etc.), then I agree that doing the
compat things and making it 32/64-bit agnostic (and much more) is
pretty much a must. If 90% of this current job is finished, then
sorting those out will be at least another 90% of the work ;-)
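For the compat side, the conversion could look something like this - a
kernel-side sketch with invented names, against the same uatom layout
as above:

#include <linux/compat.h>	/* compat_uptr_t, compat_ptr() */
#include <linux/uaccess.h>	/* copy_from_user() */

/* invented 32-bit view of a uatom: */
struct compat_syslet_uatom {
	compat_ulong_t	flags;
	compat_ulong_t	nr;
	compat_uptr_t	ret_ptr;
	compat_uptr_t	next;
	compat_uptr_t	arg_ptr[6];
	compat_uptr_t	private;
};

/*
 * Convert one 32-bit atom into the native format at submission time;
 * after this the run-time only ever sees the native layout. ('next'
 * would be fixed up by the caller while it walks and bounds-checks
 * the chain.)
 */
static int compat_get_uatom(struct syslet_uatom *dst,
			    struct compat_syslet_uatom __user *src)
{
	struct compat_syslet_uatom tmp;
	int i;

	if (copy_from_user(&tmp, src, sizeof(tmp)))
		return -EFAULT;

	dst->flags   = tmp.flags;
	dst->nr      = tmp.nr;
	dst->ret_ptr = compat_ptr(tmp.ret_ptr);
	dst->next    = NULL;
	for (i = 0; i < 6; i++)
		dst->arg_ptr[i] = compat_ptr(tmp.arg_ptr[i]);
	dst->private = compat_ptr(tmp.private);

	return 0;
}

This is the "translate into an arbitrary internal format _once_" part:
it happens at submission, and the 32-bit-ness of the caller never leaks
into the run-time engine.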
And actually, this latter model scares me - and I think it scared the
hell out of you as well.
But I really have no strong opinion yet about which one we want,
without having walked the path. Somewhere inside me I'd of course like
syslets to become a widely available interface - but my fear is that it
might just not be 'human' enough to make sense - and we would not want
to tie ourselves down with an ABI that goes unused. I don't want this
to become another sys_sendfile - much talked about and _almost_ useful,
but in practice seldom used due to its programmability and utility
limitations.
OTOH, the syslet concept right now already looks quite universal, and
the main problem with AIO use in applications wasn't even just its
broken API or its broken performance, but the fundamental lack of all
Linux IO disciplines supporting AIO, and the lack of significantly
parallel hardware. We have kaio that is centered around block drivers -
then we have epoll that works best with networking, and inotify that
deals with some (but not all) VFS events - but none of them supports
every IO and event discipline well, at once. My feeling is that /this/
is the main fundamental problem with AIO in general, not just its
programmability limitations.
Right now I'm concentrating on trying to build up something on the
scheduling side that shows the issues in practice, shows the
limitations and shows the possibilities. For example, the easy ability
to turn a cachemiss thread back into a user thread (and then back into
a cachemiss thread) was a true surprise to me, and it increased utility
quite a bit. I couldn't have designed it into the concept, because it
just didn't occur to me in the early stages. The notification-ring
limitations you noticed are another important thing to fix - and these
issues go to the core scheduling model of the concept and affect
everything.
Thirdly, while Tux does not matter much to us, at least to me it is
pretty clear what it takes to get performance up to the levels of Tux -
and I don't see any big fundamental compromise possible on that front.
Syslets are partly Tux repackaged into something generic - they are
probably a bit slower than Tux's straight kernel code, but not by much,
and they don't behave fundamentally differently. And if we don't offer
at least something close to those possibilities, then people will
restart trying to add those special-purpose state-machine APIs again,
and the whole "we need /true/ async IO" game starts over.
So if we accept "make parallelism easier to program" and "get somewhat
close to Tux's performance and scalability" as a premise (which you
might not agree with in that form), then I don't think we have much
choice: either we use kernel threads, synchronous system calls and the
scheduler intelligently (and the scheduling/threading bits of syslets
are pretty much the most intelligent kernel-thread-based approach I can
imagine at the moment =B-), or we use a special-purpose KAIO
state-machine subsystem, avoiding most of the existing synchronous
infrastructure, painfully coding it into every IO discipline - and that
will certainly haunt us until the end of times.
So that's why I'm not /that/ worried about the final form of the API at
the moment - even though I agree that it is /the/ most important
decision factor in the end: I see various unavoidable externalities
constraining us very much, and in the end we either like the result and
make it available to programmers, or we don't, and limit it to
system-glue glibc use - or we throw it away altogether. I'm curious
about the end result even if it gets limited or gets thrown away
(joining 4:4 on the way to the bit bucket ;), and while I'm cautiously
optimistic that something useful can come out of this, I cannot know it
for sure at the moment.
Ingo