Message-ID: <20070203082308.GA6748@elte.hu>
Date: Sat, 3 Feb 2007 09:23:08 +0100
From: Ingo Molnar <mingo@...e.hu>
To: Linus Torvalds <torvalds@...ux-foundation.org>
Cc: Zach Brown <zach.brown@...cle.com>, linux-kernel@...r.kernel.org,
linux-aio@...ck.org, Suparna Bhattacharya <suparna@...ibm.com>,
Benjamin LaHaise <bcrl@...ck.org>
Subject: Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
* Linus Torvalds <torvalds@...ux-foundation.org> wrote:
> On Sat, 3 Feb 2007, Ingo Molnar wrote:
> >
> > Well, in my picture, 'only if you block' is a pure thread
> > utilization decision: bounce a piece of work to another thread if
> > this thread cannot complete it. (if the kernel is lucky enough that
> > the user context told it "it's fine to do that".)
>
> Sure, you can do it that way too. But at that point, your argument
> that we shouldn't do it with fibrils is wrong: you'd still need
> basically the exact same setup that Zach does in his fibril stuff, and
> the exact same hook in the scheduler, testing the exact same value
> ("do we have a pending queue of work").
did i ever lose a single word of complaint about those bits? Those are
not an issue to me. They can be applied to kernel threads just as much.
As i babbled in the very first email about this topic:
| 1) improve our basic #1 design gradually. If something is a
| bottleneck, if the scheduler has grown too fat, cut some slack. If
| micro-threads or fibrils offer anything nice for our basic thread
| model: integrate it into the kernel.
i should have said explicitly that to flip user-space from one kernel
thread to another one (upon blocking or per request) is a nice thing and
we should integrate that into the kernel's thread model.
But really, being a scheduler guy i was much more concerned about the
duplication and problems caused by the fibril concept itself - which
duplication and complexity makes up 80% of Zach's submitted patchset.
For example this bit:
[PATCH 3 of 4] Teach paths to wake a specific void * target
would totally go away if we used kernel threads for this. In the fibril
approach this is where the mess starts. Either a 'normal' wakeup has to
wake up all fibrils, or we have to make damn sure that a wakeup that in
reality goes to a fibril is never woken via wake_up/wake_up_process.
( Furthermore, i tried to include user-space micro-threads in the
argument as well, which Evgeniy Polyakov raised not so long ago related
to the kevent patchset. All these micro-thread things are of a similar
genre. )
i totally agree that the API /should/ be the main focus - but i didnt
pick the topic and most of the patchset's current size is due to the IMO
avoidable fibril concept.
regarding the API, i dont really agree with the current form and design
of Zach's interface.
fundamentally, the basic entity of this thing should be a /system call/,
not the artificial fibril thing:
+struct asys_call {
+	struct asys_result *result;
+	struct fibril fibril;
+};
i.e. the basic entity should be something that represents a system call,
with its up to 6 arguments, the later return code, state, flags and two
list entries:
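struct async_syscall {
	unsigned long nr;		/* system call number */
	unsigned long args[6];		/* up to 6 arguments */
	long err;			/* the later return code */
	unsigned long state;
	unsigned long flags;
	struct list_head list;		/* submitted/pending/completed linkage */
	struct list_head wait_list;	/* linkage for waiting on specific syscalls */
	unsigned long __pad[2];		/* pads the entry to 16 machine words */
};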
(64 bytes on 32-bit, 128 bytes on 64-bit)
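The size figures can be checked mechanically. A minimal user-space sketch, assuming a two-pointer stand-in for the kernel's struct list_head and an ILP32 or LP64 data model (pointers and unsigned long of the same width); the helper async_syscall_words() is hypothetical and exists only for illustration:

```c
#include <assert.h>
#include <stddef.h>

/* User-space stand-in for the kernel's struct list_head: two pointers. */
struct list_head {
	struct list_head *next, *prev;
};

struct async_syscall {
	unsigned long nr;
	unsigned long args[6];
	long err;
	unsigned long state;
	unsigned long flags;
	struct list_head list;
	struct list_head wait_list;
	unsigned long __pad[2];
};

/*
 * 1 + 6 + 1 + 1 + 1 + 2 + 2 + 2 = 16 machine words, i.e. 64 bytes with
 * 4-byte words and 128 bytes with 8-byte words. This assumes pointers are
 * as wide as unsigned long (true on ILP32 and LP64, not on LLP64).
 */
static_assert(sizeof(struct async_syscall) == 16 * sizeof(unsigned long),
	      "async_syscall should be exactly 16 machine words");

/* Hypothetical helper: entry size expressed in machine words. */
size_t async_syscall_words(void)
{
	return sizeof(struct async_syscall) / sizeof(unsigned long);
}
```

A constant, power-of-two entry size is what makes a plain array of these entries usable as a ring buffer without any per-entry length field.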
furthermore, i think this API should be fundamentally vectored and
fundamentally async, and hence could solve another issue as well:
submitting many little pieces of work of different IO domains in one go.
[ detail: no traditional signals should be used at all (Zach's stuff
doesnt use them, and correctly so), except when the async syscall
that is performed itself generates a signal. ]
The normal and most optimal workflow should be a user-space ring-buffer
of these constant-size struct async_syscall entries:
struct async_syscall ringbuffer[1024];
LIST_HEAD(submitted);
LIST_HEAD(pending);
LIST_HEAD(completed);
the 3 list heads are known to both the kernel and user-space, and are
actively managed by both. The kernel drives the execution of the async
system calls based on the 'submitted' list head (until it empties it)
and moves them over to the 'pending' list. User-space can complete async
syscalls based on the 'completed' list. (but a syscall can optionally be
marked as 'autocomplete' as well via the 'flags' field, in that case
it's not moved to the 'completed' list but simply removed from the
'pending' list. This can be useful for system calls that have some
implicit notification effect.)
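The list lifecycle described above can be modelled in plain user-space C. This is only a sketch under stated assumptions: the list helpers are simplified stand-ins for the kernel's list.h, and ASYNC_AUTOCOMPLETE, struct async_ctx, async_submit(), async_drain_submitted() and async_finish() are hypothetical names that do not come from any patch:

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for the kernel's circular doubly-linked list. */
struct list_head { struct list_head *next, *prev; };

void list_init(struct list_head *h) { h->next = h->prev = h; }
int list_empty(const struct list_head *h) { return h->next == h; }

void list_del(struct list_head *e)
{
	e->prev->next = e->next;
	e->next->prev = e->prev;
	e->next = e->prev = e;
}

void list_add_tail(struct list_head *e, struct list_head *h)
{
	e->prev = h->prev;
	e->next = h;
	h->prev->next = e;
	h->prev = e;
}

#define ASYNC_AUTOCOMPLETE 0x1UL	/* hypothetical bit in 'flags' */

struct async_syscall {
	unsigned long nr;
	unsigned long args[6];
	long err;
	unsigned long state;
	unsigned long flags;
	struct list_head list;
	struct list_head wait_list;
	unsigned long __pad[2];
};

/* The three list heads shared between kernel and user-space. */
struct async_ctx { struct list_head submitted, pending, completed; };

void async_ctx_init(struct async_ctx *c)
{
	list_init(&c->submitted);
	list_init(&c->pending);
	list_init(&c->completed);
}

/* User-space side: queue an entry for execution. */
void async_submit(struct async_ctx *c, struct async_syscall *s)
{
	list_add_tail(&s->list, &c->submitted);
}

/* "Kernel" side: empty 'submitted', moving every entry to 'pending'. */
unsigned int async_drain_submitted(struct async_ctx *c)
{
	unsigned int moved = 0;

	while (!list_empty(&c->submitted)) {
		struct list_head *e = c->submitted.next;

		list_del(e);
		list_add_tail(e, &c->pending);
		moved++;
	}
	return moved;
}

/* "Kernel" side: a pending syscall finished with return code 'err'. */
void async_finish(struct async_ctx *c, struct async_syscall *s, long err)
{
	assert(!list_empty(&c->pending));
	s->err = err;
	list_del(&s->list);
	/* Autocomplete entries simply leave 'pending'; others are reported. */
	if (!(s->flags & ASYNC_AUTOCOMPLETE))
		list_add_tail(&s->list, &c->completed);
}
```

The sketch ignores all locking and memory-ordering questions that a real kernel/user shared structure would raise; it only shows which list an entry sits on at each stage.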
( Note: optionally, a helper kernel-thread, when it finishes processing
a syscall, could also asynchronously check the 'submitted' list and
pick up new work. That would allow the submission of new syscalls
without any entry into the kernel. So for example on an SMT system,
this could in essence result in one CPU running in pure user-space,
submitting async syscalls via the ringbuffer, while another CPU would
in essence be running pure kernel-space, executing those entries. )
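To illustrate how submission without kernel entry might work, here is a sketch of a single-producer/single-consumer ring using C11 atomics. Note that it swaps in head/tail indices instead of the shared list heads, which is an assumption made for brevity and not the proposed design; struct ring_entry, struct async_ring, ring_submit() and ring_pickup() are all hypothetical names:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define RING_SIZE 1024	/* same entry count as ringbuffer[1024] */

static_assert((RING_SIZE & (RING_SIZE - 1)) == 0,
	      "RING_SIZE must be a power of two so index wrap-around works");

/* Reduced entry: just enough fields for the illustration. */
struct ring_entry {
	unsigned long nr;
	unsigned long args[6];
	long err;
};

struct async_ring {
	struct ring_entry slots[RING_SIZE];
	atomic_size_t head;	/* next slot the consumer reads */
	atomic_size_t tail;	/* next slot the producer writes */
};

/* Producer (user-space CPU): returns 0 on success, -1 if the ring is full. */
int ring_submit(struct async_ring *r, const struct ring_entry *e)
{
	size_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
	size_t head = atomic_load_explicit(&r->head, memory_order_acquire);

	if (tail - head == RING_SIZE)
		return -1;
	r->slots[tail % RING_SIZE] = *e;
	/* Release: the entry contents become visible before the new tail. */
	atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
	return 0;
}

/* Consumer (helper thread): returns 0 on success, -1 if nothing is queued. */
int ring_pickup(struct async_ring *r, struct ring_entry *out)
{
	size_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
	size_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);

	if (head == tail)
		return -1;
	*out = r->slots[head % RING_SIZE];
	/* Release: the slot is free for reuse only after it was copied out. */
	atomic_store_explicit(&r->head, head + 1, memory_order_release);
	return 0;
}
```

With exactly one submitting thread and one helper thread, neither side needs a lock or a system call to hand work over; the helper only has to poll ring_pickup() after finishing its previous entry.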
another crucial bit is the waiting on pending work. But because every
pending syscall entity is either already completed or has a real kernel
thread associated with it, that bit is mostly trivial: user-space can
wait on 'any' pending syscall to complete, or it could wait for a
specific list of syscalls to complete (using the ->wait_list). It could
also wait on 'a minimum number of N syscalls to complete' - to create
batching of execution. And of course it can periodically check the
'completed' list head if it has a constant and highly parallel flow of
workload - that way the 'waiting' does not actually have to happen most
of the time.
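The three waiting modes ('any', a specific list, a minimum of N) reduce to simple predicates over a wait list. A sketch of the checks a wait loop could evaluate, assuming hypothetical 'state' values ASYNC_PENDING/ASYNC_DONE and hypothetical helper names; the actual blocking is left out:

```c
#include <assert.h>
#include <stddef.h>

struct list_head { struct list_head *next, *prev; };

void list_init(struct list_head *h) { h->next = h->prev = h; }

void list_add_tail(struct list_head *e, struct list_head *h)
{
	e->prev = h->prev;
	e->next = h;
	h->prev->next = e;
	h->prev = e;
}

#define ASYNC_PENDING	0UL	/* hypothetical 'state' values */
#define ASYNC_DONE	1UL

struct async_syscall {
	unsigned long nr;
	unsigned long args[6];
	long err;
	unsigned long state;
	unsigned long flags;
	struct list_head list;
	struct list_head wait_list;
	unsigned long __pad[2];
};

/* Walk ->wait_list links; return how many entries are done. */
size_t wait_list_done(const struct list_head *head, size_t *total)
{
	size_t done = 0, n = 0;
	const struct list_head *p;

	assert(head != NULL);
	for (p = head->next; p != head; p = p->next) {
		/* Recover the entry from its embedded wait_list member. */
		const struct async_syscall *s = (const struct async_syscall *)
			((const char *)p - offsetof(struct async_syscall, wait_list));

		n++;
		if (s->state == ASYNC_DONE)
			done++;
	}
	if (total)
		*total = n;
	return done;
}

/* 'any': at least one of the listed syscalls has completed. */
int wait_any_ready(const struct list_head *head)
{
	return wait_list_done(head, NULL) >= 1;
}

/* 'minimum of N': enables batching of completions. */
int wait_min_ready(const struct list_head *head, size_t n)
{
	return wait_list_done(head, NULL) >= n;
}

/* 'specific list': every listed syscall has completed. */
int wait_all_ready(const struct list_head *head)
{
	size_t total;
	size_t done = wait_list_done(head, &total);

	return done == total;
}
```

A busy, highly parallel workload would call these predicates opportunistically and only fall back to a blocking wait when they return false.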
Looks like we can hit many birds with this single stone: AIO, vectored
syscalls, fine-grained system-call parallelism. Hm?
Ingo