Message-ID: <20070214103731.GB6801@elte.hu>
Date: Wed, 14 Feb 2007 11:37:31 +0100
From: Ingo Molnar <mingo@...e.hu>
To: Evgeniy Polyakov <johnpol@....mipt.ru>
Cc: Benjamin LaHaise <bcrl@...ck.org>, Alan <alan@...rguk.ukuu.org.uk>,
linux-kernel@...r.kernel.org,
Linus Torvalds <torvalds@...ux-foundation.org>,
Arjan van de Ven <arjan@...radead.org>,
Christoph Hellwig <hch@...radead.org>,
Andrew Morton <akpm@....com.au>,
Ulrich Drepper <drepper@...hat.com>,
Zach Brown <zach.brown@...cle.com>,
"David S. Miller" <davem@...emloft.net>,
Suparna Bhattacharya <suparna@...ibm.com>,
Davide Libenzi <davidel@...ilserver.org>,
Thomas Gleixner <tglx@...utronix.de>
Subject: Re: [patch 00/11] ANNOUNCE: "Syslets", generic asynchronous system call support
* Evgeniy Polyakov <johnpol@....mipt.ru> wrote:
> Let me clarify what I meant. There is only a limited number of threads
> that are supposed to execute the blocking context, so when all of them
> are in use, the main one will block too. I asked about the possibility of
> reusing the same thread to execute a queue of requests attached to it:
> each request can block, but once the blocking condition is resolved, it
> would be possible to return.
Ah, OK, I understand your point. This is not quite possible: the
cachemisses are driven from schedule(), which can be arbitrarily deep
inside arbitrary system calls. It can be in a mutex_lock() deep inside a
driver. It can be due to an alloc_pages() call done by a kmalloc() call
done from within ext3, which was called from the loopback block driver,
which was called from XFS, which was called from a VFS syscall.
Even if it were possible to backtrack, I'm quite sure we don't want to do
this, for three main reasons:
Firstly, backtracking and retrying always has a cost. We construct state
on the way in - and we destruct on the way out. The kernel stack we have
built up has a (nontrivial) construction cost and thus a construction
value - we should preserve that if possible.
Secondly, and this is equally important: I wanted the number of async
kernel threads to be the natural throttling mechanism. If user-space
wants to use fewer threads and overcommit the request queue, then it can
do so in user-space: by over-queueing requests into a separate list,
then taking requests from that list and submitting them as earlier ones
complete. User-space has precise knowledge of overqueueing scenarios: if
the event ring is full, then all async kernel threads are busy.
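To make the user-space side of this concrete, here is a toy Python model. It is not the syslet API: AsyncPool, UserQueue and their methods are hypothetical names. The "kernel" accepts at most nthreads in-flight requests and refuses the rest, and user-space parks the overflow in its own FIFO list, resubmitting as completions arrive:

```python
from collections import deque


class AsyncPool:
    """Hypothetical stand-in for the kernel side: at most `nthreads`
    requests can be in flight, one per async kernel thread."""

    def __init__(self, nthreads):
        self.nthreads = nthreads
        self.in_flight = set()

    def full(self):
        # Mirrors "event ring full => all async kernel threads are busy".
        return len(self.in_flight) >= self.nthreads

    def submit(self, req):
        if self.full():
            return False  # no free thread: the kernel does not queue it
        self.in_flight.add(req)
        return True

    def complete(self, req):
        self.in_flight.remove(req)


class UserQueue:
    """User-space over-queueing: park excess requests in a private list
    and resubmit them as earlier requests complete."""

    def __init__(self, pool):
        self.pool = pool
        self.backlog = deque()

    def submit(self, req):
        # Keep FIFO order: if anything is already parked, park this one too.
        if self.backlog or not self.pool.submit(req):
            self.backlog.append(req)

    def on_completion(self, req):
        self.pool.complete(req)
        # A thread just became free: feed it the oldest parked request(s).
        while self.backlog and self.pool.submit(self.backlog[0]):
            self.backlog.popleft()
```

With two threads and four submissions, "a" and "b" go in flight while "c" and "d" wait in the user-space backlog; completing "a" lets "c" in. The throttling decision stays entirely in user-space, and the kernel never holds more pending work than it has threads.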
But note that there's a deeper reason as well for not wanting
over-queueing: the main cost of a 'pending request' is the kernel stack
of the blocked thread itself! So do we want to allow 'requests' to stay
'pending' even if there are "no more threads available"? Nope: because
letting them 'pend' would essentially (and implicitly) mean an increase
of the thread pool.
In other words: with the syslet subsystem, a kernel thread /is/ the
asynchronous request itself. So 'have more requests pending' means 'have
more kernel threads'. And 'no kernel thread available' must thus mean
'no queueing of this request'.
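A minimal sketch of this "thread is the request" accounting, under stated assumptions: CachemissThread and ThreadPool are hypothetical names, a request is represented by whatever object sits in a thread's slot, and the kernel stack size is a parameter rather than a claim about any particular kernel configuration. There is deliberately no request queue, so pending requests can only be counted by counting blocked threads:

```python
class CachemissThread:
    """Hypothetical model of one async kernel thread. The request it is
    blocked in lives on this thread's kernel stack, so the thread *is*
    the pending request; there is no separate request queue."""

    def __init__(self):
        self.request = None  # stands in for the blocked call's stack state


class ThreadPool:
    def __init__(self, nthreads, stack_bytes):
        self.threads = [CachemissThread() for _ in range(nthreads)]
        self.stack_bytes = stack_bytes  # assumed per-thread stack size

    def block_in(self, request):
        """Park a blocking request on a free thread, or refuse it."""
        for t in self.threads:
            if t.request is None:
                t.request = request
                return t
        return None  # no kernel thread available: no queueing of this request

    def wake(self, thread):
        thread.request = None  # the call completed; the stack unwinds

    def pending(self):
        return sum(t.request is not None for t in self.threads)

    def pending_memory(self):
        # The main cost of a pending request: one kernel stack per blocked thread.
        return self.pending() * self.stack_bytes
```

With two threads and an assumed 8 KiB stack, a third blocking request is refused, and two pending requests cost 16 KiB of stack. Allowing the third to "pend" would require a third stack, which is exactly a larger thread pool.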
Thirdly, there is a performance advantage to this queueing property as
well: by letting a cachemiss thread do only a single syslet, all work is
concentrated back in the 'head' task, and all queueing decisions are
immediately known by user-space and can be acted upon.
So the work-queueing setup is not symmetric at all, there's a
fundamental bias and tendency back towards the head task - this helps
caching too. That's what Tux did too - it always tried to queue back to
the 'head task' as soon as it could. Spreading out work dynamically and
transparently is necessary and nice, but it's useless if the system has
no automatic tendency to move back into a single-threaded (fully cached)
state when the workload becomes less parallel. Without this fundamental
(and transparent) 'shrink parallelism' property, syslets would merely
degrade into yet another threading construct.
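The 'shrink parallelism' tendency can be illustrated with a toy tick-based model. The assumptions are mine, not taken from the syslet patches: the head task submits one request per tick, a request either completes inline (0 ticks of blocking) or blocks for a given number of ticks, and the thread pool is unbounded. The function name cachemiss_thread_counts is hypothetical. Because each cachemiss thread runs a single syslet and then goes idle instead of pulling more work, the count of busy threads falls back to zero as soon as the blocking burst is over:

```python
def cachemiss_thread_counts(requests):
    """requests: one (name, block_ticks) pair submitted per tick by the
    head task. block_ticks == 0: the call completes inline in the head
    task. block_ticks > 0: the call blocks, a cachemiss thread takes over
    that one syslet, and the head task returns to user-space.
    Returns the number of busy cachemiss threads after each tick."""
    finish = []   # tick at which each busy cachemiss thread finishes
    history = []
    for tick, (_name, block_ticks) in enumerate(requests):
        # A cachemiss thread runs exactly one syslet, then goes idle rather
        # than pulling new work, so new submissions always start in the head.
        finish = [end for end in finish if end > tick]
        if block_ticks > 0:
            finish.append(tick + block_ticks)
        history.append(len(finish))
    return history
```

For blocking durations [0, 3, 3, 0, 0, 0, 0] the busy-thread counts are [0, 1, 2, 2, 1, 0, 0]: parallelism grows only while calls actually block, then drains back to the single head task. A worker pool whose threads loop on a shared request queue would lack this built-in drift back to one context.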
Ingo
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/