Message-Id: <117BE5D6-146E-407D-887E-067F212BA871@oracle.com>
Date: Thu, 15 Feb 2007 11:16:18 -0800
From: Zach Brown <zach.brown@...cle.com>
To: bert hubert <bert.hubert@...herlabs.nl>
Cc: Linus Torvalds <torvalds@...ux-foundation.org>,
Evgeniy Polyakov <johnpol@....mipt.ru>,
Ingo Molnar <mingo@...e.hu>,
Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
Arjan van de Ven <arjan@...radead.org>,
Christoph Hellwig <hch@...radead.org>,
Andrew Morton <akpm@....com.au>,
Alan Cox <alan@...rguk.ukuu.org.uk>,
Ulrich Drepper <drepper@...hat.com>,
"David S. Miller" <davem@...emloft.net>,
Benjamin LaHaise <bcrl@...ck.org>,
Suparna Bhattacharya <suparna@...ibm.com>,
Davide Libenzi <davidel@...ilserver.org>,
Thomas Gleixner <tglx@...utronix.de>
Subject: Re: [patch 05/11] syslets: core code
> 2) On the client facing side (port 53), I'd very much hope for a way to
> do 'recvv' on datagram sockets, so I can retrieve a whole bunch of
> UDP datagrams with only one kernel transition.
I want to highlight this point that Bert is making.
Whenever we talk about AIO and kernel threads some folks are rightly
concerned that we're talking about a thread *per IO* and fear that
memory consumption will be fatal.
Take the case of userspace which implements what we'd think of as
page cache writeback. (*coughs, points at email address*). It wants
to issue thousands of IOs to disjoint regions of a file. "Thousands
of kernel threads, oh crap!"
But it only issues each IO with a separate syscall (or io_submit()
op) because it doesn't have an interface that lets it specify IOs
that vector user memory addresses *and file position*.
If we had a seemingly obvious interface that let it kick off batched
IOs to different parts of the file, the looming disaster of a thread
per IO vanishes in that case.
	struct off_vec {
		off_t pos;
		size_t len;
	};

	long sys_sgwrite(int fd, struct iovec *memvec, size_t mv_count,
			 struct off_vec *ovec, size_t ov_count);
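For illustration, here is a minimal userspace sketch of what such a call might mean, emulated with one pwrite() per chunk. The name sgwrite_emul() and the pairing rule (the bytes described by memvec are treated as one stream and laid into the ovec regions in order) are assumptions, not something the proposal above specifies. The emulation still pays one syscall per chunk, which is exactly the cost a real sys_sgwrite() would be meant to remove.

```c
#define _XOPEN_SOURCE 700	/* for pwrite() */
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

struct off_vec {
	off_t pos;
	size_t len;
};

/*
 * Hypothetical emulation of the proposed sys_sgwrite() semantics.
 * Assumption: memvec describes one logical byte stream, which is
 * written into the file regions listed in ovec, in order.
 * Returns the number of bytes written, or -1 if pwrite() fails.
 */
static long sgwrite_emul(int fd, const struct iovec *memvec, size_t mv_count,
			 const struct off_vec *ovec, size_t ov_count)
{
	size_t mi = 0;		/* current memory segment */
	size_t moff = 0;	/* bytes already consumed from that segment */
	long total = 0;

	for (size_t oi = 0; oi < ov_count; oi++) {
		off_t pos = ovec[oi].pos;
		size_t left = ovec[oi].len;

		while (left > 0) {
			/* Step past exhausted (or empty) memory segments. */
			while (mi < mv_count && moff == memvec[mi].iov_len) {
				mi++;
				moff = 0;
			}
			if (mi == mv_count)
				return total;	/* ran out of source memory */

			size_t chunk = memvec[mi].iov_len - moff;
			if (chunk > left)
				chunk = left;

			ssize_t n = pwrite(fd,
					   (const char *)memvec[mi].iov_base + moff,
					   chunk, pos);
			if (n < 0)
				return -1;
			if (n == 0)
				return total;

			/* Advance by what was actually written (short writes). */
			moff += (size_t)n;
			pos += n;
			left -= (size_t)n;
			total += n;
		}
	}
	return total;
}
```

With two memory segments of 4 and 6 bytes and two file regions { 0, 3 } and { 10, 7 }, this issues three pwrite() calls and leaves a hole between the regions.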
It doesn't take long to imagine other uses for this that are less
exotic.
Take e2fsck and its iterating through indirect blocks or directory
data blocks. It has a list of disjoint file regions (blocks) it
wants to read, but it does them serially to keep the code from
getting even more confusing. blktrace a clean e2fsck -f some time:
it's leaving *HALF* of the disk read bandwidth on the table by
performing serial block-sized reads. If it could specify batches of
them the code would still be simple but it could tell the kernel and
IO scheduler *exactly* what it wants, without having to mess around
with sys_readahead() or AIO or any of that junk :).
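As a rough sketch of how a block list like e2fsck's might map onto the off_vec structure proposed above, the helper below turns an ascending list of block numbers into file regions, merging adjacent blocks into one entry. The helper name blocks_to_off_vec() and the merging step are assumptions for illustration only; nothing here is taken from e2fsck itself.

```c
#include <stddef.h>
#include <sys/types.h>

struct off_vec {
	off_t pos;
	size_t len;
};

/*
 * Hypothetical helper: convert block numbers (assumed ascending) into
 * off_vec entries, merging runs of adjacent blocks into one region.
 * 'out' must have room for n entries. Returns the entries used.
 */
static size_t blocks_to_off_vec(const unsigned long *blocks, size_t n,
				size_t blksz, struct off_vec *out)
{
	size_t count = 0;

	for (size_t i = 0; i < n; i++) {
		off_t pos = (off_t)blocks[i] * (off_t)blksz;

		if (count > 0 &&
		    out[count - 1].pos + (off_t)out[count - 1].len == pos) {
			/* Block directly follows the previous region: extend it. */
			out[count - 1].len += blksz;
		} else {
			/* Start a new disjoint region. */
			out[count].pos = pos;
			out[count].len = blksz;
			count++;
		}
	}
	return count;
}
```

Six 4096-byte blocks numbered 8, 9, 10, 40, 41 and 100 collapse into three regions, which a batched read interface could take in one call instead of six serial reads.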
Anyway, that's just something that's been on my mind. If there are
obvious clean opportunities to get more done with single syscalls, it
might not be such a bad thing.
- z