Message-Id: <117BE5D6-146E-407D-887E-067F212BA871@oracle.com>
Date:	Thu, 15 Feb 2007 11:16:18 -0800
From:	Zach Brown <zach.brown@...cle.com>
To:	bert hubert <bert.hubert@...herlabs.nl>
Cc:	Linus Torvalds <torvalds@...ux-foundation.org>,
	Evgeniy Polyakov <johnpol@....mipt.ru>,
	Ingo Molnar <mingo@...e.hu>,
	Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
	Arjan van de Ven <arjan@...radead.org>,
	Christoph Hellwig <hch@...radead.org>,
	Andrew Morton <akpm@....com.au>,
	Alan Cox <alan@...rguk.ukuu.org.uk>,
	Ulrich Drepper <drepper@...hat.com>,
	"David S. Miller" <davem@...emloft.net>,
	Benjamin LaHaise <bcrl@...ck.org>,
	Suparna Bhattacharya <suparna@...ibm.com>,
	Davide Libenzi <davidel@...ilserver.org>,
	Thomas Gleixner <tglx@...utronix.de>
Subject: Re: [patch 05/11] syslets: core code
> 2) 	On the client facing side (port 53), I'd very much hope for a way to
> 	do 'recvv' on datagram sockets, so I can retrieve a whole bunch of
> 	UDP datagrams with only one kernel transition.
I want to highlight this point that Bert is making.
Whenever we talk about AIO and kernel threads some folks are rightly  
concerned that we're talking about a thread *per IO* and fear that  
memory consumption will be fatal.
Take the case of userspace which implements what we'd think of as  
page cache writeback.  (*coughs, points at email address*).  It wants  
to issue thousands of IOs to disjoint regions of a file.  "Thousands  
of kernel threads, oh crap!"
But it only issues each IO with a separate syscall (or io_submit()  
op) because it doesn't have an interface that lets it specify IOs  
that vector user memory addresses *and file position*.
If we had a seemingly obvious interface that let it kick off batched  
IOs to different parts of the file, the looming disaster of a thread  
per IO vanishes in that case.
struct off_vec {
	off_t pos;
	size_t len;
};
long sys_sgwrite(int fd, struct iovec *memvec, size_t mv_count,
	struct off_vec *ovec, size_t ov_count);
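To make the intended semantics concrete, here is a rough userspace sketch, not an implementation of the proposal. It assumes memvec is treated as one byte stream that is poured into the file regions in ovec, in order; the prototype above does not spell that out. The names sgwrite_emulated() and sgwrite_demo() are hypothetical and exist only for this illustration.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

struct off_vec {
	off_t pos;
	size_t len;
};

/*
 * Hypothetical userspace stand-in for the proposed sys_sgwrite().
 * Assumed semantics: memvec is one byte stream, ovec lists the file
 * regions to fill in order.  Returns bytes written, or -1 with errno
 * set if nothing could be written.
 */
long sgwrite_emulated(int fd, const struct iovec *memvec, size_t mv_count,
	const struct off_vec *ovec, size_t ov_count)
{
	size_t mi = 0, moff = 0;	/* current memory segment, offset in it */
	long total = 0;

	assert(memvec != NULL || mv_count == 0);
	assert(ovec != NULL || ov_count == 0);

	for (size_t oi = 0; oi < ov_count; oi++) {
		off_t pos = ovec[oi].pos;
		size_t left = ovec[oi].len;

		while (left > 0) {
			/* skip exhausted (or empty) memory segments */
			while (mi < mv_count && moff == memvec[mi].iov_len) {
				mi++;
				moff = 0;
			}
			if (mi == mv_count)
				return total;	/* memory ran out first */

			size_t chunk = memvec[mi].iov_len - moff;
			if (chunk > left)
				chunk = left;

			ssize_t ret = pwrite(fd,
				(const char *)memvec[mi].iov_base + moff,
				chunk, pos);
			if (ret < 0) {
				if (errno == EINTR)
					continue;
				return total ? total : -1;
			}
			if (ret == 0)
				return total;	/* no progress, give up */
			moff += (size_t)ret;
			pos += ret;
			left -= (size_t)ret;
			total += ret;
		}
	}
	return total;
}

/* Usage sketch: scatter 8 bytes from two buffers into two file regions. */
int sgwrite_demo(void)
{
	char a[] = "abc", b[] = "defgh", out[4];
	struct iovec mem[] = {
		{ .iov_base = a, .iov_len = 3 },
		{ .iov_base = b, .iov_len = 5 },
	};
	struct off_vec regions[] = { { 100, 4 }, { 0, 4 } };
	FILE *f = tmpfile();
	int ok;

	if (!f)
		return -1;
	ok = sgwrite_emulated(fileno(f), mem, 2, regions, 2) == 8 &&
	     pread(fileno(f), out, 4, 100) == 4 && !memcmp(out, "abcd", 4) &&
	     pread(fileno(f), out, 4, 0) == 4 && !memcmp(out, "efgh", 4);
	fclose(f);
	return ok ? 0 : -1;
}
```

Note that this emulation still pays one pwrite() call per contiguous chunk, which is exactly the per-IO cost a real batched syscall would avoid.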
It doesn't take long to imagine other uses for this that are less  
exotic.
Take e2fsck and its iterating through indirect blocks or directory  
data blocks.  It has a list of disjoint file regions (blocks) it  
wants to read, but it does them serially to keep the code from  
getting even more confusing.  blktrace a clean e2fsck -f some time.
It's leaving *HALF* of the disk read bandwidth on the table by
performing serial block-sized reads.  If it could specify batches of  
them the code would still be simple but it could tell the kernel and  
IO scheduler *exactly* what it wants, without having to mess around  
with sys_readahead() or AIO or any of that junk :).
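As a sketch of how a caller like that could describe its batch, the helper below turns a list of block numbers into off_vec entries and merges runs of adjacent blocks into one region. The name blocks_to_off_vec() and the choice to coalesce are assumptions made for illustration; nothing here is taken from e2fsck itself.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>

struct off_vec {
	off_t pos;
	size_t len;
};

/*
 * Hypothetical helper: convert block numbers into off_vec entries.
 * A block that starts exactly where the previous region ends is
 * merged into it, so an ascending list gives the fewest regions.
 * out must have room for nblocks entries.  Returns entries stored.
 */
size_t blocks_to_off_vec(const unsigned long *blocks, size_t nblocks,
	size_t block_size, struct off_vec *out)
{
	size_t n = 0;

	assert(out != NULL || nblocks == 0);

	for (size_t i = 0; i < nblocks; i++) {
		off_t pos = (off_t)blocks[i] * (off_t)block_size;

		if (n > 0 && out[n - 1].pos + (off_t)out[n - 1].len == pos) {
			/* adjacent to the previous region: extend it */
			out[n - 1].len += block_size;
		} else {
			out[n].pos = pos;
			out[n].len = block_size;
			n++;
		}
	}
	return n;
}
```

With 4096-byte blocks, the list 10, 11, 12, 40, 41, 99 collapses to three regions, which is the kind of batch a vectored read interface could take in one call.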
Anyway, that's just something that's been on my mind.  If there are  
obvious clean opportunities to get more done with single syscalls, it  
might not be such a bad thing.
- z
