Message-ID: <20070112202316.GA28400@think.oraclecorp.com>
Date: Fri, 12 Jan 2007 15:23:16 -0500
From: Chris Mason <chris.mason@...cle.com>
To: Linus Torvalds <torvalds@...l.org>
Cc: dean gaudet <dean@...tic.org>, Viktor <vvp01@...ox.ru>,
Aubrey <aubreylee@...il.com>, Hua Zhong <hzhong@...il.com>,
Hugh Dickins <hugh@...itas.com>, linux-kernel@...r.kernel.org,
hch@...radead.org, kenneth.w.chen@...el.com, akpm@...l.org,
mjt@....msk.ru
Subject: Re: O_DIRECT question
On Fri, Jan 12, 2007 at 10:06:22AM -0800, Linus Torvalds wrote:
> > looking at the splice(2) api it seems like it'll be difficult to implement
> > O_DIRECT pread/pwrite from userland using splice... so there'd need to be
> > some help there.
>
> You'd use vmsplice() to put the write buffers into kernel space (user
> space sees it's a pipe file descriptor, but you should just ignore that:
> it's really just a kernel buffer). And then splice the resulting kernel
> buffers to the destination.
I recently spent some time trying to integrate O_DIRECT locking with
page cache locking. The basic theory is that instead of using
semaphores for solving O_DIRECT vs buffered races, you put something
into the radix tree (I call it a placeholder) to keep the page cache
users out, and lock any existing pages that are present.
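
A toy userspace sketch of the placeholder idea described above, not the actual patch. A flat array stands in for the radix tree, and every name here (toy_mapping, dio_lock_range, buffered_get_page, ...) is hypothetical. Where the kernel would sleep and retry, this version simply returns false.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NSLOTS 64                 /* toy address space: 64 page indexes */

enum slot_state { SLOT_EMPTY, SLOT_PAGE, SLOT_PLACEHOLDER };

struct slot {
    enum slot_state state;
    bool locked;                  /* only meaningful for SLOT_PAGE */
};

/* Stand-in for the page cache radix tree, indexed by page index. */
struct toy_mapping {
    struct slot slots[NSLOTS];
};

/* True if the slot is already claimed by someone else. */
bool slot_busy(const struct slot *s)
{
    return s->state == SLOT_PLACEHOLDER ||
           (s->state == SLOT_PAGE && s->locked);
}

/* Buffered path: find or create the page at idx.
 * Returns false when a placeholder or locked page keeps us out. */
bool buffered_get_page(struct toy_mapping *m, size_t idx)
{
    assert(idx < NSLOTS);
    if (slot_busy(&m->slots[idx]))
        return false;
    m->slots[idx].state = SLOT_PAGE;
    return true;
}

/* O_DIRECT path: claim page indexes [start, end]. Empty slots get a
 * placeholder, existing pages get locked. Returns false on conflict. */
bool dio_lock_range(struct toy_mapping *m, size_t start, size_t end)
{
    assert(start <= end && end < NSLOTS);
    /* Check first so a conflict never needs a rollback. */
    for (size_t i = start; i <= end; i++)
        if (slot_busy(&m->slots[i]))
            return false;
    for (size_t i = start; i <= end; i++) {
        if (m->slots[i].state == SLOT_EMPTY)
            m->slots[i].state = SLOT_PLACEHOLDER;
        else
            m->slots[i].locked = true;
    }
    return true;
}

/* Release a range previously claimed by dio_lock_range(). */
void dio_unlock_range(struct toy_mapping *m, size_t start, size_t end)
{
    assert(start <= end && end < NSLOTS);
    for (size_t i = start; i <= end; i++) {
        if (m->slots[i].state == SLOT_PLACEHOLDER)
            m->slots[i].state = SLOT_EMPTY;
        else
            m->slots[i].locked = false;
    }
}
```

Note that this version still touches one slot per page, which is the per-page cost discussed next.
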
O_DIRECT saves cpu by avoiding copies, but it also saves cpu by doing
fewer radix tree operations during massive IOs. Radix tree
insertion/deletion on 1MB O_DIRECT ios added ~10% system time on
my tiny little dual core box. I'm sure it would be much worse with
lock contention on a big numa machine, and the cost grows as the io
grows (SGI does massive O_DIRECT ios).
To help reduce radix churn, I made it possible for a single placeholder
entry to lock down a range in the radix:
http://thread.gmane.org/gmane.linux.file-systems/12263
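
To make the saving concrete, here is a toy userspace sketch of a range placeholder. It is not the code from the patch linked above: the table layout and all names (range_placeholder, range_lock, range_covers, ...) are hypothetical, and the 4KB page size used in the usage note is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_RANGES 8              /* arbitrary toy limit */

/* Hypothetical range placeholder: one entry covers [start, end]. */
struct range_placeholder {
    size_t start, end;            /* page indexes, inclusive */
    bool in_use;
};

struct range_table {
    struct range_placeholder r[MAX_RANGES];
};

/* Pages touched by a page-aligned IO of `bytes` bytes. */
size_t pages_in_io(size_t bytes, size_t page_size)
{
    return (bytes + page_size - 1) / page_size;
}

/* True if some placeholder covers idx, i.e. page cache users stay out. */
bool range_covers(const struct range_table *t, size_t idx)
{
    for (size_t i = 0; i < MAX_RANGES; i++)
        if (t->r[i].in_use && idx >= t->r[i].start && idx <= t->r[i].end)
            return true;
    return false;
}

/* Claim [start, end] with a single entry.
 * Returns the slot used, or -1 on overlap or a full table. */
int range_lock(struct range_table *t, size_t start, size_t end)
{
    int free_slot = -1;

    assert(start <= end);
    for (int i = 0; i < MAX_RANGES; i++) {
        const struct range_placeholder *p = &t->r[i];
        if (!p->in_use) {
            if (free_slot < 0)
                free_slot = i;
            continue;
        }
        if (start <= p->end && p->start <= end)
            return -1;            /* overlaps an existing range */
    }
    if (free_slot < 0)
        return -1;
    t->r[free_slot] = (struct range_placeholder){ start, end, true };
    return free_slot;
}

/* Drop the placeholder stored in `slot`. */
void range_unlock(struct range_table *t, int slot)
{
    assert(slot >= 0 && slot < MAX_RANGES);
    t->r[slot].in_use = false;
}
```

With 4KB pages, pages_in_io(1 << 20, 4096) is 256, so a 1MB io that would need 256 per-page insertions and 256 deletions is covered by one range_lock() and one range_unlock().
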
It looks to me as though vmsplice is going to have the same issues as my
early patches. The current splice code can avoid the copy but still
works in page-sized chunks. Also, splice doesn't support zero copy on
anything smaller than a page.
The compromise my patch makes is to hide placeholders from almost
everything except the DIO code. It may be worthwhile to turn the
placeholders into an IO marker that can be useful to filemap_fdatawrite
and friends.
It should be able to:
  - record the userland/kernel pages involved in a given io
  - map blocks from the FS for making a bio
  - start the io
  - wake people up when the io is done
This would allow splice to operate without stealing the userland page
(stealing would still be an option of course), and could get rid of big
chunks of fs/direct-io.c.
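
One way to picture the four duties listed above is a small state machine. This is a userspace sketch under loud assumptions: struct io_marker, its fields, and the marker_* functions are all hypothetical, the block mapping is a made-up linear function, and a counter stands in for a wait queue. Nothing here is taken from fs/direct-io.c or the splice code.

```c
#include <assert.h>
#include <stddef.h>

#define MARKER_MAX_PAGES 16       /* arbitrary toy limit */

enum marker_state { MARKER_IDLE, MARKER_MAPPED, MARKER_IN_FLIGHT, MARKER_DONE };

/* Hypothetical IO marker covering the four duties in order. */
struct io_marker {
    size_t page_index[MARKER_MAX_PAGES];   /* pages involved in the io */
    unsigned long block[MARKER_MAX_PAGES]; /* FS block for each page */
    size_t nr_pages;
    enum marker_state state;
    int waiters_woken;                     /* stand-in for a wait queue */
};

/* Callback that maps a page index to a block (assumed shape). */
typedef unsigned long (*map_block_fn)(size_t page_index);

/* Example mapping for the sketch: blocks laid out linearly from 1000. */
unsigned long toy_linear_map(size_t page_index)
{
    return 1000 + page_index;
}

/* Duty 1: record a page taking part in the io. */
int marker_add_page(struct io_marker *mk, size_t index)
{
    if (mk->state != MARKER_IDLE || mk->nr_pages == MARKER_MAX_PAGES)
        return -1;
    mk->page_index[mk->nr_pages++] = index;
    return 0;
}

/* Duty 2: ask the FS for the block behind each recorded page. */
int marker_map_blocks(struct io_marker *mk, map_block_fn map)
{
    if (mk->state != MARKER_IDLE || mk->nr_pages == 0)
        return -1;
    for (size_t i = 0; i < mk->nr_pages; i++)
        mk->block[i] = map(mk->page_index[i]);
    mk->state = MARKER_MAPPED;
    return 0;
}

/* Duty 3: start the io (a real version would build and submit a bio). */
int marker_start_io(struct io_marker *mk)
{
    if (mk->state != MARKER_MAPPED)
        return -1;
    mk->state = MARKER_IN_FLIGHT;
    return 0;
}

/* Duty 4: completion, wake whoever is waiting on the marker. */
void marker_end_io(struct io_marker *mk)
{
    assert(mk->state == MARKER_IN_FLIGHT);
    mk->state = MARKER_DONE;
    mk->waiters_woken++;
}
```

Because the marker only records which pages are involved, the userland page can stay where it is for the duration of the io, which is the non-stealing case mentioned above.
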
-chris