Message-ID: <Pine.LNX.4.64.0701120955440.3594@woody.osdl.org>
Date: Fri, 12 Jan 2007 10:06:22 -0800 (PST)
From: Linus Torvalds <torvalds@...l.org>
To: dean gaudet <dean@...tic.org>
cc: Viktor <vvp01@...ox.ru>, Aubrey <aubreylee@...il.com>,
Hua Zhong <hzhong@...il.com>, Hugh Dickins <hugh@...itas.com>,
linux-kernel@...r.kernel.org, hch@...radead.org,
kenneth.w.chen@...el.com, akpm@...l.org, mjt@....msk.ru
Subject: Re: O_DIRECT question
On Thu, 11 Jan 2007, dean gaudet wrote:
>
> it seems to me that if splice and fadvise and related things are
> sufficient for userland to take care of things "properly" then O_DIRECT
> could be changed into splice/fadvise calls either by a library or in the
> kernel directly...
The problem is two-fold:
- the fact that databases use O_DIRECT and all the commercial people are
perfectly happy to use a totally idiotic interface (and they don't care
about the problems) means that things like fadvise() don't actually
get the TLC. For example, the USEONCE thing isn't actually
_implemented_, even though from a design standpoint, it would in many
ways be preferable over O_DIRECT.
It's not just fadvise. It's a general problem for any new interfaces
where the old interfaces "just work" - never mind if they are nasty.
And O_DIRECT isn't actually all that nasty for users (although the
alignment restrictions are obviously irritating, but they are mostly
fundamental _hardware_ alignment restrictions, so..). It's only nasty
from a kernel internal security/serialization standpoint.
So in many ways, apps don't want to change, because they don't really
see the problems.
(And, as seen in this thread: uses like NFS don't see the problems
either, because there the serialization is done entirely somewhere
*else*, so the NFS people don't even understand why the whole interface
sucks in the first place)
- a lot of the problems with O_DIRECT come from its semantics. If we
could easily implement the O_DIRECT semantics using something else, we
would. But it's semantically not allowed to steal the user page, and it
has to wait for it to be all done with, because those are the semantics
of "write()".
So one of the advantages of vmsplice() and friends is literally that it
could allow page stealing, and allow the semantics where any changes to
the page (in user space) might make it to disk _after_ vmsplice() has
actually already returned, because we literally re-use the page (ie
it's fundamentally an async interface).
But again, fadvise and vmsplice etc aren't even getting the attention,
because right now they are only used by small programs (and generally not
done by people who also work on the kernel, and can see that it really
would be better to use more natural interfaces).
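
As a rough illustration of the fadvise-style alternative from the first point above, here is a minimal sketch. It assumes that POSIX_FADV_DONTNEED is used in place of the unimplemented USEONCE-style hint, and the helper name write_and_drop_cache() is hypothetical. It does an ordinary buffered write, flushes it, and then asks the kernel to drop the cached pages, which approximates the "don't pollute the page cache" effect people often want from O_DIRECT without the alignment rules. It is not a drop-in replacement for O_DIRECT semantics.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from the original message): write `len` bytes
 * through the page cache, flush them, then hint that the cached pages
 * are no longer needed. Returns 0 on success, -1 on failure.
 */
int write_and_drop_cache(int fd, const void *buf, size_t len)
{
    const char *p = buf;
    size_t left = len;

    assert(fd >= 0);

    /* Plain buffered write; loop because write() may be partial. */
    while (left > 0) {
        ssize_t n = write(fd, p, left);
        if (n <= 0)
            return -1;
        p += n;
        left -= (size_t)n;
    }

    /* Dirty pages cannot be dropped, so push the data out first. */
    if (fdatasync(fd) < 0)
        return -1;

    /* offset 0 with len 0 means "to the end of the file".
     * posix_fadvise() returns an error number, not -1. */
    if (posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED) != 0)
        return -1;

    return 0;
}
```

The hint is advisory only: the kernel is free to keep some or all of the pages.
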
> looking at the splice(2) api it seems like it'll be difficult to implement
> O_DIRECT pread/pwrite from userland using splice... so there'd need to be
> some help there.
You'd use vmsplice() to put the write buffers into kernel space (user
space sees it's a pipe file descriptor, but you should just ignore that:
it's really just a kernel buffer). And then splice the resulting kernel
buffers to the destination.
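
The two steps just described can be sketched roughly as follows. This is a minimal illustration, not a tuned implementation: the helper name write_via_splice() is hypothetical, and it assumes a Linux system where the destination file supports splice() and was not opened with O_APPEND. The pipe is used purely as the kernel buffer; vmsplice() maps the user pages into it and splice() moves them on to the file.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from the original message): send `len` bytes
 * from `buf` to `out_fd` at its current file offset, using a pipe as an
 * in-kernel buffer. Returns 0 on success, -1 on failure.
 */
int write_via_splice(int out_fd, const void *buf, size_t len)
{
    int pfd[2];
    int rc = -1;
    struct iovec iov = { .iov_base = (void *)buf, .iov_len = len };

    assert(out_fd >= 0);

    if (pipe(pfd) < 0)
        return -1;

    while (iov.iov_len > 0) {
        /* Step 1: reference the user pages from the pipe. A pipe holds
         * a limited number of buffers, so this may take only part. */
        ssize_t in = vmsplice(pfd[1], &iov, 1, 0);
        if (in <= 0)
            goto out;
        iov.iov_base = (char *)iov.iov_base + in;
        iov.iov_len -= (size_t)in;

        /* Step 2: drain the pipe into the destination file. */
        while (in > 0) {
            ssize_t moved = splice(pfd[0], NULL, out_fd, NULL,
                                   (size_t)in, SPLICE_F_MOVE);
            if (moved <= 0)
                goto out;
            in -= moved;
        }
    }
    rc = 0;
out:
    close(pfd[0]);
    close(pfd[1]);
    return rc;
}
```

Because the pipe only references the user pages until they are consumed, the caller should not modify buf while the call is in progress. This sketch drains the pipe before returning, so the buffer can be reused afterwards; a truly asynchronous variant would need its own rule for when the pages may be touched again.
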
Linus