Message-ID: <5c49b0ed0701140111p6fd2d60at7ec246738b887a73@mail.gmail.com>
Date: Sun, 14 Jan 2007 01:11:08 -0800
From: "Nate Diller" <nate.diller@...il.com>
To: "Andrew Morton" <akpm@...l.org>
Cc: andersen@...epoet.org, "Linus Torvalds" <torvalds@...l.org>,
"Michael Tokarev" <mjt@....msk.ru>,
"Chris Mason" <chris.mason@...cle.com>,
"dean gaudet" <dean@...tic.org>, Viktor <vvp01@...ox.ru>,
Aubrey <aubreylee@...il.com>, "Hua Zhong" <hzhong@...il.com>,
"Hugh Dickins" <hugh@...itas.com>, linux-kernel@...r.kernel.org,
hch@...radead.org, kenneth.w.chen@...el.com
Subject: Re: O_DIRECT question
On 1/12/07, Andrew Morton <akpm@...l.org> wrote:
> On Fri, 12 Jan 2007 15:35:09 -0700
> Erik Andersen <andersen@...epoet.org> wrote:
>
> > On Fri Jan 12, 2007 at 05:09:09PM -0500, Linus Torvalds wrote:
> > > I suspect a lot of people actually have other reasons to avoid caches.
> > >
> > > For example, the reason to do O_DIRECT may well not be that you want to
> > > avoid caching per se, but simply because you want to limit page cache
> > > activity. In which case O_DIRECT "works", but it's really the wrong thing
> > > to do. We could export other ways to do what people ACTUALLY want, that
> > > doesn't have the downsides.
> >
> > I was rather fond of the old O_STREAMING patch by Robert Love,
>
> That was an akpmpatch which I did for the Digeo kernel. Robert picked it
> up to dehackify it and get it into mainline, but we ended up deciding that
> posix_fadvise() was the way to go because it's standards-based.
>
> It's a bit more work in the app to use posix_fadvise() well. But the
> results will be better. The app should also use sync_file_range()
> intelligently to control its pagecache use.

There's an interesting note I should add here, because there's a
downside to using fadvise() instead of O_STREAMING when the programmer
is not careful. I spent at least a month doing some complex blktrace
analysis to figure out why Digeo's new platform (which used the
fadvise() call) didn't have the streaming performance it should have
had. One symptom I found was that even on the media partition, where
I/O should always have been happening in nice 512K chunks
(ra_pages == 128, with 4K pages), the request sizes seemed to be random
values between 32K and 512K. It turns out that the code pulls in a
chunk of some size, maybe 32K, then does an fadvise DONTNEED on the fd,
*with zero offset and zero length*. A zero length means "through the
end of the file", so the call wipes out *all* the pagecache for the
file. That means the rest of the 512K from the readahead would get
discarded before it got used, and later the remaining pages in the ra
window would have to be read in again.
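
To make the failure mode concrete, here is a minimal C sketch of the
two calls. The helper names drop_whole_file() and drop_consumed() are
made up for illustration and are not Digeo's code; posix_fadvise() and
POSIX_FADV_DONTNEED are the standard interface.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L /* exposes posix_fadvise() */
#endif
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>

/*
 * The problematic pattern: offset 0 with length 0 means "from the start
 * through the end of the file", so this asks the kernel to drop every
 * cached page of the file, including readahead pages that nobody has
 * consumed yet. (Hypothetical helper name.)
 */
int drop_whole_file(int fd)
{
    return posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
}

/*
 * Drop only [start, start + len), i.e. data the caller is finished with.
 * A non-positive len is refused, because len == 0 would silently turn
 * into "through the end of the file". (Hypothetical helper name.)
 * Returns 0 on success or an errno value, like posix_fadvise() itself.
 */
int drop_consumed(int fd, off_t start, off_t len)
{
    assert(start >= 0);
    if (len <= 0)
        return 0; /* nothing to drop */
    return posix_fadvise(fd, start, len, POSIX_FADV_DONTNEED);
}
```

With 32K reads, the caller would pass the offset and length of the
chunk it just finished, e.g. drop_consumed(fd, pos, 32 * 1024), which
leaves the rest of the 512K readahead window in the pagecache.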

Most applications don't get the kind of performance analysis that
Digeo was doing, and even then, we were rather lucky to catch that.
So I personally think it'd be best for libc or something similar to
emulate the O_STREAMING behavior if you ask for it. That would simplify
things for the most common case, and have the side benefit of reducing
the amount of extra code an application needs in order to take
advantage of the feature.
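
As a rough idea of what such a helper could look like, here is a
hypothetical drop-behind reader in C. The stream_file type and the
stream_open(), stream_read() and stream_close() functions are invented
names, not an existing libc API, and this is a sketch rather than a
reimplementation of the O_STREAMING patch. It assumes plain sequential
reads, and it releases only whole pages that have already been copied
out, so pages still waiting in the readahead window are left alone.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L /* exposes posix_fadvise() */
#endif
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical handle for a sequentially read file. */
struct stream_file {
    int   fd;
    off_t pos;     /* offset of the next byte to be read */
    off_t dropped; /* page-aligned; everything below was released */
};

/* Open path for streaming reads. Returns 0 on success, -1 on error. */
int stream_open(struct stream_file *sf, const char *path)
{
    assert(sf && path);
    sf->pos = 0;
    sf->dropped = 0;
    sf->fd = open(path, O_RDONLY);
    if (sf->fd < 0)
        return -1;
    /* Advisory only: tell the kernel we intend to read sequentially. */
    (void)posix_fadvise(sf->fd, 0, 0, POSIX_FADV_SEQUENTIAL);
    return 0;
}

/*
 * Like read(), but afterwards releases the fully consumed pages behind
 * the current position. The length passed to posix_fadvise() is always
 * positive, so it can never degrade into "the whole file".
 */
ssize_t stream_read(struct stream_file *sf, void *buf, size_t len)
{
    assert(sf && buf);
    ssize_t n = read(sf->fd, buf, len);
    if (n <= 0)
        return n; /* EOF or error: nothing new was consumed */
    sf->pos += n;

    long page = sysconf(_SC_PAGESIZE);
    off_t end = sf->pos - (sf->pos % page); /* last full page boundary */
    if (end > sf->dropped) {
        /* Advisory only: a failure here does not affect the data read. */
        (void)posix_fadvise(sf->fd, sf->dropped, end - sf->dropped,
                            POSIX_FADV_DONTNEED);
        sf->dropped = end;
    }
    return n;
}

/* Close the underlying descriptor. */
int stream_close(struct stream_file *sf)
{
    assert(sf);
    return close(sf->fd);
}
```

An application would then replace its open()/read()/close() calls on
the media file with these three functions and would not need to call
fadvise() itself.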
NATE