Message-ID: <20090423211305.GN2723@mit.edu>
Date: Thu, 23 Apr 2009 17:13:05 -0400
From: Theodore Tso <tytso@....edu>
To: Jamie Lokier <jamie@...reable.org>
Cc: Andrew Morton <akpm@...ux-foundation.org>,
Valerie Aurora Henson <vaurora@...hat.com>,
linux-fsdevel@...r.kernel.org, linux-kernel@...r.kernel.org,
Chris Mason <chris.mason@...cle.com>,
Eric Sandeen <sandeen@...hat.com>,
Ric Wheeler <rwheeler@...hat.com>,
Nick Piggin <npiggin@...e.de>
Subject: Re: fsync_range_with_flags() - improving sync_file_range()
On Thu, Apr 23, 2009 at 09:44:11PM +0100, Jamie Lokier wrote:
> Yes that's the page I've read and didn't find useful :-)
> The data-locating metadata is explained thus:
>
> None of these operations write out the file’s metadata. Therefore,
> unless the application is strictly performing overwrites of already-
> instantiated disk blocks, there are no guarantees that the data will be
> available after a crash.
Well, I thought that was clear. Today, sync_file_range(2) only works
if the data-locating metadata is already on the disk. This is useful
for databases where the tablespace is allocated ahead of time, but not
much else.
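
A minimal sketch of that database-style use, under stated assumptions:
the helper names preallocate_tablespace() and overwrite_and_sync() are
hypothetical, and the "tablespace" is instantiated by writing zeros and
calling fsync() once, so that later writes are pure overwrites. Only
pwrite(2), fsync(2) and sync_file_range(2) with its documented flags
are real interfaces here. It does not flush the disk's write cache.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: write zeros over the first `size` bytes and
 * fsync(), so the data blocks and the metadata that locates them are
 * on disk before any later overwrite. */
int preallocate_tablespace(int fd, off_t size)
{
    char zeros[4096];

    assert(size > 0);
    memset(zeros, 0, sizeof zeros);
    for (off_t off = 0; off < size; ) {
        size_t chunk = sizeof zeros;
        if ((off_t)chunk > size - off)
            chunk = (size_t)(size - off);
        ssize_t n = pwrite(fd, zeros, chunk, off);
        if (n <= 0)
            return -1;
        off += n;
    }
    return fsync(fd);
}

/* Hypothetical helper: overwrite [off, off + len) and then write out
 * just that range, waiting for the page writeback to finish. No file
 * metadata is written, so this is only meaningful for blocks that were
 * instantiated earlier (see preallocate_tablespace() above). */
int overwrite_and_sync(int fd, const void *buf, size_t len, off_t off)
{
    assert(len > 0);
    if (pwrite(fd, buf, len, off) != (ssize_t)len)
        return -1;
    return sync_file_range(fd, off, (off_t)len,
                           SYNC_FILE_RANGE_WAIT_BEFORE |
                           SYNC_FILE_RANGE_WRITE |
                           SYNC_FILE_RANGE_WAIT_AFTER);
}
```

A caller would run preallocate_tablespace() once when the file is
created and overwrite_and_sync() for each later update inside that
range; an append or a write into a hole would still need fdatasync().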
> But a kernel thread from Feb 2008 revealed the truth:
> sync_file_range() _doesn't_ commit data on such filesystems.
Because we could very easily add a flag which would cause it to commit
the data-locating metadata blocks --- or maybe we change it so that it
does commit the data-locating metadata, on the assumption that if the
data-locating metadata is already committed, which would be true for
all of its existing users, it's a no-op, and if it isn't, we should
just commit the data-locating metadata and add a call from the existing
implementation to a filesystem-provided method function.
> So sync_file_range() is basically useless as a data integrity
> operation. It's not a substitute for fdatasync(). Therefore why
> would you ever use it?
It's not useful *today*. But we could make it useful. The power of
the existing bit flags is useful, although granted it can be confusing
for users who haven't meditated deeply upon the writeback
code paths. I thought it was clear, but if it isn't we can improve
the documentation.
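
As an illustration of what the separate bit flags allow, here is a
small sketch that starts writeback for each chunk as it is written and
waits for all of it in one later call. The helper names
write_and_start() and wait_range() are hypothetical; the flag meanings
are those given in the sync_file_range(2) man page. As above, this
covers page writeback only, not metadata or device cache flushes.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: write a chunk and start asynchronous writeback
 * of its pages. SYNC_FILE_RANGE_WRITE alone initiates the I/O and does
 * not wait for it to complete. */
int write_and_start(int fd, const void *buf, size_t len, off_t off)
{
    assert(len > 0);
    if (pwrite(fd, buf, len, off) != (ssize_t)len)
        return -1;
    return sync_file_range(fd, off, (off_t)len, SYNC_FILE_RANGE_WRITE);
}

/* Hypothetical helper: wait for writeback that is already in flight on
 * [off, off + len). It starts no new I/O. */
int wait_range(int fd, off_t off, off_t len)
{
    assert(off >= 0 && len >= 0);
    return sync_file_range(fd, off, len, SYNC_FILE_RANGE_WAIT_BEFORE);
}
```

The point of the split is that the application can keep producing data
while earlier chunks are written out, and pay the wait only once at
the end.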
More to the point, given that we already have sync_file_range(2), I
would argue that it would be unfortunate to create a new system call
that has overlapping functionality but which is not a superset of
sync_file_range(2). Maybe Nick has a good reason for starting with an
entirely new system call, but if so, it would be nice if it at least
had the power of sync_file_range(2), in addition to having new
functionality.
> > But the interface does make a lot of sense. (But maybe that's because
> > I've spent too much time staring at all of the page writeback call
> > paths, and compared to that even string theory is pretty simple. :-)
>
> Yeah, sounds like you have studied both and gained the proper perspective :-)
>
> I suspect all the fsync-related uncertainty about whether it really
> works, including interactions with filesystem quirks, reliable and
> potential bugs in filesystems, would be much easier to get right if we
> only had a way to repeatably test it.
The answer today is that sync_file_range(2) is purely a creature of the MM
subsystem, and doesn't do anything with respect to filesystem metadata
or barriers. Once you understand that, the rest of the man page is
pretty simple, I think. :-)
Whether or not it should *continue* to be that way in the future is a
different discussion, of course.
> I'm thinking running a kernel inside a VM invoked and
> stopped/killed/branched is the only realistic way to test that all
> data is committed properly, with/without necessary I/O barriers, and
> recovers properly after a crash and resume. Fortunately we have good
> VMs now, such a test seems very doable. It would help with testing
> journalling & recovery behaviour too.
>
> Is there such a test or related tool already?
I don't know of one. I agree it would be a useful thing to have. It
won't test barriers at the driver level, but it would be good for
testing everything above that.
- Ted