Date:	Thu, 23 Apr 2009 17:13:05 -0400
From:	Theodore Tso <tytso@....edu>
To:	Jamie Lokier <jamie@...reable.org>
Cc:	Andrew Morton <akpm@...ux-foundation.org>,
	Valerie Aurora Henson <vaurora@...hat.com>,
	linux-fsdevel@...r.kernel.org, linux-kernel@...r.kernel.org,
	Chris Mason <chris.mason@...cle.com>,
	Eric Sandeen <sandeen@...hat.com>,
	Ric Wheeler <rwheeler@...hat.com>,
	Nick Piggin <npiggin@...e.de>
Subject: Re: fsync_range_with_flags() - improving sync_file_range()

On Thu, Apr 23, 2009 at 09:44:11PM +0100, Jamie Lokier wrote:
> Yes that's the page I've read and didn't find useful :-)
> The data-locating metadata is explained thus:
> 
>      None  of  these  operations  write out the file’s metadata.  Therefore,
>      unless the application is strictly performing  overwrites  of  already-
>      instantiated disk blocks, there are no guarantees that the data will be
>      available after a crash.

Well, I thought that was clear.  Today, sync_file_range(2) only works
if the data-locating metadata is already on the disk.  This is useful
for databases where the tablespace is allocated ahead of time, but not
much else.
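
To make that concrete, here's a rough, untested sketch of the database
pattern (the helper names are made up; the flags are the real ones):

    #define _GNU_SOURCE             /* for sync_file_range() */
    #include <fcntl.h>

    /* Instantiate every block at tablespace-creation time, so all
     * later writes are strict overwrites of already-allocated
     * blocks. */
    int tablespace_create(int fd, off_t size)
    {
            return posix_fallocate(fd, 0, size);
    }

    /* Overwrite-then-sync one region: push the data pages in
     * [off, off+len) and wait for them.  No file metadata is
     * committed and no disk barrier is issued. */
    int tablespace_sync(int fd, off_t off, off_t len)
    {
            return sync_file_range(fd, off, len,
                                   SYNC_FILE_RANGE_WAIT_BEFORE |
                                   SYNC_FILE_RANGE_WRITE |
                                   SYNC_FILE_RANGE_WAIT_AFTER);
    }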

> But a kernel thread from Feb 2008 revealed the truth:
> sync_file_range() _doesn't_ commit data on such filesystems.

Because we could very easily add a flag which would cause it to commit
the data-locating metadata blocks.  Or maybe we change it so that it
always commits the data-locating metadata: for all of its existing
users that metadata is already committed, so the change is a no-op for
them, and where it isn't committed, we just commit it by adding a call
from the existing implementation to a filesystem-provided method
function.
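
If we went the new-flag route, the userspace side might look something
like this.  Note that SYNC_FILE_RANGE_METADATA is purely hypothetical
(the value is made up), and today's kernels would reject the unknown
flag with EINVAL:

    #define _GNU_SOURCE
    #include <fcntl.h>

    /* HYPOTHETICAL: ask the kernel to also commit the data-locating
     * metadata (extent/indirect blocks) for the range.  No kernel
     * implements this flag today. */
    #define SYNC_FILE_RANGE_METADATA 0x8

    int sync_range_with_metadata(int fd, off_t off, off_t len)
    {
            return sync_file_range(fd, off, len,
                                   SYNC_FILE_RANGE_WRITE |
                                   SYNC_FILE_RANGE_WAIT_AFTER |
                                   SYNC_FILE_RANGE_METADATA);
    }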

> So sync_file_range() is basically useless as a data integrity
> operation.  It's not a substitute for fdatasync().  Therefore why
> would you ever use it?

It's not useful *today*.  But we could make it useful.  The existing
bit flags are powerful, although granted they can be confusing for
users who haven't meditated deeply upon the writeback code paths.  I
thought it was clear, but if it isn't we can improve the
documentation.
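
For example, the flags let you do overlapped write-behind, which a
single fdatasync() can't express.  Roughly (untested sketch, helper
names invented):

    #define _GNU_SOURCE
    #include <fcntl.h>

    /* Kick off asynchronous writeback of [off, off+len) without
     * blocking the writer. */
    int start_writeback(int fd, off_t off, off_t len)
    {
            return sync_file_range(fd, off, len,
                                   SYNC_FILE_RANGE_WRITE);
    }

    /* Later, wait for the writeback that was initiated above.
     * Pages dirtied in the meantime are not written out. */
    int wait_writeback(int fd, off_t off, off_t len)
    {
            return sync_file_range(fd, off, len,
                                   SYNC_FILE_RANGE_WAIT_BEFORE);
    }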

More to the point, given that we already have sync_file_range(2), I
would argue that it would be unfortunate to create a new system call
that has overlapping functionality but which is not a superset of
sync_file_range(2).  Maybe Nick has a good reason for starting with an
entirely new system call, but if so, it would be nice if it at least
had the power of sync_file_range(2), in addition to offering new
functionality.

> > But the interface does make a lot of sense.  (But maybe that's because
> > I've spent too much time staring at all of the page writeback call
> > paths, and compared to that even string theory is pretty simple.  :-)
> 
> Yeah, sounds like you have studied both and gained the proper perspective :-)
> 
> I suspect all the fsync-related uncertainty about whether it really
> works, including interactions with filesystem quirks, reliable and
> potential bugs in filesystems, would be much easier to get right if we
> only had a way to repeatably test it.

The answer today is that sync_file_range(2) is purely a creature of
the MM subsystem, and doesn't do anything with respect to filesystem
metadata or barriers.  Once you understand that, the rest of the man
page is pretty simple, I think.  :-)

Whether or not it should *continue* to be that way in the future is a
different discussion, of course.
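
In the meantime, if you need real integrity, the sequence still has to
end in fdatasync(); something like this (untested sketch):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <unistd.h>

    /* sync_file_range() here is only a writeback hint to get the
     * I/O started early; fdatasync() provides the actual metadata
     * commit and, on filesystems that issue them, the barrier. */
    int durable_pwrite(int fd, const void *buf, size_t len, off_t off)
    {
            if (pwrite(fd, buf, len, off) != (ssize_t) len)
                    return -1;
            sync_file_range(fd, off, len, SYNC_FILE_RANGE_WRITE);
            return fdatasync(fd);
    }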

> I'm thinking running a kernel inside a VM invoked and
> stopped/killed/branched is the only realistic way to test that all
> data is committed properly, with/without necessary I/O barriers, and
> recovers properly after a crash and resume.  Fortunately we have good
> VMs now, such a test seems very doable.  It would help with testing
> journalling & recovery behaviour too.
> 
> Is there such a test or related tool already?

I don't know of one.  I agree it would be a useful thing to have.  It
won't test barriers at the driver level, but it would be good for
testing everything above that.

						- Ted