linux-kernel - Re: Proposal for "proper" durable fsync() and fdatasync()

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-Id: <20080226082725.b65365f0.akpm@linux-foundation.org>
Date:	Tue, 26 Feb 2008 08:27:25 -0800
From:	Andrew Morton <akpm@...ux-foundation.org>
To:	Jamie Lokier <jamie@...reable.org>
Cc:	Jörn Engel <joern@...fs.org>,
	Nick Piggin <nickpiggin@...oo.com.au>,
	linux-kernel@...r.kernel.org, linux-fsdevel@...r.kernel.org,
	Chris Wedgwood <cw@...f.org>
Subject: Re: Proposal for "proper" durable fsync() and fdatasync()

On Tue, 26 Feb 2008 15:07:45 +0000 Jamie Lokier <jamie@...reable.org> wrote:

> SYNC_FILE_RANGE_WRITE scans all pages in the range, looking for dirty
> pages which aren't already queued for write-out.  It marks those with
> a "write-out" flag, and starts write I/Os at some unspecified time in
> the near future; it can be assumed writes for all the pages will
> complete eventually if there's no errors.  When I/O completes on a
> page, it cleans the page and also clears the write-out flag.
> 
> SYNC_FILE_RANGE_WAIT_AFTER waits until all pages in the range don't
> have the "write-out" flag set.
> 
> SYNC_FILE_RANGE_WAIT_BEFORE does the same wait, but before marking
> pages for write-out.  I don't actually see the point in this.  Isn't a
> preceding call with SYNC_FILE_RANGE_WAIT_AFTER equivalent, making
> BEFORE a redundant flag?

Consider the case of pages which are dirty but are already under writeout. 
ie: someone redirtied the page after someone started writing the page out. 
For these pages the kernel needs to

a) wait for the current writeout to complete

b) start new writeout

c) wait for that writeout to complete.

those are the three stages of sync_file_range().  They are independently
selectable and various combinations provide various results.

The reason for providing b) only (SYNC_FILE_RANGE_WRITE) is so that
userspace can get as much data into the queue as possible, to permit the
kernel to optimise IO scheduling better.

If you perform a) and b) together
(SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE) then you are guaranteed
that all data which was dirty when sync_file_range() executed will be sent
into the queue, but you won't get as much data into the queue if the kernel
encounters dirty, under-writeout pages.  This is especially hurtful if
you're trying to feed a lot of little segments into the queue.  In that
case perhaps userspace should do an asynchrnous pass
(SYNC_FILE_RANGE_WRITE) to stuff as much data as poss into the queue, then
a SYNC_FILE_RANGE_WAIT_AFTER pass then a
SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE|SYNC_FILE_RANGE_WAIT_AFTER
pass to clean up any stragglers.  WHich mode is best very much depends on
the application's file dirtying patterns.  One would have to experiment
with it, and tuning of sync_file_range() usage would occur alongside tuning
of the application's write() design.

It's an interesting problem, with potentially high payback.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/