linux-kernel - Re: [PATCH] fsync: wait for data writeout completion before calling ->fsync

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20090903104704.GD5060@duck.novell.com>
Date:	Thu, 3 Sep 2009 12:47:04 +0200
From:	Jan Kara <jack@...e.cz>
To:	Christoph Hellwig <hch@....de>
Cc:	Jan Kara <jack@...e.cz>, linux-fsdevel@...r.kernel.org,
	LKML <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH] fsync: wait for data writeout completion before
	calling ->fsync

On Thu 03-09-09 00:18:38, Christoph Hellwig wrote:
> I think we should add this one ontop:
  Agreed. Added to the series. I'll now push the whole series to linux-next
via my fs tree.

								Honza

> 
> -- 
> Subject: [PATCH] fsync: wait for data writeout completion before calling ->fsync
> From: Christoph Hellwig <hch@....de>
> 
> Currenly vfs_fsync(_range) first calls filemap_fdatawrite to write out
> the data, the calls into ->fsync to write out the metadata and then finally
> calls filemap_fdatawait to wait for the data I/O to complete.  What sounds
> like a clever micro-optimization actually is nast trap for many filesystems.
> 
> For many modern filesystems i_size or other inode information is only
> updated on I/O completion and we need to wait for I/O to finish before
> we can write out the metadata.  For old fashionen filesystems that
> instanciate blocks during the actual write and also update the metadata
> at that point it opens up a large window were we could expose uninitialized
> blocks after a crash.  While a few filesystems that need it already wait
> for the I/O to finish inside their ->fsync methods it is rather suboptimal
> as it is done under the i_mutex and also always for the whole file instead
> of just a part as we could do for O_SYNC handling.
> 
> Here is a small audit of all fsync instances in the tree:
> 
>  - spufs_mfc_fsync:
>  - ps3flash_fsync: 
>  - vol_cdev_fsync:
>  - printer_fsync:
>  - fb_deferred_io_fsync:
>  - bad_file_fsync:
>  - simple_sync_file:
> 
> 	don't care - filesystems/drivers do't use the page cache or are
> 	purely in-memory.
> 
>  - simple_fsync:
>  - file_fsync:
>  - affs_file_fsync:
>  - fat_file_fsync:
>  - jfs_fsync:
>  - ubifs_fsync:
>  - reiserfs_dir_fsync:
>  - reiserfs_sync_file:
> 
> 	never touch pagecache themselves.  We need to wait before if we do
> 	not want to expose stale data after an allocation.
> 
>  - afs_fsync:
>  - fuse_fsync_common:
> 
> 	do the waiting writeback itself in awkward ways, would benefit from
> 	proper semantics
> 
>  - block_fsync:
> 
> 	Does a filemap_write_and_wait on the block device inode.  Because we
> 	now have f_mapping that is the same inode we call it on in vfs_fsync.
> 	So just removing it and letting the VFS do the work in one go would
> 	be an improvement.
> 
>  - btrfs_sync_file:
>  - cifs_fsync:
>  - xfs_file_fsync:
> 
> 	need the wait first and currently do it themselves. would benefit from
> 	doing it outside i_mutex.
> 
>  - coda_fsync:
>  - ecryptfs_fsync:
>  - exofs_file_fsync:
>  - shm_fsync:
> 
> 	only passes the fsync through to the lower layer
> 
>  - ext3_sync_file:
> 
> 	doesn't seem to care, comments are confusing.
> 
>  - ext4_sync_file:
> 
> 	would need the wait to work correctly for delalloc mode with late
> 	i_size updates.  Otherwise the ext3 comment applies.
> 
> 
> 	currently implemens it's own writeback and wait in an odd way,
> 	could benefit from doing it properly.
> 
>  - gfs2_fsync:
> 
> 	not needed for journaled data mode, but probably harmless there.
> 	Currently writes back data asynchronously itself.  Needs some
> 	major audit.
> 
>  - hostfs_fsync:
> 
> 	just calls fsync/datasync on the host FD.  Without the wait before
> 	data might not even be inflight yet if we're unlucky.
> 
>  - hpfs_file_fsync:
>  - ncp_fsync:
> 
> 	no-ops.  Dangerous before and after.
> 
>  - jffs2_fsync:
> 
> 	just calls jffs2_flush_wbuf_gc, not sure how this relates to data.
> 
>  - nfs_fsync_dir:
> 
> 	just increments stats, claims all directory operations are synchronous
> 
>  - nfs_file_fsync:
> 
> 	only writes out data???  Looks very odd.
> 
>  - nilfs_sync_file:
> 
> 	looks like it expects all data done, but not sure from the code
> 
>  - ntfs_dir_fsync:
>  - ntfs_file_fsync:
> 
> 	appear to do their own data writeback.  Very convoluted code.
> 
>  - ocfs2_sync_file:
> 
> 	does it's own data writeback, but no wait.  probably needs the wait.
> 
>  - smb_fsync:
> 
> 	according to a comment expects all pages written already, probably needs
> 	the wait before.
> 
> 
> This patch only changes vfs_fsync_range, removal of the wait in the methods
> that have it is left to the filesystem maintainers.  Note that most
> filesystems really do need an audit for their fsync methods given the
> gems found in this very brief audit.
> 
> Signed-off-by: Christoph Hellwig <hch@....de>
> 
> Index: linux-2.6/fs/sync.c
> ===================================================================
> --- linux-2.6.orig/fs/sync.c	2009-09-02 15:03:41.073271287 -0300
> +++ linux-2.6/fs/sync.c	2009-09-02 15:04:34.401269249 -0300
> @@ -216,7 +216,7 @@ int vfs_fsync_range(struct file *file, s
>  		goto out;
>  	}
>  
> -	ret = filemap_fdatawrite_range(mapping, start, end);
> +	ret = filemap_write_and_wait_range(mapping, start, end);
>  
>  	/*
>  	 * We need to protect against concurrent writers, which could cause
> @@ -228,9 +228,6 @@ int vfs_fsync_range(struct file *file, s
>  		ret = err;
>  	mutex_unlock(&mapping->host->i_mutex);
>  
> -	err = filemap_fdatawait_range(mapping, start, end);
> -	if (!ret)
> -		ret = err;
>  out:
>  	return ret;
>  }
-- 
Jan Kara <jack@...e.cz>
SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/