linux-ext4 - Re: Could you help with move-on-write review?

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <AANLkTikTnN3tJOZ9tL+MOzQuTwH8TgiL7awXzRoX392L@mail.gmail.com>
Date:	Thu, 25 Nov 2010 22:24:42 +0200
From:	Amir Goldstein <amir73il@...il.com>
To:	Theodore Tso <tytso@....edu>, Eric Sandeen <sandeen@...hat.com>,
	Lukas Czerner <lczerner@...hat.com>,
	Steven Whitehouse <swhiteho@...hat.com>
Cc:	Ext4 Developers List <linux-ext4@...r.kernel.org>
Subject: Re: Could you help with move-on-write review?

On Thu, Nov 25, 2010 at 4:11 PM, Amir Goldstein <amir73il@...il.com> wrote:
> Hi guys,
>
> I would like to ask you as a big favor to review this patch,
> not as candidate for inclusion in mainline of course, but as a code/design
> review for my new move-on-write hooks. hopefully, some of this code will
> find it's way to ext4 some day soon.
>
> The problems with the old data hooks, which I may have mentioned, are:
> 1. second write to mmaped page did not call get_block() to trigger
>    move-on-write.
> 2. quota_write() called next3_bread(create=1) without notifying this
>    is a partial write and move-on-write has corrupted it's data,
>    resulting in broken quota file.
> 3. direct I/O was not supported.
>
> The new design uses 3 buffer head flags to signal get_block() about
> move-on-write:
> 1. buffer_move_data() is an explicit request to trigger move-on-write.
>     users which do not declare move_data (like quota_write) will not
>    trigger move-on-write. previous hooks would do move-on-write on
>    all regular files data without an explicit request.
>     new design suggests that we rather corrupt snapshot data
>    (missed move-on-write) then corrupt the file system live data
>    (false positive move-on-write).
> 2. buffer_partial_write() signals that in case of move-on-write, the
> old block data
>    needs to be read, before mapping the buffer to a new block.
> 3. buffer_direct_io() signals that in case of move-on-write or hole filling,
>    the buffer should not be mapped on return.
>
> There are 4 places I found that map data blocks and should trigger
> move-on-write:
> 1. write_begin() sets move_data and optionally partial_write flags and
>    unmaps buffers
> 2. ordered_writepage() sets move_data flag and unmaps buffers
>    (snapshots only work with data=ordered)
> 3. truncate_page() sets move_data flag (and calls next3_get_block() itself)
> 4. next3_get_block() sets direct_io and move_data flags when (create && !handle)
>
> Know issue:
> snapshot copy of quota files is not consistent with snapshot'ed
> file system (can be fixed, but is it relevant for read-only mount?)
>
> you should know that the core move-on-write code in
> get_blocks_handle() has not changed much
> (only the move_data and direct_io flag checks were added) and it has
> been used for quite some time now,
> so I do not expect to find major bugs in the block move mechanism itself.
>
> I revised my 'next3 test' scripts to rewrite random data to files via direct I/O
> and mmap to test the new move-on-write hooks in next3_get_block() and in
> ordered_writepage() and it all seems to work fine.
> quotas also seems to work fine now.
>
> I understand if this review is not on the top of your priorities,
> but if you can find the time for it, I would very much appreciate it.
>

And shortly thereafter comes v2...
tried to apply to kernel 2.6.32 and realized that DIO_SKIP_HOLES did not exist,
so I simplified the hook further by introducing a special
get_block_dio() function
to deal only with direct I/O writes:


===next3_snapshot_hooks_data.patch===

next3: snapshot hooks - move data blocks

Before every regular file data buffer write,
the function next3_get_block() is called to map the buffer to disk.
We use this hook to call the snapshot API snapshot_get_move_access(),
to optionally move the block to the snapshot file.

Signed-off-by: Amir Goldstein <amir73il@...rs.sf.net>

--------------------------------------------------------------------------------
diff -Nuarp a/fs/next3/inode.c b/fs/next3/inode.c
--- a/fs/next3/inode.c	2010-11-25 21:58:39.000000000 +0200
+++ b/fs/next3/inode.c	2010-11-25 21:58:39.000000000 +0200
@@ -827,6 +827,43 @@ int next3_get_blocks_handle(handle_t *ha

 	partial = next3_get_branch(inode, depth, offsets, chain, &err);

+	if (!partial && create && buffer_move_data(bh_result)) {
+		BUG_ON(!next3_snapshot_should_move_data(inode));
+		first_block = le32_to_cpu(chain[depth - 1].key);
+		blocks_to_boundary = 0;
+		/* should move 1 data block to snapshot? */
+		err = next3_snapshot_get_move_access(handle, inode,
+				first_block, 0);
+		if (err)
+			/* do not map found block */
+			partial = chain + depth - 1;
+		if (err < 0)
+			/* cleanup the whole chain and exit */
+			goto cleanup;
+		if (buffer_direct_io(bh_result)) {
+			/* suppress direct I/O write to block that needs to be moved */
+			err = 0;
+			goto cleanup;
+		}
+		if (err > 0)
+			/* check again under truncate_mutex */
+			err = -EAGAIN;
+	}
+	if (partial && create && buffer_direct_io(bh_result)) {
+		/* suppress direct I/O write to holes */
+		loff_t end = ((iblock + maxblocks - 1) << inode->i_blkbits) + 1;
+		/*
+		 * we do not know the original write length, but it has to be at least
+		 * 1 byte into the last requested block. if the minimal length write
+		 * isn't going to extend i_size, we must be cautious and assume that
+		 * direct I/O is async and refuse to fill the hole.
+		 */
+		if (end <= inode->i_size) {
+			err = 0;
+			goto cleanup;
+		}
+	}
+
 	/* Simplest case - block found, no allocation needed */
 	if (!partial) {
 		first_block = le32_to_cpu(chain[depth - 1].key);
@@ -883,6 +920,20 @@ int next3_get_blocks_handle(handle_t *ha
 			partial--;
 		}
 		partial = next3_get_branch(inode, depth, offsets, chain, &err);
+		if (!partial &&  buffer_move_data(bh_result)) {
+			BUG_ON(!next3_snapshot_should_move_data(inode));
+			first_block = le32_to_cpu(chain[depth - 1].key);
+			blocks_to_boundary = 0;
+			/* should move 1 data block to snapshot? */
+			err = next3_snapshot_get_move_access(handle, inode,
+					first_block, 0);
+			if (err)
+				/* re-allocate 1 data block */
+				partial = chain + depth - 1;
+			if (err < 0)
+				/* cleanup the whole chain and exit */
+				goto out_mutex;
+		}
 		if (!partial) {
 			count++;
 			mutex_unlock(&ei->truncate_mutex);
@@ -919,6 +970,43 @@ int next3_get_blocks_handle(handle_t *ha
 	if (err)
 		goto out_mutex;

+	if (*(partial->p)) {
+		int ret;
+
+		/* old block is being replaced with a new block */
+		if (buffer_partial_write(bh_result) &&
+				!buffer_uptodate(bh_result)) {
+			/* read old block data before moving it to snapshot */
+			map_bh(bh_result, inode->i_sb,
+					le32_to_cpu(*(partial->p)));
+			ll_rw_block(READ, 1, &bh_result);
+			wait_on_buffer(bh_result);
+			/* clear old block mapping */
+			clear_buffer_mapped(bh_result);
+			if (!buffer_uptodate(bh_result)) {
+				err = -EIO;
+				goto out_mutex;
+			}
+		}
+
+		if (buffer_partial_write(bh_result))
+			/* prevent zero out of page in block_write_begin() */
+			SetPageUptodate(bh_result->b_page);
+
+		/* move old block to snapshot */
+		ret = next3_snapshot_get_move_access(handle, inode,
+				le32_to_cpu(*(partial->p)), 1);
+		if (ret < 1) {
+			/* failed to move to snapshot - free new block */
+			next3_free_blocks(handle, inode,
+					le32_to_cpu(partial->key), 1);
+			err = ret ? : -EIO;
+			goto out_mutex;
+		}
+		/* block moved to snapshot - continue to splice new block */
+		err = 0;
+	}
+
 	/*
 	 * The next3_splice_branch call will free and forget any buffers
 	 * on the new chain if there is a failure, but that risks using
@@ -953,6 +1041,23 @@ out:
 	return err;
 }

+/* Simple get block for everything except direct I/O write */
+static int next3_get_block(struct inode *inode, sector_t iblock,
+			struct buffer_head *bh_result, int create)
+{
+	handle_t *handle = next3_journal_current_handle();
+	unsigned max_blocks = bh_result->b_size >> inode->i_blkbits;
+	int ret;
+
+	ret = next3_get_blocks_handle(handle, inode, iblock,
+					max_blocks, bh_result, create);
+	if (ret > 0) {
+		bh_result->b_size = (ret << inode->i_blkbits);
+		ret = 0;
+	}
+	return ret;
+}
+
 /* Maximum number of blocks we map for direct IO at once. */
 #define DIO_MAX_BLOCKS 4096
 /*
@@ -964,13 +1069,31 @@ out:
  */
 #define DIO_CREDITS 25

-static int next3_get_block(struct inode *inode, sector_t iblock,
+static int next3_get_block_dio(struct inode *inode, sector_t iblock,
 			struct buffer_head *bh_result, int create)
 {
 	handle_t *handle = next3_journal_current_handle();
 	int ret = 0, started = 0;
 	unsigned max_blocks = bh_result->b_size >> inode->i_blkbits;

+	BUG_ON(handle != NULL);
+	if (NEXT3_HAS_RO_COMPAT_FEATURE(inode->i_sb,
+				NEXT3_FEATURE_RO_COMPAT_HAS_SNAPSHOT)) {
+		/*
+		 * DIO_SKIP_HOLES may ask to map direct I/O writes with create=0.
+		 * We need to change it to create=1, so that we can fall back to
+		 * buffered I/O when data blocks need to be moved to snapshot.
+		 */
+		create = 1;
+		/*
+		 * signal next3_get_blocks_handle() to return unmapped block if block
+		 * is not allocated or if it needs to be moved to snapshot.
+		 */
+		set_buffer_direct_io(bh_result);
+		if (next3_snapshot_should_move_data(inode))
+			set_buffer_move_data(bh_result);
+	}
+
 	if (create && !handle) {	/* Direct IO write... */
 		if (max_blocks > DIO_MAX_BLOCKS)
 			max_blocks = DIO_MAX_BLOCKS;
@@ -1166,6 +1289,71 @@ static void next3_truncate_failed_write(
 	next3_truncate(inode);
 }

+/*
+ * Check if a buffer was written since the last snapshot was taken.
+ * In data=ordered, the only mode supported by next3, all dirty data buffers
+ * are flushed on snapshot take via freeze_fs() API, so buffer_jbd(bh) means
+ * that, the buffer was declared dirty data after snapshot take.
+ */
+static int buffer_first_write(handle_t *handle, struct buffer_head *bh)
+{
+	return !buffer_jbd(bh);
+}
+
+static int set_move_data(handle_t *handle, struct buffer_head *bh)
+{
+	BUG_ON(buffer_move_data(bh));
+	clear_buffer_mapped(bh);
+	set_buffer_move_data(bh);
+	return 0;
+}
+
+static int set_partial_write(handle_t *handle, struct buffer_head *bh)
+{
+	BUG_ON(buffer_partial_write(bh));
+	set_buffer_partial_write(bh);
+	return 0;
+}
+
+static void set_page_move_data(struct page *page, unsigned from, unsigned to)
+{
+	struct buffer_head *page_bufs = page_buffers(page);
+
+	BUG_ON(!page_has_buffers(page));
+	/*
+	 * make sure that get_block() is called even for mapped buffers,
+	 * but not if all buffers were written since last snapshot take.
+	 */
+	if (walk_page_buffers(NULL, page_bufs, from, to,
+				NULL, buffer_first_write)) {
+		/* signal get_block() to move-on-write */
+		walk_page_buffers(NULL, page_bufs, from, to,
+				NULL, set_move_data);
+		if (from > 0 || to < PAGE_CACHE_SIZE)
+			/* signal get_block() to update page before move-on-write */
+			walk_page_buffers(NULL, page_bufs, from, to,
+					NULL, set_partial_write);
+	}
+}
+
+static int clear_move_data(handle_t *handle, struct buffer_head *bh)
+{
+	clear_buffer_partial_write(bh);
+	clear_buffer_move_data(bh);
+	return 0;
+}
+
+static void clear_page_move_data(struct page *page)
+{
+	/*
+	 * partial_write/move_data flags are used to pass the move data block
+	 * request to next3_get_block() and should be cleared at all other times.
+	 */
+	BUG_ON(!page_has_buffers(page));
+	walk_page_buffers(NULL, page_buffers(page), 0, PAGE_CACHE_SIZE,
+			NULL, clear_move_data);
+}
+
 static int next3_write_begin(struct file *file, struct address_space *mapping,
 				loff_t pos, unsigned len, unsigned flags,
 				struct page **pagep, void **fsdata)
@@ -1198,8 +1386,21 @@ retry:
 		ret = PTR_ERR(handle);
 		goto out;
 	}
+	/*
+	 * only data=ordered mode is supported with snapshots, so the
+	 * buffer heads are going to be attached sooner or later anyway.
+	 */
+	if (!page_has_buffers(page))
+		create_empty_buffers(page, inode->i_sb->s_blocksize, 0);
+	/*
+	 * Check if blocks need to be moved-on-write. if they do, unmap buffers
+	 * and call block_write_begin() to remap them.
+	 */
+	if (next3_snapshot_should_move_data(inode))
+		set_page_move_data(page, from, to);
 	ret = block_write_begin(file, mapping, pos, len, flags, pagep, fsdata,
 							next3_get_block);
+	clear_page_move_data(page);
 	if (ret)
 		goto write_begin_failed;

@@ -1546,6 +1747,12 @@ static int next3_ordered_writepage(struc
 				(1 << BH_Dirty)|(1 << BH_Uptodate));
 		page_bufs = page_buffers(page);
 	} else {
+		/*
+		 * Check if blocks need to be moved-on-write. if they do, unmap buffers
+		 * and fall through to get_block() path.
+		 */
+		if (next3_snapshot_should_move_data(inode))
+			set_page_move_data(page, 0, PAGE_CACHE_SIZE);
 		page_bufs = page_buffers(page);
 		if (!walk_page_buffers(NULL, page_bufs, 0, PAGE_CACHE_SIZE,
 				       NULL, buffer_unmapped)) {
@@ -1565,6 +1772,7 @@ static int next3_ordered_writepage(struc
 			PAGE_CACHE_SIZE, NULL, bget_one);

 	ret = block_write_full_page(page, next3_get_block, wbc);
+	clear_page_move_data(page);

 	/*
 	 * The page can become unlocked at any point now, and
@@ -1778,7 +1986,8 @@ static ssize_t next3_direct_IO(int rw, s
 retry:
 	ret = blockdev_direct_IO(rw, iocb, inode, inode->i_sb->s_bdev, iov,
 				 offset, nr_segs,
-				 next3_get_block, NULL);
+				 (rw == WRITE) ? next3_get_block_dio : next3_get_block,
+				 NULL);
 	if (ret == -ENOSPC && next3_should_retry_alloc(inode->i_sb, &retries))
 		goto retry;

@@ -1964,6 +2173,21 @@ static int next3_block_truncate_page(han
 			goto unlock;
 	}

+	/* check if block needs to be moved to snapshot before zeroing */
+	if (next3_snapshot_should_move_data(inode) &&
+			buffer_first_write(NULL, bh)) {
+		set_buffer_move_data(bh);
+		err = next3_get_block(inode, iblock, bh, 1);
+		clear_buffer_move_data(bh);
+		if (err)
+			goto unlock;
+		if (buffer_new(bh)) {
+			unmap_underlying_metadata(bh->b_bdev,
+					bh->b_blocknr);
+			clear_buffer_new(bh);
+		}
+	}
+
 	if (next3_should_journal_data(inode)) {
 		BUFFER_TRACE(bh, "get write access");
 		err = next3_journal_get_write_access(handle, bh);
diff -Nuarp a/fs/next3/snapshot.h b/fs/next3/snapshot.h
--- a/fs/next3/snapshot.h	2010-11-25 21:58:40.000000000 +0200
+++ b/fs/next3/snapshot.h	2010-11-25 21:58:39.000000000 +0200
@@ -97,6 +97,15 @@
 #define SNAPSHOT_SET_DISABLED(inode)		\
 	i_size_write((inode), 0)

+enum next3_bh_state_bits {
+	BH_Partial_Write = 29,	/* Buffer should be uptodate before write */
+	BH_Direct_IO = 30,		/* Buffer is under direct I/O */
+	BH_Move_Data = 31,		/* Data block may need to be moved-on-write */
+};
+
+BUFFER_FNS(Partial_Write, partial_write)
+BUFFER_FNS(Direct_IO, direct_io)
+BUFFER_FNS(Move_Data, move_data)


 #define next3_snapshot_cow(handle, inode, bh, cow) 0
@@ -158,6 +167,31 @@ static inline int next3_snapshot_get_cre
 }

 /*
+ * get_move_access() - move block to snapshot
+ * @handle:	JBD handle
+ * @inode:	owner of @block
+ * @block:	address of @block
+ * @move:	if false, only test if @block needs to be moved
+ *
+ * Called from next3_get_blocks_handle() before overwriting a data block,
+ * when buffer_move() is true.  Specifically, only data blocks of
regular files,
+ * whose data is not being journaled are moved on full page write.
+ * Journaled data blocks are COWed on get_write_access().
+ * Snapshots and excluded files blocks are never moved-on-write.
+ * If @move is true, then truncate_mutex is held.
+ *
+ * Return values:
+ * = 1 - @block was moved or may not be overwritten
+ * = 0 - @block may be overwritten
+ * < 0 - error
+ */
+static inline int next3_snapshot_get_move_access(handle_t *handle,
+		struct inode *inode, next3_fsblk_t block, int move)
+{
+	return next3_snapshot_move(handle, inode, block, 1, move);
+}
+
+/*
  * get_delete_access() - move count blocks to snapshot
  * @handle:	JBD handle
  * @inode:	owner of blocks
@@ -216,6 +250,19 @@ extern next3_fsblk_t next3_get_inode_blo



+/*
+ * check if the data blocks of @inode should be moved-on-write
+ */
+static inline int next3_snapshot_should_move_data(struct inode *inode)
+{
+	if (!NEXT3_HAS_RO_COMPAT_FEATURE(inode->i_sb,
+				NEXT3_FEATURE_RO_COMPAT_HAS_SNAPSHOT))
+		return 0;
+	/* when a data block is journaled, it is already COWed as metadata */
+	if (next3_should_journal_data(inode))
+		return 0;
+	return 1;
+}
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html