[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <4C3156D0.5090001@oracle.com>
Date: Mon, 05 Jul 2010 11:51:44 +0800
From: Tao Ma <tao.ma@...cle.com>
To: Dave Chinner <david@...morbit.com>,
Linus Torvalds <torvalds@...ux-foundation.org>,
Linux Kernel <linux-kernel@...r.kernel.org>,
ocfs2-devel@....oracle.com, Dave Chinner <dchinner@...hat.com>,
Christoph Hellwig <hch@....de>, Mark Fasheh <mfasheh@...e.com>,
Joel Becker <joel.becker@...cle.com>
Subject: Re: [PATCH 1/2] ocfs2: Zero the tail cluster when extending past
i_size v2
Hi Joel,
On 07/04/2010 05:32 AM, Joel Becker wrote:
> Here's the second version of my corruption fix. It fixes two
> bugs:
>
> 1) i_size can obviously be at a place that is a hole, so don't BUG on
> that.
> 2) Fix an off-by-one when checking whether the write position is within
> the tail allocation.
>
> This version passes my tail corruption test as well as the kernel
> compile that exposed the two bugs above.
>
> Joel
>
> ---------------------------------------------------------------
>
> ocfs2's allocation unit is the cluster. This can be larger than a block
> or even a memory page. This means that a file may have many blocks in
> its last extent that are beyond the block containing i_size.
>
> When ocfs2 grows a file, it zeros the entire cluster in order to ensure
> future i_size growth will see cleared blocks. Unfortunately,
> block_write_full_page() drops the pages past i_size. This means that
> ocfs2 is actually leaking garbage data into the tail end of that last
> cluster.
>
> We adjust ocfs2_write_begin_nolock() and ocfs2_extend_file() to detect
> when a write or truncate is past i_size. If there is any existing
> allocation between the block containing the current i_size and the
> location of the write or truncate, zeros will be written to that
> allocation.
>
> This is only for sparse filesystems. Non-sparse filesystems already get
> this via ocfs2_extend_no_holes().
>
> Signed-off-by: Joel Becker<joel.becker@...cle.com>
> ---
> fs/ocfs2/aops.c | 22 ++++----
> fs/ocfs2/file.c | 154 +++++++++++++++++++++++++++++++++++++++++++++++++------
> fs/ocfs2/file.h | 2 +
> 3 files changed, 150 insertions(+), 28 deletions(-)
>
<snip>
> diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c
> index 6a13ea6..7fca78d 100644
> --- a/fs/ocfs2/file.c
> +++ b/fs/ocfs2/file.c
> @@ -848,6 +848,137 @@ out:
> return ret;
> }
>
> +/*
> + * This function is a helper for ocfs2_zero_tail(). It calculates
> + * what blocks need zeroing and does any CoW necessary.
> + */
> +static int ocfs2_zero_tail_prepare(struct inode *inode,
> + struct buffer_head *di_bh,
> + loff_t pos, u64 *start_blkno,
> + u64 *blocks)
> +{
> + int rc = 0;
> + struct ocfs2_super *osb = OCFS2_SB(inode->i_sb);
> + u32 tail_cpos, pos_cpos, p_cpos;
> + u64 tail_blkno, pos_blkno, blocks_to_zero;
> + unsigned int num_clusters = 0;
> + unsigned int ext_flags = 0;
> +
> + /*
> + * The block containing i_size has already been zeroed, so our tail
> + * block is the first block after i_size. The block containing
> + * pos will be zeroed. So we only need to do anything if
> + * tail_blkno is before pos_blkno.
> + */
> + tail_blkno = (i_size_read(inode)>> inode->i_sb->s_blocksize_bits) + 1;
> + pos_blkno = pos>> inode->i_sb->s_blocksize_bits;
> + mlog(0, "tail_blkno = %llu, pos_blkno = %llu\n",
> + (unsigned long long)tail_blkno, (unsigned long long)pos_blkno);
> + if (pos_blkno<= tail_blkno)
> + goto out;
> + blocks_to_zero = pos_blkno - tail_blkno;
> +
> + /*
> + * If tail_blkno is in the cluster past i_size, we don't need
> + * to touch the cluster containing i_size at all.
> + */
> + tail_cpos = i_size_read(inode)>> osb->s_clustersize_bits;
> + if (ocfs2_blocks_to_clusters(inode->i_sb, tail_blkno)> tail_cpos)
> + tail_cpos = ocfs2_blocks_to_clusters(inode->i_sb,
> + tail_blkno);
Can we always set tail_cpos in one line?
tail_cpos = ocfs2_blocks_to_clusters(inode->i_sb, tail_blkno)?
tail_cpos is either the same cluster as i_size or the next cluster and
both works for tail_blkno I guess?
> +
> + rc = ocfs2_get_clusters(inode, tail_cpos,&p_cpos,&num_clusters,
> + &ext_flags);
> + if (rc) {
> + mlog_errno(rc);
> + goto out;
> + }
> +
> + /* Is there a cluster to zero? */
> + if (!p_cpos)
> + goto out;
For unwritten extent, we also need to clear the pages? If yes, the
solution doesn't complete if we have 2 unwritten extent, one contains
i_size while one passes i_size. Here we only clear the pages for the 1st
unwritten extent and leave the 2nd one untouched.
> +
> + pos_cpos = pos>> osb->s_clustersize_bits;
> + mlog(0, "tail_cpos = %u, num_clusters = %u, pos_cpos = %u, tail_blkno = %llu, pos_blkno = %llu\n",
> + (unsigned int)tail_cpos, (unsigned int)num_clusters,
> + (unsigned int)pos_cpos, (unsigned long long)tail_blkno,
> + (unsigned long long)pos_blkno);
From here to the call of CoW is a bit hard to understand. In 'if',
num_clusters is set for CoW and in 'else', blocks_to_zero is set. So it
isn't easy for the reader to tell why these 2 clauses are setting
different values. So how about my code below? It looks more
straightforward I think.
> + if ((tail_cpos + num_clusters)> pos_cpos) {
> + num_clusters = pos_cpos - tail_cpos;
> + if (pos_blkno>
> + ocfs2_clusters_to_blocks(inode->i_sb, pos_cpos))
> + num_clusters += 1;
> + } else {
> + blocks_to_zero =
> + ocfs2_clusters_to_blocks(inode->i_sb,
> + tail_cpos + num_clusters);
> + blocks_to_zero -= tail_blkno;
> + }
> +
> + /* Now CoW the clusters we're about to zero */
> + if (ext_flags& OCFS2_EXT_REFCOUNTED) {
> + rc = ocfs2_refcount_cow(inode, di_bh, tail_cpos,
> + num_clusters, UINT_MAX);
> + if (rc) {
> + mlog_errno(rc);
> + goto out;
> + }
> + }
/* Decrease blocks_to_zero if there is some hole after extent */
if (tail_cpos + num_clusters <= pos_cpos) {
blocks_to_zero =
ocfs2_clusters_to_blocks(inode->i_sb,
tail_cpos + num_clusters);
blocks_to_zero -= tail_blkno;
}
/* Now CoW if we have some refcounted clusters. */
if (ext_flags & OCFS2_EXT_REFCOUNTED) {
/*
* We add one more cluster here since it will be
* written shortly and if the pos_blkno isn't aligned
* to the cluster size, we have to zero the blocks
* before it.
*/
if (tail_cpos + num_clusters > pos_cpos)
num_clusters = pos_cpos - tail_cpos + 1;
rc = ocfs2_refcount_cow(inode, di_bh, tail_cpos,
num_clusters, UINT_MAX);
if (rc) {
mlog_errno(rc);
goto out;
}
}
> +
> + *start_blkno = tail_blkno;
> + *blocks = blocks_to_zero;
> + mlog(0, "start_blkno = %llu, blocks = %llu\n",
> + (unsigned long long)(*start_blkno),
> + (unsigned long long)(*blocks));
> +
> +out:
> + return rc;
> +}
Regards,
Tao
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Powered by blists - more mailing lists