Message-ID: <CAOQ4uxjef-LrZvJkhw=2HvUN6UGtteW30gNUi2yU3LPP_oQhzw@mail.gmail.com>
Date: Sat, 23 Jul 2011 16:21:55 +0300
From: Amir Goldstein <amir73il@...il.com>
To: Jan Kara <jack@...e.cz>
Cc: Ted Tso <tytso@....edu>, linux-ext4@...r.kernel.org
Subject: Re: [PATCH] ext4: Fix data corruption in inodes with journalled data
On Sat, Jul 23, 2011 at 3:39 AM, Jan Kara <jack@...e.cz> wrote:
> When journalling data for an inode (either because it is a symlink or
> because the filesystem is mounted in data=journal mode), ext4_evict_inode()
> can discard unwritten data by calling truncate_inode_pages(). This is
> because we don't mark the buffer / page dirty when journalling data but only
> add the buffer to the running transaction and thus mm does not know there
> are still unwritten data.
>
> Fix the problem by carefully tracking transaction containing inode's data,
> committing this transaction, and writing uncheckpointed buffers when inode
> should be reaped.
>
> Signed-off-by: Jan Kara <jack@...e.cz>
> ---
> fs/ext4/inode.c | 29 +++++++++++++++++++++++++++++
> 1 files changed, 29 insertions(+), 0 deletions(-)
>
> This is ext4 version of an ext3 fix I sent a while ago. It received only
> light testing but I figured you might want to get the patch earlier rather than
> later given the merge window is open.
>
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index e3126c0..019995b 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -190,6 +190,33 @@ void ext4_evict_inode(struct inode *inode)
>
>  	trace_ext4_evict_inode(inode);
>  	if (inode->i_nlink) {
> +		/*
> +		 * When journalling data dirty buffers are tracked only in the
> +		 * journal. So although mm thinks everything is clean and
> +		 * ready for reaping the inode might still have some pages to
> +		 * write in the running transaction or waiting to be
> +		 * checkpointed. Thus calling jbd2_journal_invalidatepage()
> +		 * (via truncate_inode_pages()) to discard these buffers can
> +		 * cause data loss. Also even if we did not discard these
> +		 * buffers, we would have no way to find them after the inode
> +		 * is reaped and thus user could see stale data if he tries to
> +		 * read them before the transaction is checkpointed. So be
> +		 * careful and force everything to disk here... We use
> +		 * ei->i_datasync_tid to store the newest transaction
> +		 * containing inode's data.
> +		 *
> +		 * Note that directories do not have this problem because they
> +		 * don't use page cache.
> +		 */
> +		if (ext4_should_journal_data(inode) &&
> +		    (S_ISLNK(inode->i_mode) || S_ISREG(inode->i_mode))) {
> +			journal_t *journal = EXT4_SB(inode->i_sb)->s_journal;
> +			tid_t commit_tid = EXT4_I(inode)->i_datasync_tid;
> +
> +			jbd2_log_start_commit(journal, commit_tid);
> +			jbd2_log_wait_commit(journal, commit_tid);
> +			filemap_write_and_wait(&inode->i_data);
> +		}
>  		truncate_inode_pages(&inode->i_data, 0);
>  		goto no_delete;
>  	}
> @@ -1863,6 +1890,7 @@ static int ext4_journalled_write_end(struct file *file,
>  	if (new_i_size > inode->i_size)
>  		i_size_write(inode, pos+copied);
>  	ext4_set_inode_state(inode, EXT4_STATE_JDATA);
> +	EXT4_I(inode)->i_datasync_tid = handle->h_transaction->t_tid;
>  	if (new_i_size > EXT4_I(inode)->i_disksize) {
>  		ext4_update_i_disksize(inode, new_i_size);
>  		ret2 = ext4_mark_inode_dirty(handle, inode);
> @@ -2571,6 +2599,7 @@ static int __ext4_journalled_writepage(struct page *page,
>  				write_end_fn);
>  	if (ret == 0)
>  		ret = err;
> +	EXT4_I(inode)->i_datasync_tid = handle->h_transaction->t_tid;
>  	err = ext4_journal_stop(handle);
>  	if (!ret)
>  		ret = err;
> --
> 1.7.1
>
Hi Jan,
The patch looks correct to me, but I am uncomfortable with i_datasync_tid
being treated differently in the journalled write path, that is, being set
in different places depending on which write path is taken.
How about setting i_datasync_tid in a more generic place, like
ext4_{,da_}write_begin()?
I know it is somewhat redundant with marking pages dirty, but at least this
way i_datasync_tid can be checked in all journal modes and has a consistent
meaning.
Perhaps we could even use i_datasync_tid to optimize away things like the
fiemap checks for dirty pages.
Just a thought.
Amir.
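To make the suggestion above concrete, here is a minimal user-space sketch of the idea. It is not kernel code and is not taken from the patch: every toy_* name is hypothetical, the journal is reduced to two counters, and tid wraparound (which the real jbd2 comparison helpers handle) is ignored. It only illustrates recording the running transaction ID in one common write hook and using it at eviction time to decide whether a commit must be forced.

```c
#include <assert.h>
#include <stdbool.h>

typedef unsigned int tid_t;

/* Toy stand-in for the journal; NOT the real journal_t. */
struct toy_journal {
	tid_t running_tid;	/* transaction currently accepting updates */
	tid_t committed_tid;	/* newest fully committed transaction */
};

/* Toy stand-in for the in-memory inode; NOT struct ext4_inode_info. */
struct toy_inode {
	bool journal_data;	/* models ext4_should_journal_data() */
	tid_t datasync_tid;	/* models ei->i_datasync_tid */
};

/*
 * Hypothetical common hook: every write path records the running
 * transaction here, so datasync_tid means the same thing in all modes.
 */
static void toy_write_begin(const struct toy_journal *j, struct toy_inode *ino)
{
	ino->datasync_tid = j->running_tid;
}

/* True if the newest transaction holding this inode's data has committed. */
static bool toy_data_is_stable(const struct toy_journal *j,
			       const struct toy_inode *ino)
{
	/* Assumption: no tid wraparound in this toy model. */
	return ino->datasync_tid <= j->committed_tid;
}

/* Models start-commit followed by wait-for-commit on a single tid. */
static void toy_commit_up_to(struct toy_journal *j, tid_t tid)
{
	if (tid > j->committed_tid)
		j->committed_tid = tid;
	if (j->running_tid <= tid)
		j->running_tid = tid + 1;
}

/*
 * Models the eviction check: force a commit only when journalled data
 * may still sit in an uncommitted transaction. Returns true if a commit
 * was forced.
 */
static bool toy_evict(struct toy_journal *j, struct toy_inode *ino)
{
	if (ino->journal_data && !toy_data_is_stable(j, ino)) {
		toy_commit_up_to(j, ino->datasync_tid);
		return true;
	}
	return false;
}

/* Small scenario: write, evict (commit forced), evict again (no commit). */
int toy_demo(void)
{
	struct toy_journal j = { .running_tid = 5, .committed_tid = 4 };
	struct toy_inode ino = { .journal_data = true, .datasync_tid = 0 };

	toy_write_begin(&j, &ino);
	assert(!toy_data_is_stable(&j, &ino));
	assert(toy_evict(&j, &ino));
	assert(toy_data_is_stable(&j, &ino));
	assert(!toy_evict(&j, &ino));
	return 0;
}
```

With a single place that records the tid, a check like toy_data_is_stable() could in principle be reused by other callers, which is the kind of reuse the fiemap remark hints at. Whether that holds in the real write paths is an open question this sketch does not answer.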