[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-Id: <20091202105354.f7137c98.rdunlap@xenotime.net>
Date: Wed, 2 Dec 2009 10:53:54 -0800
From: Randy Dunlap <rdunlap@...otime.net>
To: Andreas Dilger <adilger@....com>, ltp-list@...ts.sourceforge.net
Cc: Jan Kara <jack@...e.cz>, Mike Galbraith <gleep@....de>,
James Y Knight <foom@...m.net>,
ext4 development <linux-ext4@...r.kernel.org>, npiggin@...e.de
Subject: Re: writev data loss bug in (at least) 2.6.31 and 2.6.32pre8 x86-64
On Tue, 01 Dec 2009 15:02:04 -0700 Andreas Dilger wrote:
> On 2009-12-01, at 09:03, Jan Kara wrote:
> > On Tue 01-12-09 15:35:59, Jan Kara wrote:
> >> On Tue 01-12-09 12:42:45, Mike Galbraith wrote:
> >>> On Mon, 2009-11-30 at 19:48 -0500, James Y Knight wrote:
> >>>> On Nov 30, 2009, at 3:55 PM, James Y Knight wrote:
> >>>>> This test case is distilled from an actual application which
> >>>>> doesn't even intentionally use writev: it just uses C++'s
> >>>>> ofstream class to write data to a file. Unfortunately, that
> >>>>> class smart and uses writev under the covers. Unfortunately, I
> >>>>> guess nobody ever tests linux writev behavior, since it's broken
> >>>>> _so_much_of_the_time_. I really am quite astounded to see such a
> >>>>> bad track record for such a fundamental core system call....
>
> I suspect an excellent way of exposing problems with the writev()
> interface would be to wire it into fsx, which is commonly run as a
> stress test for Linux. I don't know if it would have caught this
> case, but it definitely couldn't hurt to get more testing cycles for it.
Maybe someone from LTP would be interested in adding this test functionality
to fsx-linux ?
Source/test program is available at
http://marc.info/?l=linux-kernel&m=125961612418323&w=2
> >> Ext4 also has this problem but delayed allocation mitigates the
> >> effect to an error in accounting of blocks reserved for delayed
> >> allocation and thus under normal circumstances nothing bad happens.
>
> It looks like ext4 might still hit this problem, if delalloc is
> disabled. Could you please submit a similar patch for ext4 also.
>
> > The patch below fixes the issue for me...
> >
> > Honza
> >
> > From 1b2ad411dd86afbfdb3c5b0f913230e9f1f0b858 Mon Sep 17 00:00:00 2001
> > From: Jan Kara <jack@...e.cz>
> > Date: Tue, 1 Dec 2009 16:53:06 +0100
> > Subject: [PATCH] ext3: Fix data / filesystem corruption when write
> > fails to copy data
> >
> > When ext3_write_begin fails after allocating some blocks or
> > generic_perform_write fails to copy data to write, we truncate
> > blocks already instantiated beyond i_size. Although these blocks
> > were never inside i_size, we have to truncate pagecache of these
> > blocks so that corresponding buffers get unmapped. Otherwise
> > subsequent __block_prepare_write (called because we are retrying the
> > write) will find the buffers mapped, not call ->get_block, and thus
> > the page will be backed by already freed blocks leading to
> > filesystem and data corruption.
> >
> > Signed-off-by: Jan Kara <jack@...e.cz>
> > ---
> > fs/ext3/inode.c | 18 ++++++++++++++----
> > 1 files changed, 14 insertions(+), 4 deletions(-)
> >
> > diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c
> > index 354ed3b..f9d6937 100644
> > --- a/fs/ext3/inode.c
> > +++ b/fs/ext3/inode.c
> > @@ -1151,6 +1151,16 @@ static int
> > do_journal_get_write_access(handle_t *handle,
> > return ext3_journal_get_write_access(handle, bh);
> > }
> >
> > +/*
> > + * Truncate blocks that were not used by write. We have to truncate
> > the
> > + * pagecache as well so that corresponding buffers get properly
> > unmapped.
> > + */
> > +static void ext3_truncate_failed_write(struct inode *inode)
> > +{
> > + truncate_inode_pages(inode->i_mapping, inode->i_size);
> > + ext3_truncate(inode);
> > +}
> > +
> > static int ext3_write_begin(struct file *file, struct address_space
> > *mapping,
> > loff_t pos, unsigned len, unsigned flags,
> > struct page **pagep, void **fsdata)
> > @@ -1209,7 +1219,7 @@ write_begin_failed:
> > unlock_page(page);
> > page_cache_release(page);
> > if (pos + len > inode->i_size)
> > - ext3_truncate(inode);
> > + ext3_truncate_failed_write(inode);
> > }
> > if (ret == -ENOSPC && ext3_should_retry_alloc(inode->i_sb, &retries))
> > goto retry;
> > @@ -1304,7 +1314,7 @@ static int ext3_ordered_write_end(struct file
> > *file,
> > page_cache_release(page);
> >
> > if (pos + len > inode->i_size)
> > - ext3_truncate(inode);
> > + ext3_truncate_failed_write(inode);
> > return ret ? ret : copied;
> > }
> >
> > @@ -1330,7 +1340,7 @@ static int ext3_writeback_write_end(struct
> > file *file,
> > page_cache_release(page);
> >
> > if (pos + len > inode->i_size)
> > - ext3_truncate(inode);
> > + ext3_truncate_failed_write(inode);
> > return ret ? ret : copied;
> > }
> >
> > @@ -1383,7 +1393,7 @@ static int ext3_journalled_write_end(struct
> > file *file,
> > page_cache_release(page);
> >
> > if (pos + len > inode->i_size)
> > - ext3_truncate(inode);
> > + ext3_truncate_failed_write(inode);
> > return ret ? ret : copied;
> > }
> >
> > --
>
> Cheers, Andreas
> --
> Andreas Dilger
> Sr. Staff Engineer, Lustre Group
> Sun Microsystems of Canada, Inc.
---
~Randy
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Powered by blists - more mailing lists