linux-ext4 - Re: writev data loss bug in (at least) 2.6.31 and 2.6.32pre8 x86-64

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-id: <7A65EEA3-78FD-49C3-B683-265B93F8AC96@sun.com>
Date:	Tue, 01 Dec 2009 15:02:04 -0700
From:	Andreas Dilger <adilger@....com>
To:	Jan Kara <jack@...e.cz>
Cc:	Mike Galbraith <gleep@....de>, James Y Knight <foom@...m.net>,
	ext4 development <linux-ext4@...r.kernel.org>, npiggin@...e.de
Subject: Re: writev data loss bug in (at least) 2.6.31 and 2.6.32pre8 x86-64

On 2009-12-01, at 09:03, Jan Kara wrote:
> On Tue 01-12-09 15:35:59, Jan Kara wrote:
>> On Tue 01-12-09 12:42:45, Mike Galbraith wrote:
>>> On Mon, 2009-11-30 at 19:48 -0500, James Y Knight wrote:
>>>> On Nov 30, 2009, at 3:55 PM, James Y Knight wrote:
>>>>> This test case is distilled from an actual application which  
>>>>> doesn't even intentionally use writev: it just uses C++'s  
>>>>> ofstream class to write data to a file. Unfortunately, that  
>>>>> class smart and uses writev under the covers. Unfortunately, I  
>>>>> guess nobody ever tests linux writev behavior, since it's broken  
>>>>> _so_much_of_the_time_. I really am quite astounded to see such a  
>>>>> bad track record for such a fundamental core system call....

I suspect an excellent way of exposing problems with the writev()  
interface would be to wire it into fsx, which is commonly run as a  
stress test for Linux.  I don't know if it would have caught this  
case, but it definitely couldn't hurt to get more testing cycles for it.

>>  Ext4 also has this problem but delayed allocation mitigates the  
>> effect to an error in accounting of blocks reserved for delayed  
>> allocation and thus under normal circumstances nothing bad happens.

It looks like ext4 might still hit this problem, if delalloc is  
disabled.  Could you please submit a similar patch for ext4 also.

>  The patch below fixes the issue for me...
>
> 									Honza
>
> From 1b2ad411dd86afbfdb3c5b0f913230e9f1f0b858 Mon Sep 17 00:00:00 2001
> From: Jan Kara <jack@...e.cz>
> Date: Tue, 1 Dec 2009 16:53:06 +0100
> Subject: [PATCH] ext3: Fix data / filesystem corruption when write  
> fails to copy data
>
> When ext3_write_begin fails after allocating some blocks or  
> generic_perform_write fails to copy data to write, we truncate  
> blocks already instantiated beyond i_size. Although these blocks  
> were never inside i_size, we have to truncate pagecache of these  
> blocks so that corresponding buffers get unmapped. Otherwise  
> subsequent __block_prepare_write (called because we are retrying the  
> write) will find the buffers mapped, not call ->get_block, and thus  
> the page will be backed by already freed blocks leading to  
> filesystem and data corruption.
>
> Signed-off-by: Jan Kara <jack@...e.cz>
> ---
> fs/ext3/inode.c |   18 ++++++++++++++----
> 1 files changed, 14 insertions(+), 4 deletions(-)
>
> diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c
> index 354ed3b..f9d6937 100644
> --- a/fs/ext3/inode.c
> +++ b/fs/ext3/inode.c
> @@ -1151,6 +1151,16 @@ static int  
> do_journal_get_write_access(handle_t *handle,
> 	return ext3_journal_get_write_access(handle, bh);
> }
>
> +/*
> + * Truncate blocks that were not used by write. We have to truncate  
> the
> + * pagecache as well so that corresponding buffers get properly  
> unmapped.
> + */
> +static void ext3_truncate_failed_write(struct inode *inode)
> +{
> +	truncate_inode_pages(inode->i_mapping, inode->i_size);
> +	ext3_truncate(inode);
> +}
> +
> static int ext3_write_begin(struct file *file, struct address_space  
> *mapping,
> 				loff_t pos, unsigned len, unsigned flags,
> 				struct page **pagep, void **fsdata)
> @@ -1209,7 +1219,7 @@ write_begin_failed:
> 		unlock_page(page);
> 		page_cache_release(page);
> 		if (pos + len > inode->i_size)
> -			ext3_truncate(inode);
> +			ext3_truncate_failed_write(inode);
> 	}
> 	if (ret == -ENOSPC && ext3_should_retry_alloc(inode->i_sb, &retries))
> 		goto retry;
> @@ -1304,7 +1314,7 @@ static int ext3_ordered_write_end(struct file  
> *file,
> 	page_cache_release(page);
>
> 	if (pos + len > inode->i_size)
> -		ext3_truncate(inode);
> +		ext3_truncate_failed_write(inode);
> 	return ret ? ret : copied;
> }
>
> @@ -1330,7 +1340,7 @@ static int ext3_writeback_write_end(struct  
> file *file,
> 	page_cache_release(page);
>
> 	if (pos + len > inode->i_size)
> -		ext3_truncate(inode);
> +		ext3_truncate_failed_write(inode);
> 	return ret ? ret : copied;
> }
>
> @@ -1383,7 +1393,7 @@ static int ext3_journalled_write_end(struct  
> file *file,
> 	page_cache_release(page);
>
> 	if (pos + len > inode->i_size)
> -		ext3_truncate(inode);
> +		ext3_truncate_failed_write(inode);
> 	return ret ? ret : copied;
> }
>
> -- 
> 1.6.4.2
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux- 
> ext4" in
> the body of a message to majordomo@...r.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html