linux-ext4 - Re: [PATCH] ext4: fix unsafe extent initialization

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <38cb83c5-600b-c523-8006-5aab0a3eb9b0@huawei.com>
Date:   Mon, 10 Dec 2018 17:26:07 +0800
From:   "zhangyi (F)" <yi.zhang@...wei.com>
To:     "Theodore Y. Ts'o" <tytso@....edu>
CC:     <linux-ext4@...r.kernel.org>, <miaoxie@...wei.com>
Subject: Re: [PATCH] ext4: fix unsafe extent initialization

On 2018/12/10 13:10, Theodore Y. Ts'o Wrote:
> On Sat, Dec 08, 2018 at 10:25:58PM +0800, zhangyi (F) wrote:
>> Current ext4 will call ext4_ext_convert_to_initialized() to split and
>> initialize an unwritten extent if someone write something to it. It may
>> also zeroout the nearby blocks and expand the split extent if the
>> allocated extent is fully inside i_size or new_size. But it may lead to
>> inode inconsistency when system crash or the power fails.
>>
>> Consider the following case:
>>  - Create an empty file and buffer write from block A to D (with delay
>>    allocate). It will update the i_size to D.
>>  - Zero range from part of block B to D. It will allocate an unwritten
>>    extent from B to D.
>>  - The write back worker write block B and initialize the unwritten
>>    extent from B to D, and then update the i_disksize to B.
>>  - System crash.
>>  - Remount and fsck complain about the extent size exceeds the inode
>>    size.
>>
>> This patch add checking i_disksize and chose the small one between
>> i_size to make sure it's safe to convert extent to initialized.
>>
>> ---------------------
>>
>> This problem can reproduce by xfstests generic/482 with fsstress seed
>> 1544025012.
> 
> Hmm, your explanation is great and the patch makes sense.  I haven't
> been able to reproduce the problem by adding -s 1544025012 to the
> fsstress arguments.  This may be because fsstress being run with two
> processes (-p 2) and the failure may be timing dependent?
> 
Yes, it is timing dependent and not quite easy to make a simple fast reproducer.
The default parameter of fsstress (tested by generic/482) on my machine is:

./ltp/fsstress -w -d /mnt/scratch -n 512 -p 8 -s 1544025012 -v

> How easily can you replicate the problem?

It is about 2~5% probability to replicate this problem under generic/482 on my
machine, and it can also appear in generic/019 and generic/455.

After reproducing and checking this problem again, I think the above explanation
is not accurate. I simplify the running process in generic/482 and the real case
could be:

 - Create an empty file and buffer write from block A to D (with delay allocate).
 - Buffer write from X to Z, now the i_size of this inode is updated to Z.
 - Zero range from part of block B to D, it will allocate an unwritten extent
   from B to D. Note that it also will skip disksize update in
   ext4_zero_range() -> ext4_update_disksize_before_punch() because the i_size
   is large than the end of this zero range.
 - The write back kworker write block B and initialize the whole unwritten
   extent from B to D, and then update the i_disksize to the end of B.
 - ext4_journal_stop()
 - kjournald2 process weakup and call jbd2_journal_commit_transaction() to commit
   journal and send FUA.
 - System crash.
 - System reboot and fsck complain about the extent size exceeds the inode size.

Thanks,
Yi.