lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date: Fri, 26 Apr 2024 18:27:30 +0530
From: Ritesh Harjani (IBM) <ritesh.list@...il.com>
To: Zhang Yi <yi.zhang@...weicloud.com>, linux-ext4@...r.kernel.org
Cc: linux-fsdevel@...r.kernel.org, linux-mm@...ck.org, linux-kernel@...r.kernel.org, tytso@....edu, adilger.kernel@...ger.ca, jack@...e.cz, hch@...radead.org, djwong@...nel.org, david@...morbit.com, willy@...radead.org, zokeefe@...gle.com, yi.zhang@...wei.com, yi.zhang@...weicloud.com, chengzhihao1@...wei.com, yukuai3@...wei.com, wangkefeng.wang@...wei.com
Subject: Re: [PATCH v4 02/34] ext4: check the extent status again before inserting delalloc block

Ritesh Harjani (IBM) <ritesh.list@...il.com> writes:

> Zhang Yi <yi.zhang@...weicloud.com> writes:
>
>> From: Zhang Yi <yi.zhang@...wei.com>
>>
>> Now we lookup extent status entry without holding the i_data_sem before
>> inserting delalloc block, it works fine in buffered write path and
>> because it holds i_rwsem and folio lock, and the mmap path holds folio
>> lock, so the found extent locklessly couldn't be modified concurrently.
>> But it could be raced by fallocate since it allocate block whitout
>> holding i_rwsem and folio lock.
>>
>> ext4_page_mkwrite()             ext4_fallocate()
>>  block_page_mkwrite()
>>   ext4_da_map_blocks()
>>    //find hole in extent status tree
>>                                  ext4_alloc_file_blocks()
>>                                   ext4_map_blocks()
>>                                    //allocate block and unwritten extent
>>    ext4_insert_delayed_block()
>>     ext4_da_reserve_space()
>>      //reserve one more block
>>     ext4_es_insert_delayed_block()
>>      //drop unwritten extent and add delayed extent by mistake
>>
>> Then, the delalloc extent is wrong until writeback, the one more
>> reserved block can't be release any more and trigger below warning:
>>
>>  EXT4-fs (pmem2): Inode 13 (00000000bbbd4d23): i_reserved_data_blocks(1) not cleared!
>>
>> Hold i_data_sem in write mode directly can fix the problem, but it's
>> expansive, we should keep the lockless check and check the extent again
>> once we need to add an new delalloc block.
>
> Hi Zhang, 
>
> It's a nice finding. I was wondering if this was caught in any of the
> xfstests?
>
> I have reworded some of the commit message, feel free to use it if you
> think this version is better. The use of which path uses which locks was
> a bit confusing in the original commit message.
>
> <reworded from your original commit msg>
>
> ext4_da_map_blocks(), first looks up the extent status tree for any
> extent entry with i_data_sem held in read mode. It then unlocks
> i_data_sem, if it can't find an entry and take this lock in write
> mode for inserting a new da entry.

Sorry about this above paragraph. I messed this paragraph.
Here is the correct version of this.

ext4_da_map_blocks looks up for any extent entry in the extent status
tree (w/o i_data_sem) and then the looks up for any ondisk extent
mapping (with i_data_sem in read mode).

If it finds a hole in the extent status tree or if it couldn't find any
entry at all, it then takes the i_data_sem in write mode to add a da entry
into the extent status tree. This can actually race with page mkwrite
& fallocate path. 

Note that this is ok between...  <and the rest can remain same>

>
> This is ok between -
> 1. ext4 buffered-write path v/s ext4_page_mkwrite(), because of the
> folio lock
> 2. ext4 buffered write path v/s ext4 fallocate because of the inode
> lock.
>


> But this can race between ext4_page_mkwrite() & ext4 fallocate path - 
>
>  ext4_page_mkwrite()             ext4_fallocate()
>   block_page_mkwrite()
>    ext4_da_map_blocks()
>     //find hole in extent status tree
>                                   ext4_alloc_file_blocks()
>                                    ext4_map_blocks()
>                                     //allocate block and unwritten extent
>     ext4_insert_delayed_block()
>      ext4_da_reserve_space()
>       //reserve one more block
>      ext4_es_insert_delayed_block()
>       //drop unwritten extent and add delayed extent by mistake
>
> Then, the delalloc extent is wrong until writeback and the extra
> reserved block can't be released any more and it triggers below warning:
>
>   EXT4-fs (pmem2): Inode 13 (00000000bbbd4d23): i_reserved_data_blocks(1) not cleared!
>
> This patch fixes the problem by looking up extent status tree again
> while the i_data_sem is held in write mode. If it still can't find
> any entry, then we insert a new da entry into the extent status tree.
>
>>
>> Cc: stable@...r.kernel.org
>> Signed-off-by: Zhang Yi <yi.zhang@...wei.com>
>> ---
>>  fs/ext4/inode.c | 19 +++++++++++++++++++
>>  1 file changed, 19 insertions(+)
>>
>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
>> index 6a41172c06e1..118b0497a954 100644
>> --- a/fs/ext4/inode.c
>> +++ b/fs/ext4/inode.c
>> @@ -1737,6 +1737,7 @@ static int ext4_da_map_blocks(struct inode *inode, sector_t iblock,
>>  		if (ext4_es_is_hole(&es))
>>  			goto add_delayed;
>>  
>> +found:
>>  		/*
>>  		 * Delayed extent could be allocated by fallocate.
>>  		 * So we need to check it.
>> @@ -1781,6 +1782,24 @@ static int ext4_da_map_blocks(struct inode *inode, sector_t iblock,
>>  
>>  add_delayed:
>>  	down_write(&EXT4_I(inode)->i_data_sem);
>> +	/*
>> +	 * Lookup extents tree again under i_data_sem, make sure this
>> +	 * inserting delalloc range haven't been delayed or allocated
>> +	 * whitout holding i_rwsem and folio lock.
>> +	 */
>
> page fault path (ext4_page_mkwrite does not take i_rwsem) and fallocate
> path (no folio lock) can race. Make sure we lookup the extent status
> tree here again while i_data_sem is held in write mode, before inserting
> a new da entry in the extent status tree.
>
>


-ritesh

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ