[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <3243c67d-e783-4ec5-998f-0b6170f36e35@huaweicloud.com>
Date: Sun, 28 Apr 2024 11:00:36 +0800
From: Zhang Yi <yi.zhang@...weicloud.com>
To: "Ritesh Harjani (IBM)" <ritesh.list@...il.com>, linux-ext4@...r.kernel.org
Cc: linux-fsdevel@...r.kernel.org, linux-mm@...ck.org,
linux-kernel@...r.kernel.org, tytso@....edu, adilger.kernel@...ger.ca,
jack@...e.cz, hch@...radead.org, djwong@...nel.org, david@...morbit.com,
willy@...radead.org, zokeefe@...gle.com, yi.zhang@...wei.com,
chengzhihao1@...wei.com, yukuai3@...wei.com, wangkefeng.wang@...wei.com
Subject: Re: [PATCH v4 02/34] ext4: check the extent status again before
inserting delalloc block
On 2024/4/27 0:39, Ritesh Harjani (IBM) wrote:
> Zhang Yi <yi.zhang@...weicloud.com> writes:
>
>> On 2024/4/26 20:57, Ritesh Harjani (IBM) wrote:
>>> Ritesh Harjani (IBM) <ritesh.list@...il.com> writes:
>>>
>>>> Zhang Yi <yi.zhang@...weicloud.com> writes:
>>>>
>>>>> From: Zhang Yi <yi.zhang@...wei.com>
>>>>>
>>>>> Now we lookup extent status entry without holding the i_data_sem before
>>>>> inserting delalloc block, it works fine in buffered write path and
>>>>> because it holds i_rwsem and folio lock, and the mmap path holds folio
>>>>> lock, so the found extent locklessly couldn't be modified concurrently.
>>>>> But it could be raced by fallocate since it allocate block whitout
>>>>> holding i_rwsem and folio lock.
>>>>>
>>>>> ext4_page_mkwrite() ext4_fallocate()
>>>>> block_page_mkwrite()
>>>>> ext4_da_map_blocks()
>>>>> //find hole in extent status tree
>>>>> ext4_alloc_file_blocks()
>>>>> ext4_map_blocks()
>>>>> //allocate block and unwritten extent
>>>>> ext4_insert_delayed_block()
>>>>> ext4_da_reserve_space()
>>>>> //reserve one more block
>>>>> ext4_es_insert_delayed_block()
>>>>> //drop unwritten extent and add delayed extent by mistake
>>>>>
>>>>> Then, the delalloc extent is wrong until writeback, the one more
>>>>> reserved block can't be release any more and trigger below warning:
>>>>>
>>>>> EXT4-fs (pmem2): Inode 13 (00000000bbbd4d23): i_reserved_data_blocks(1) not cleared!
>>>>>
>>>>> Hold i_data_sem in write mode directly can fix the problem, but it's
>>>>> expansive, we should keep the lockless check and check the extent again
>>>>> once we need to add an new delalloc block.
>>>>
>>>> Hi Zhang,
>>>>
>>>> It's a nice finding. I was wondering if this was caught in any of the
>>>> xfstests?
>>>>
>>
>> Hi, Ritesh
>>
>> I caught this issue when I tested my iomap series in generic/344 and
>> generic/346. It's easy to reproduce because the iomap's buffered write path
>> doesn't hold folio lock while inserting delalloc blocks, so it could be raced
>> by the mmap page fault path. But the buffer_head's buffered write path can't
>> trigger this problem,
>
> ya right! That's the difference between how ->map_blocks() is called
> between buffer_head v/s iomap path. In iomap the ->map_blocks() call
> happens first to map a large extent and then it iterate over all the
> locked folios covering the mapped extent for doing writes.
> Whereas in buffer_head while iterating, we first instantiate/lock the
> folio and then call ->map_blocks() to map an extent for the given folio.
>
> ... So this opens up this window for a race between iomap buffered write
> path v/s page mkwrite path for inserting delalloc blocks entries.
>
>> the race between buffered write path and fallocate path
>> was discovered while I was analyzing the code, so I'm not sure if it could
>> be caught by xfstests now, at least I haven't noticed this problem so far.
>>
>
> Did you mean the race between page fault path and fallocate path here?
> Because buffered write path and fallocate path should not have any race
> since both takes the inode_lock. I guess you meant page fault path and
> fallocate path for which you wrote this patch too :)
Yep.
>
> I am surprised, why we cannot see the this race between page mkwrite and
> fallocate in fstests for inserting da entries to extent status cache.
> Because the race you identified looks like a legitimate race and is
> mostly happening since ext4_da_map_blocks() was not doing the right
> thing.
> ... looking at the src/holetest, it doesn't really excercise this path.
> So maybe we can writing such fstest to trigger this race.
>
I guess the stress tests and smoke tests in fstests have caught it,
e.g. generic/476. Since there is only one error message in ext4_destroy_inode()
when the race issue happened, we can't detect it unless we go and check the logs
manually.
I suppose we need to add more warnings, something like this, how does it sound?
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index c8b691e605f1..4b6fd9b63b12 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1255,6 +1255,8 @@ static void ext4_percpu_param_destroy(struct ext4_sb_info *sbi)
percpu_counter_destroy(&sbi->s_freeclusters_counter);
percpu_counter_destroy(&sbi->s_freeinodes_counter);
percpu_counter_destroy(&sbi->s_dirs_counter);
+ WARN_ON_ONCE(!ext4_forced_shutdown(sbi->s_sb) &&
+ percpu_counter_sum(&sbi->s_dirtyclusters_counter));
percpu_counter_destroy(&sbi->s_dirtyclusters_counter);
percpu_counter_destroy(&sbi->s_sra_exceeded_retry_limit);
percpu_free_rwsem(&sbi->s_writepages_rwsem);
@@ -1476,7 +1478,8 @@ static void ext4_destroy_inode(struct inode *inode)
dump_stack();
}
- if (EXT4_I(inode)->i_reserved_data_blocks)
+ if (!ext4_forced_shutdown(inode->i_sb) &&
+ WARN_ON_ONCE(EXT4_I(inode)->i_reserved_data_blocks))
ext4_msg(inode->i_sb, KERN_ERR,
"Inode %lu (%p): i_reserved_data_blocks (%u) not cleared!",
inode->i_ino, EXT4_I(inode),
Thanks,
Yi.
>
>>>> I have reworded some of the commit message, feel free to use it if you
>>>> think this version is better. The use of which path uses which locks was
>>>> a bit confusing in the original commit message.
>>>>
>>
>> Thanks for the message improvement, it looks more clear then mine, I will
>> use it.
>>
>
> Glad, it was helpful.
>
> -ritesh
>
Powered by blists - more mailing lists