[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <646a8643-3d4e-4213-8922-34574a18231e@huaweicloud.com>
Date: Mon, 24 Nov 2025 22:05:53 +0800
From: Zhang Yi <yi.zhang@...weicloud.com>
To: Ojaswin Mujoo <ojaswin@...ux.ibm.com>
Cc: linux-ext4@...r.kernel.org, linux-fsdevel@...r.kernel.org,
linux-kernel@...r.kernel.org, tytso@....edu, adilger.kernel@...ger.ca,
jack@...e.cz, yi.zhang@...wei.com, yizhang089@...il.com,
libaokun1@...wei.com, yangerkun@...wei.com
Subject: Re: [PATCH v2 00/13] ext4: replace ext4_es_insert_extent() when
caching on-disk extents
On 11/24/2025 8:50 PM, Ojaswin Mujoo wrote:
> On Mon, Nov 24, 2025 at 01:04:04PM +0800, Zhang Yi wrote:
>> Hi, Ojaswin!
>>
>> On 11/23/2025 6:55 PM, Ojaswin Mujoo wrote:
>>> On Fri, Nov 21, 2025 at 02:07:58PM +0800, Zhang Yi wrote:
>>>> Changes since v1:
>>>> - Rebase the codes based on the latest linux-next 20251120.
>>>> - Add patches 01-05, fix two stale data problems caused by
>>>
>>> Hi Zhang, thanks for the patches.
>>>
>>
>> Thank you for take time to look at this series.
>>
>>> I've always felt uncomfortable with the ZEROOUT code here because it
>>> seems to have many such bugs as you pointed out in the series. Its very
>>> fragile and the bugs are easy to miss behind all the data valid and
>>> split flags mess.
>>>
>>
>> Yes, I agree with you. The implementation of EXT4_EXT_MAY_ZEROOUT has
>> significantly increased the complexity of split extents and the
>> potential for bugs.
>>
>>> As per my understanding, ZEROOUT logic seems to be a special best-effort
>>> try to make the split/convert operation "work" when dealing with
>>> transient errors like ENOSPC etc. I was just wondering if it makes sense
>>> to just get rid of the whole ZEROOUT logic completely and just reset the
>>> extent to orig state if there is any error. This allows us to get rid of
>>> DATA_VALID* flags as well and makes the whole ext4_split_convert_extents()
>>> slightly less messy.
>>>
>>> Maybe we can have a retry loop at the top level caller if we want to try
>>> again for say ENOSPC or ENOMEM.
>>>
>>> Would love to hear your thoughts on it.
>>>
>>
>> I think this is a direction worth exploring. However, what I am
>> currently considering is that we need to address this scenario of
>> splitting extent during the I/O completion. Although the ZEROOUT logic
>> is fragile and has many issues recently, it currently serves as a
>> fallback solution for handling ENOSPC errors that arise when splitting
>> extents during I/O completion. It ensures that I/O operations do not
>> fail due to insufficient extent blocks.
>>
>> Please see ext4_convert_unwritten_extents_endio(). Although we have made
>> our best effort to tried to split extents using
>> EXT4_GET_BLOCKS_IO_CREATE_EXT before issuing I/Os, we still have not
>> covered all scenarios. Moreover, after converting the buffered I/O path
>> to the iomap infrastructure in the future, we may need to split extents
>> during the I/O completion worker[1].
>>
>> In most block allocation processes, we already have a retry loop to deal
>> with ENOSPC or ENOMEM, such as ext4_should_retry_alloc(). However, it
>> doesn't seem appropriate to place this logic into the I/O completion
>> handling process (I haven't thought this solution through deeply yet,
>> but I'm afraid it could introduce potential deadlock risks due to its
>> involvement with journal operations), and we can't just simply try again.
>> If we remove the ZEROOUT logic, we may lose our last line of defense
>> during the I/O completion.
>>
>> Currently, I am considering whether it is possible to completely remove
>> EXT4_GET_BLOCKS_IO_CREATE_EXT so that extents are not split before
>> submitting I/Os; instead, all splitting would be performed when
>> converting extents to written after the I/O completes. Based on my patch,
>> "ext4: use reserved metadata blocks when splitting extent on endio"[2],
>> and the ZEROOUT logic, this approach appears feasible, and xfstest-bld
>> shows no regressions.
>>
>> So I think the ZEROOUT logic remains somewhat useful until we find better
>> solution(e.g., making more precise reservations for metadata). Perhaps we
>> can refactor both the split extent and ZEROOUT logic to make them more
>> concise.
>
> Hi Yi,
>
> Okay it makese sense to keep the zeroout if iomap path is planning to
> shift the extent splitting to endio. Plus, I agree based on the comments
> in the ext4_convert_unwritte_extents_endio() that we might even today
> need to split in endio (although i cant recall when this happens) which
> would need a zeroout fallback.
A relatively common case is the concurrency between folio write-back and
fallocate. After an unwritten extent is allocated during the writeback
process, if fallocate is performed before the I/O completes, the unwritten
extent created by fallocate will merge with the writeback portion.
Consequently, a split operation is required once the I/O completes.
>
> And yes, I think refactoring the whole logic to be less confusing would
> be better. I had an older unposted untested patch cleaning up some of
> this, I was looking at it again today and there seems to be a lot of
> cleanups we can do here but that becomes out of scope of this patchset I
> believe.
>
> Sure then, lets keep it as it is for now. I'll review the changes you
> made and later I can post a patch refactoring this area.
OK, thank you a lot for your review and look forward to your patch.
Thanks,
Yi.
>
> Regards,
> ojaswin
>
>>
>> [1] https://lore.kernel.org/linux-ext4/20241022111059.2566137-18-yi.zhang@huaweicloud.com/
>> [2] https://lore.kernel.org/linux-ext4/20241022111059.2566137-12-yi.zhang@huaweicloud.com/
>>
>> Cheers,
>> Yi.
>>
>>> Thanks,
>>> Ojaswin
>>>
>>>> EXT4_EXT_MAY_ZEROOUT when splitting extent.
>>>> - Add patches 06-07, fix two stale extent status entries problems also
>>>> caused by splitting extent.
>>>> - Modify patches 08-10, extend __es_remove_extent() and
>>>> ext4_es_cache_extent() to allow them to overwrite existing extents of
>>>> the same status when caching on-disk extents, while also checking
>>>> extents of different stauts and raising alarms to prevent misuse.
>>>> - Add patch 13 to clear the usage of ext4_es_insert_extent(), and
>>>> remove the TODO comment in it.
>>>>
>>>> v1: https://lore.kernel.org/linux-ext4/20251031062905.4135909-1-yi.zhang@huaweicloud.com/
>>>>
>>>> Original Description
>>>>
>>>> This series addresses the optimization that Jan pointed out [1]
>>>> regarding the introduction of a sequence number to
>>>> ext4_es_insert_extent(). The proposal is to replace all instances where
>>>> the cache of on-disk extents is updated by using ext4_es_cache_extent()
>>>> instead of ext4_es_insert_extent(). This change can prevent excessive
>>>> cache invalidations caused by unnecessarily increasing the extent
>>>> sequence number when reading from the on-disk extent tree.
>>>>
>>>> [1] https://lore.kernel.org/linux-ext4/ympvfypw3222g2k4xzd5pba4zhkz5jihw4td67iixvrqhuu43y@wse63ntv4s6u/
>>>>
>>>> Cheers,
>>>> Yi.
>>>>
>>>> Zhang Yi (13):
>>>> ext4: cleanup zeroout in ext4_split_extent_at()
>>>> ext4: subdivide EXT4_EXT_DATA_VALID1
>>>> ext4: don't zero the entire extent if EXT4_EXT_DATA_PARTIAL_VALID1
>>>> ext4: don't set EXT4_GET_BLOCKS_CONVERT when splitting before
>>>> submitting I/O
>>>> ext4: correct the mapping status if the extent has been zeroed
>>>> ext4: don't cache extent during splitting extent
>>>> ext4: drop extent cache before splitting extent
>>>> ext4: cleanup useless out tag in __es_remove_extent()
>>>> ext4: make __es_remove_extent() check extent status
>>>> ext4: make ext4_es_cache_extent() support overwrite existing extents
>>>> ext4: adjust the debug info in ext4_es_cache_extent()
>>>> ext4: replace ext4_es_insert_extent() when caching on-disk extents
>>>> ext4: drop the TODO comment in ext4_es_insert_extent()
>>>>
>>>> fs/ext4/extents.c | 127 +++++++++++++++++++++++----------------
>>>> fs/ext4/extents_status.c | 121 ++++++++++++++++++++++++++++---------
>>>> fs/ext4/inode.c | 18 +++---
>>>> 3 files changed, 176 insertions(+), 90 deletions(-)
>>>>
>>>> --
>>>> 2.46.1
>>>>
>>
Powered by blists - more mailing lists