[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <9cef3b97-083e-48e6-aced-3e250df364e3@huaweicloud.com>
Date: Mon, 24 Nov 2025 13:04:04 +0800
From: Zhang Yi <yi.zhang@...weicloud.com>
To: Ojaswin Mujoo <ojaswin@...ux.ibm.com>
Cc: linux-ext4@...r.kernel.org, linux-fsdevel@...r.kernel.org,
linux-kernel@...r.kernel.org, tytso@....edu, adilger.kernel@...ger.ca,
jack@...e.cz, yi.zhang@...wei.com, yizhang089@...il.com,
libaokun1@...wei.com, yangerkun@...wei.com
Subject: Re: [PATCH v2 00/13] ext4: replace ext4_es_insert_extent() when
caching on-disk extents
Hi, Ojaswin!
On 11/23/2025 6:55 PM, Ojaswin Mujoo wrote:
> On Fri, Nov 21, 2025 at 02:07:58PM +0800, Zhang Yi wrote:
>> Changes since v1:
>> - Rebase the codes based on the latest linux-next 20251120.
>> - Add patches 01-05, fix two stale data problems caused by
>
> Hi Zhang, thanks for the patches.
>
Thank you for take time to look at this series.
> I've always felt uncomfortable with the ZEROOUT code here because it
> seems to have many such bugs as you pointed out in the series. Its very
> fragile and the bugs are easy to miss behind all the data valid and
> split flags mess.
>
Yes, I agree with you. The implementation of EXT4_EXT_MAY_ZEROOUT has
significantly increased the complexity of split extents and the
potential for bugs.
> As per my understanding, ZEROOUT logic seems to be a special best-effort
> try to make the split/convert operation "work" when dealing with
> transient errors like ENOSPC etc. I was just wondering if it makes sense
> to just get rid of the whole ZEROOUT logic completely and just reset the
> extent to orig state if there is any error. This allows us to get rid of
> DATA_VALID* flags as well and makes the whole ext4_split_convert_extents()
> slightly less messy.
>
> Maybe we can have a retry loop at the top level caller if we want to try
> again for say ENOSPC or ENOMEM.
>
> Would love to hear your thoughts on it.
>
I think this is a direction worth exploring. However, what I am
currently considering is that we need to address this scenario of
splitting extent during the I/O completion. Although the ZEROOUT logic
is fragile and has many issues recently, it currently serves as a
fallback solution for handling ENOSPC errors that arise when splitting
extents during I/O completion. It ensures that I/O operations do not
fail due to insufficient extent blocks.
Please see ext4_convert_unwritten_extents_endio(). Although we have made
our best effort to tried to split extents using
EXT4_GET_BLOCKS_IO_CREATE_EXT before issuing I/Os, we still have not
covered all scenarios. Moreover, after converting the buffered I/O path
to the iomap infrastructure in the future, we may need to split extents
during the I/O completion worker[1].
In most block allocation processes, we already have a retry loop to deal
with ENOSPC or ENOMEM, such as ext4_should_retry_alloc(). However, it
doesn't seem appropriate to place this logic into the I/O completion
handling process (I haven't thought this solution through deeply yet,
but I'm afraid it could introduce potential deadlock risks due to its
involvement with journal operations), and we can't just simply try again.
If we remove the ZEROOUT logic, we may lose our last line of defense
during the I/O completion.
Currently, I am considering whether it is possible to completely remove
EXT4_GET_BLOCKS_IO_CREATE_EXT so that extents are not split before
submitting I/Os; instead, all splitting would be performed when
converting extents to written after the I/O completes. Based on my patch,
"ext4: use reserved metadata blocks when splitting extent on endio"[2],
and the ZEROOUT logic, this approach appears feasible, and xfstest-bld
shows no regressions.
So I think the ZEROOUT logic remains somewhat useful until we find better
solution(e.g., making more precise reservations for metadata). Perhaps we
can refactor both the split extent and ZEROOUT logic to make them more
concise.
[1] https://lore.kernel.org/linux-ext4/20241022111059.2566137-18-yi.zhang@huaweicloud.com/
[2] https://lore.kernel.org/linux-ext4/20241022111059.2566137-12-yi.zhang@huaweicloud.com/
Cheers,
Yi.
> Thanks,
> Ojaswin
>
>> EXT4_EXT_MAY_ZEROOUT when splitting extent.
>> - Add patches 06-07, fix two stale extent status entries problems also
>> caused by splitting extent.
>> - Modify patches 08-10, extend __es_remove_extent() and
>> ext4_es_cache_extent() to allow them to overwrite existing extents of
>> the same status when caching on-disk extents, while also checking
>> extents of different stauts and raising alarms to prevent misuse.
>> - Add patch 13 to clear the usage of ext4_es_insert_extent(), and
>> remove the TODO comment in it.
>>
>> v1: https://lore.kernel.org/linux-ext4/20251031062905.4135909-1-yi.zhang@huaweicloud.com/
>>
>> Original Description
>>
>> This series addresses the optimization that Jan pointed out [1]
>> regarding the introduction of a sequence number to
>> ext4_es_insert_extent(). The proposal is to replace all instances where
>> the cache of on-disk extents is updated by using ext4_es_cache_extent()
>> instead of ext4_es_insert_extent(). This change can prevent excessive
>> cache invalidations caused by unnecessarily increasing the extent
>> sequence number when reading from the on-disk extent tree.
>>
>> [1] https://lore.kernel.org/linux-ext4/ympvfypw3222g2k4xzd5pba4zhkz5jihw4td67iixvrqhuu43y@wse63ntv4s6u/
>>
>> Cheers,
>> Yi.
>>
>> Zhang Yi (13):
>> ext4: cleanup zeroout in ext4_split_extent_at()
>> ext4: subdivide EXT4_EXT_DATA_VALID1
>> ext4: don't zero the entire extent if EXT4_EXT_DATA_PARTIAL_VALID1
>> ext4: don't set EXT4_GET_BLOCKS_CONVERT when splitting before
>> submitting I/O
>> ext4: correct the mapping status if the extent has been zeroed
>> ext4: don't cache extent during splitting extent
>> ext4: drop extent cache before splitting extent
>> ext4: cleanup useless out tag in __es_remove_extent()
>> ext4: make __es_remove_extent() check extent status
>> ext4: make ext4_es_cache_extent() support overwrite existing extents
>> ext4: adjust the debug info in ext4_es_cache_extent()
>> ext4: replace ext4_es_insert_extent() when caching on-disk extents
>> ext4: drop the TODO comment in ext4_es_insert_extent()
>>
>> fs/ext4/extents.c | 127 +++++++++++++++++++++++----------------
>> fs/ext4/extents_status.c | 121 ++++++++++++++++++++++++++++---------
>> fs/ext4/inode.c | 18 +++---
>> 3 files changed, 176 insertions(+), 90 deletions(-)
>>
>> --
>> 2.46.1
>>
Powered by blists - more mailing lists