[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <aSRUohHsq3MsiGv0@li-dc0c254c-257c-11b2-a85c-98b6c1322444.ibm.com>
Date: Mon, 24 Nov 2025 18:20:42 +0530
From: Ojaswin Mujoo <ojaswin@...ux.ibm.com>
To: Zhang Yi <yi.zhang@...weicloud.com>
Cc: linux-ext4@...r.kernel.org, linux-fsdevel@...r.kernel.org,
linux-kernel@...r.kernel.org, tytso@....edu, adilger.kernel@...ger.ca,
jack@...e.cz, yi.zhang@...wei.com, yizhang089@...il.com,
libaokun1@...wei.com, yangerkun@...wei.com
Subject: Re: [PATCH v2 00/13] ext4: replace ext4_es_insert_extent() when
caching on-disk extents
On Mon, Nov 24, 2025 at 01:04:04PM +0800, Zhang Yi wrote:
> Hi, Ojaswin!
>
> On 11/23/2025 6:55 PM, Ojaswin Mujoo wrote:
> > On Fri, Nov 21, 2025 at 02:07:58PM +0800, Zhang Yi wrote:
> >> Changes since v1:
> >> - Rebase the codes based on the latest linux-next 20251120.
> >> - Add patches 01-05, fix two stale data problems caused by
> >
> > Hi Zhang, thanks for the patches.
> >
>
> Thank you for take time to look at this series.
>
> > I've always felt uncomfortable with the ZEROOUT code here because it
> > seems to have many such bugs as you pointed out in the series. Its very
> > fragile and the bugs are easy to miss behind all the data valid and
> > split flags mess.
> >
>
> Yes, I agree with you. The implementation of EXT4_EXT_MAY_ZEROOUT has
> significantly increased the complexity of split extents and the
> potential for bugs.
>
> > As per my understanding, ZEROOUT logic seems to be a special best-effort
> > try to make the split/convert operation "work" when dealing with
> > transient errors like ENOSPC etc. I was just wondering if it makes sense
> > to just get rid of the whole ZEROOUT logic completely and just reset the
> > extent to orig state if there is any error. This allows us to get rid of
> > DATA_VALID* flags as well and makes the whole ext4_split_convert_extents()
> > slightly less messy.
> >
> > Maybe we can have a retry loop at the top level caller if we want to try
> > again for say ENOSPC or ENOMEM.
> >
> > Would love to hear your thoughts on it.
> >
>
> I think this is a direction worth exploring. However, what I am
> currently considering is that we need to address this scenario of
> splitting extent during the I/O completion. Although the ZEROOUT logic
> is fragile and has many issues recently, it currently serves as a
> fallback solution for handling ENOSPC errors that arise when splitting
> extents during I/O completion. It ensures that I/O operations do not
> fail due to insufficient extent blocks.
>
> Please see ext4_convert_unwritten_extents_endio(). Although we have made
> our best effort to tried to split extents using
> EXT4_GET_BLOCKS_IO_CREATE_EXT before issuing I/Os, we still have not
> covered all scenarios. Moreover, after converting the buffered I/O path
> to the iomap infrastructure in the future, we may need to split extents
> during the I/O completion worker[1].
>
> In most block allocation processes, we already have a retry loop to deal
> with ENOSPC or ENOMEM, such as ext4_should_retry_alloc(). However, it
> doesn't seem appropriate to place this logic into the I/O completion
> handling process (I haven't thought this solution through deeply yet,
> but I'm afraid it could introduce potential deadlock risks due to its
> involvement with journal operations), and we can't just simply try again.
> If we remove the ZEROOUT logic, we may lose our last line of defense
> during the I/O completion.
>
> Currently, I am considering whether it is possible to completely remove
> EXT4_GET_BLOCKS_IO_CREATE_EXT so that extents are not split before
> submitting I/Os; instead, all splitting would be performed when
> converting extents to written after the I/O completes. Based on my patch,
> "ext4: use reserved metadata blocks when splitting extent on endio"[2],
> and the ZEROOUT logic, this approach appears feasible, and xfstest-bld
> shows no regressions.
>
> So I think the ZEROOUT logic remains somewhat useful until we find better
> solution(e.g., making more precise reservations for metadata). Perhaps we
> can refactor both the split extent and ZEROOUT logic to make them more
> concise.
Hi Yi,
Okay it makese sense to keep the zeroout if iomap path is planning to
shift the extent splitting to endio. Plus, I agree based on the comments
in the ext4_convert_unwritte_extents_endio() that we might even today
need to split in endio (although i cant recall when this happens) which
would need a zeroout fallback.
And yes, I think refactoring the whole logic to be less confusing would
be better. I had an older unposted untested patch cleaning up some of
this, I was looking at it again today and there seems to be a lot of
cleanups we can do here but that becomes out of scope of this patchset I
believe.
Sure then, lets keep it as it is for now. I'll review the changes you
made and later I can post a patch refactoring this area.
Regards,
ojaswin
>
> [1] https://lore.kernel.org/linux-ext4/20241022111059.2566137-18-yi.zhang@huaweicloud.com/
> [2] https://lore.kernel.org/linux-ext4/20241022111059.2566137-12-yi.zhang@huaweicloud.com/
>
> Cheers,
> Yi.
>
> > Thanks,
> > Ojaswin
> >
> >> EXT4_EXT_MAY_ZEROOUT when splitting extent.
> >> - Add patches 06-07, fix two stale extent status entries problems also
> >> caused by splitting extent.
> >> - Modify patches 08-10, extend __es_remove_extent() and
> >> ext4_es_cache_extent() to allow them to overwrite existing extents of
> >> the same status when caching on-disk extents, while also checking
> >> extents of different stauts and raising alarms to prevent misuse.
> >> - Add patch 13 to clear the usage of ext4_es_insert_extent(), and
> >> remove the TODO comment in it.
> >>
> >> v1: https://lore.kernel.org/linux-ext4/20251031062905.4135909-1-yi.zhang@huaweicloud.com/
> >>
> >> Original Description
> >>
> >> This series addresses the optimization that Jan pointed out [1]
> >> regarding the introduction of a sequence number to
> >> ext4_es_insert_extent(). The proposal is to replace all instances where
> >> the cache of on-disk extents is updated by using ext4_es_cache_extent()
> >> instead of ext4_es_insert_extent(). This change can prevent excessive
> >> cache invalidations caused by unnecessarily increasing the extent
> >> sequence number when reading from the on-disk extent tree.
> >>
> >> [1] https://lore.kernel.org/linux-ext4/ympvfypw3222g2k4xzd5pba4zhkz5jihw4td67iixvrqhuu43y@wse63ntv4s6u/
> >>
> >> Cheers,
> >> Yi.
> >>
> >> Zhang Yi (13):
> >> ext4: cleanup zeroout in ext4_split_extent_at()
> >> ext4: subdivide EXT4_EXT_DATA_VALID1
> >> ext4: don't zero the entire extent if EXT4_EXT_DATA_PARTIAL_VALID1
> >> ext4: don't set EXT4_GET_BLOCKS_CONVERT when splitting before
> >> submitting I/O
> >> ext4: correct the mapping status if the extent has been zeroed
> >> ext4: don't cache extent during splitting extent
> >> ext4: drop extent cache before splitting extent
> >> ext4: cleanup useless out tag in __es_remove_extent()
> >> ext4: make __es_remove_extent() check extent status
> >> ext4: make ext4_es_cache_extent() support overwrite existing extents
> >> ext4: adjust the debug info in ext4_es_cache_extent()
> >> ext4: replace ext4_es_insert_extent() when caching on-disk extents
> >> ext4: drop the TODO comment in ext4_es_insert_extent()
> >>
> >> fs/ext4/extents.c | 127 +++++++++++++++++++++++----------------
> >> fs/ext4/extents_status.c | 121 ++++++++++++++++++++++++++++---------
> >> fs/ext4/inode.c | 18 +++---
> >> 3 files changed, 176 insertions(+), 90 deletions(-)
> >>
> >> --
> >> 2.46.1
> >>
>
Powered by blists - more mailing lists