lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <aSRUohHsq3MsiGv0@li-dc0c254c-257c-11b2-a85c-98b6c1322444.ibm.com>
Date: Mon, 24 Nov 2025 18:20:42 +0530
From: Ojaswin Mujoo <ojaswin@...ux.ibm.com>
To: Zhang Yi <yi.zhang@...weicloud.com>
Cc: linux-ext4@...r.kernel.org, linux-fsdevel@...r.kernel.org,
        linux-kernel@...r.kernel.org, tytso@....edu, adilger.kernel@...ger.ca,
        jack@...e.cz, yi.zhang@...wei.com, yizhang089@...il.com,
        libaokun1@...wei.com, yangerkun@...wei.com
Subject: Re: [PATCH v2 00/13] ext4: replace ext4_es_insert_extent() when
 caching on-disk extents

On Mon, Nov 24, 2025 at 01:04:04PM +0800, Zhang Yi wrote:
> Hi, Ojaswin!
> 
> On 11/23/2025 6:55 PM, Ojaswin Mujoo wrote:
> > On Fri, Nov 21, 2025 at 02:07:58PM +0800, Zhang Yi wrote:
> >> Changes since v1:
> >>  - Rebase the codes based on the latest linux-next 20251120.
> >>  - Add patches 01-05, fix two stale data problems caused by
> > 
> > Hi Zhang, thanks for the patches.
> > 
> 
> Thank you for take time to look at this series.
> 
> > I've always felt uncomfortable with the ZEROOUT code here because it
> > seems to have many such bugs as you pointed out in the series. Its very
> > fragile and the bugs are easy to miss behind all the data valid and
> > split flags mess. 
> > 
> 
> Yes, I agree with you. The implementation of EXT4_EXT_MAY_ZEROOUT has
> significantly increased the complexity of split extents and the
> potential for bugs.
> 
> > As per my understanding, ZEROOUT logic seems to be a special best-effort
> > try to make the split/convert operation "work" when dealing with
> > transient errors like ENOSPC etc. I was just wondering if it makes sense
> > to just get rid of the whole ZEROOUT logic completely and just reset the
> > extent to orig state if there is any error. This allows us to get rid of
> > DATA_VALID* flags as well and makes the whole ext4_split_convert_extents() 
> > slightly less messy.
> > 
> > Maybe we can have a retry loop at the top level caller if we want to try
> > again for say ENOSPC or ENOMEM. 
> > 
> > Would love to hear your thoughts on it.
> > 
> 
> I think this is a direction worth exploring. However, what I am
> currently considering is that we need to address this scenario of
> splitting extent during the I/O completion. Although the ZEROOUT logic
> is fragile and has many issues recently, it currently serves as a
> fallback solution for handling ENOSPC errors that arise when splitting
> extents during I/O completion. It ensures that I/O operations do not
> fail due to insufficient extent blocks.
> 
> Please see ext4_convert_unwritten_extents_endio(). Although we have made
> our best effort to tried to split extents using
> EXT4_GET_BLOCKS_IO_CREATE_EXT before issuing I/Os, we still have not
> covered all scenarios. Moreover, after converting the buffered I/O path
> to the iomap infrastructure in the future, we may need to split extents
> during the I/O completion worker[1].
> 
> In most block allocation processes, we already have a retry loop to deal
> with ENOSPC or ENOMEM, such as ext4_should_retry_alloc(). However, it
> doesn't seem appropriate to place this logic into the I/O completion
> handling process (I haven't thought this solution through deeply yet,
> but I'm afraid it could introduce potential deadlock risks due to its
> involvement with journal operations), and we can't just simply try again.
> If we remove the ZEROOUT logic, we may lose our last line of defense
> during the I/O completion.
> 
> Currently, I am considering whether it is possible to completely remove
> EXT4_GET_BLOCKS_IO_CREATE_EXT so that extents are not split before
> submitting I/Os; instead, all splitting would be performed when
> converting extents to written after the I/O completes. Based on my patch,
> "ext4: use reserved metadata blocks when splitting extent on endio"[2],
> and the ZEROOUT logic, this approach appears feasible, and xfstest-bld
> shows no regressions.
> 
> So I think the ZEROOUT logic remains somewhat useful until we find better
> solution(e.g., making more precise reservations for metadata). Perhaps we
> can refactor both the split extent and ZEROOUT logic to make them more
> concise.

Hi Yi,

Okay it makese sense to keep the zeroout if iomap path is planning to
shift the extent splitting to endio. Plus, I agree based on the comments
in the ext4_convert_unwritte_extents_endio() that we might even today
need to split in endio (although i cant recall when this happens) which
would need a zeroout fallback.

And yes, I think refactoring the whole logic to be less confusing would
be better. I had an older unposted untested patch cleaning up some of
this, I was looking at it again today and there seems to be a lot of
cleanups we can do here but that becomes out of scope of this patchset I
believe.

Sure then, lets keep it as it is for now. I'll review the changes you
made and later I can post a patch refactoring this area.

Regards,
ojaswin

> 
> [1] https://lore.kernel.org/linux-ext4/20241022111059.2566137-18-yi.zhang@huaweicloud.com/
> [2] https://lore.kernel.org/linux-ext4/20241022111059.2566137-12-yi.zhang@huaweicloud.com/
> 
> Cheers,
> Yi.
> 
> > Thanks,
> > Ojaswin
> > 
> >>    EXT4_EXT_MAY_ZEROOUT when splitting extent.
> >>  - Add patches 06-07, fix two stale extent status entries problems also
> >>    caused by splitting extent.
> >>  - Modify patches 08-10, extend __es_remove_extent() and
> >>    ext4_es_cache_extent() to allow them to overwrite existing extents of
> >>    the same status when caching on-disk extents, while also checking
> >>    extents of different stauts and raising alarms to prevent misuse.
> >>  - Add patch 13 to clear the usage of ext4_es_insert_extent(), and
> >>    remove the TODO comment in it.
> >>
> >> v1: https://lore.kernel.org/linux-ext4/20251031062905.4135909-1-yi.zhang@huaweicloud.com/
> >>
> >> Original Description
> >>
> >> This series addresses the optimization that Jan pointed out [1]
> >> regarding the introduction of a sequence number to
> >> ext4_es_insert_extent(). The proposal is to replace all instances where
> >> the cache of on-disk extents is updated by using ext4_es_cache_extent()
> >> instead of ext4_es_insert_extent(). This change can prevent excessive
> >> cache invalidations caused by unnecessarily increasing the extent
> >> sequence number when reading from the on-disk extent tree.
> >>
> >> [1] https://lore.kernel.org/linux-ext4/ympvfypw3222g2k4xzd5pba4zhkz5jihw4td67iixvrqhuu43y@wse63ntv4s6u/
> >>
> >> Cheers,
> >> Yi.
> >>
> >> Zhang Yi (13):
> >>   ext4: cleanup zeroout in ext4_split_extent_at()
> >>   ext4: subdivide EXT4_EXT_DATA_VALID1
> >>   ext4: don't zero the entire extent if EXT4_EXT_DATA_PARTIAL_VALID1
> >>   ext4: don't set EXT4_GET_BLOCKS_CONVERT when splitting before
> >>     submitting I/O
> >>   ext4: correct the mapping status if the extent has been zeroed
> >>   ext4: don't cache extent during splitting extent
> >>   ext4: drop extent cache before splitting extent
> >>   ext4: cleanup useless out tag in __es_remove_extent()
> >>   ext4: make __es_remove_extent() check extent status
> >>   ext4: make ext4_es_cache_extent() support overwrite existing extents
> >>   ext4: adjust the debug info in ext4_es_cache_extent()
> >>   ext4: replace ext4_es_insert_extent() when caching on-disk extents
> >>   ext4: drop the TODO comment in ext4_es_insert_extent()
> >>
> >>  fs/ext4/extents.c        | 127 +++++++++++++++++++++++----------------
> >>  fs/ext4/extents_status.c | 121 ++++++++++++++++++++++++++++---------
> >>  fs/ext4/inode.c          |  18 +++---
> >>  3 files changed, 176 insertions(+), 90 deletions(-)
> >>
> >> -- 
> >> 2.46.1
> >>
> 

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ