[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <aSlaeSGfjlZbY3hB@li-dc0c254c-257c-11b2-a85c-98b6c1322444.ibm.com>
Date: Fri, 28 Nov 2025 13:46:57 +0530
From: Ojaswin Mujoo <ojaswin@...ux.ibm.com>
To: Zhang Yi <yi.zhang@...weicloud.com>
Cc: linux-ext4@...r.kernel.org, linux-fsdevel@...r.kernel.org,
linux-kernel@...r.kernel.org, tytso@....edu, adilger.kernel@...ger.ca,
jack@...e.cz, yi.zhang@...wei.com, yizhang089@...il.com,
libaokun1@...wei.com, yangerkun@...wei.com
Subject: Re: [PATCH v2 07/13] ext4: drop extent cache before splitting extent
On Thu, Nov 27, 2025 at 03:27:26PM +0800, Zhang Yi wrote:
> On 11/26/2025 8:24 PM, Ojaswin Mujoo wrote:
> > On Fri, Nov 21, 2025 at 02:08:05PM +0800, Zhang Yi wrote:
> >> From: Zhang Yi <yi.zhang@...wei.com>
> >>
> >> When splitting an unwritten extent in the middle and converting it to
> >> initialized in ext4_split_extent() with the EXT4_EXT_MAY_ZEROOUT and
> >> EXT4_EXT_DATA_VALID2 flags set, it could leave a stale unwritten extent.
> >>
> >> Assume we have an unwritten file and buffered write in the middle of it
> >> without dioread_nolock enabled, it will allocate blocks as written
> >> extent.
> >>
> >> 0 A B N
> >> [UUUUUUUUUUUU] on-disk extent U: unwritten extent
> >> [UUUUUUUUUUUU] extent status tree
> >> [--DDDDDDDD--] D: valid data
> >> |<- ->| ----> this range needs to be initialized
> >>
> >> ext4_split_extent() first try to split this extent at B with
> >> EXT4_EXT_DATA_PARTIAL_VALID1 and EXT4_EXT_MAY_ZEROOUT flag set, but
> >> ext4_split_extent_at() failed to split this extent due to temporary lack
> >> of space. It zeroout B to N and leave the entire extent as unwritten.
> >>
> >> 0 A B N
> >> [UUUUUUUUUUUU] on-disk extent
> >> [UUUUUUUUUUUU] extent status tree
> >> [--DDDDDDDDZZ] Z: zeroed data
> >>
> >> ext4_split_extent() then try to split this extent at A with
> >> EXT4_EXT_DATA_VALID2 flag set. This time, it split successfully and
> >> leave
> >> an written extent from A to N.
> >
> > Hi Yi,
> >
> > thanks for the detailed description. I'm trying to understand the
> > codepath a bit and I believe you are talking about:
> >
> > ext4_ext_handle_unwritten_extents()
> > ext4_ext_convert_to_initialized()
> > // Case 5: split 1 unwrit into 3 parts and convert to init
> > ext4_split_extent()
>
> Yes, but in fact, it should be Case 1: split the extent into three
> extents.
Yes right my bad :)
>
> >
> > in which case, after the second split succeeds
> >>
> >> 0 A B N
> >> [UU|WWWWWWWWWW] on-disk extent W: written extent
> >> [UU|UUUUUUUUUU] extent status tree
> >
> > WHen will extent status get split into 2 unwrit extents as you show
> > above? I seem to be missing that call since IIUC ext4_ext_insert_extent
> > itself doesn't seem to be accounting for the newly inserted extent in es.
> >
>
> Sorry for the confusion. This was drawn because I couldn't find a
> suitable symbol, so I followed the representation method used for
> on-disk extents. In fact, there is no splitting of extent status entries
> here. I have updated the last two graphs as follows(different types of
> extents are considered as different extents):
>
> 0 A B N
> [UUWWWWWWWWWW] on-disk extent W: written extent
> [UUUUUUUUUUUU] extent status tree
> [--DDDDDDDDZZ]
>
> 0 A B N
> [UUWWWWWWWWWW] on-disk extent W: written extent
> [UUWWWWWWWWUU] extent status tree
> [--DDDDDDDDZZ]
Thanks for confirming, I think according to our representation, the
following is what happens:
0 A B N
[UU|WWWWWWWWWW] on-disk extent W: written extent <--split
[UUUUUUUUUUUUU] extent status tree <---- no split
[---DDDDDDDDZZ]
0 A B N
[UU|WWWWWWWWWW] on-disk extent W: written extent <--split
[UU|WWWWWWW|UU] extent status tree <--- split
[---DDDDDDDZZZ]
Anyways, I just had this doubt while trying to understand this codepath
so thanks for clarifying :)
>
> Will this make it easier to understand?
>
> Cheers,
> Yi.
>
>
> > Regards,
> > ojaswin
> >
> >> [--|DDDDDDDDZZ]
> >
> >>
> >> Finally ext4_map_create_blocks() only insert extent A to B to the extent
> >> status tree, and leave an stale unwritten extent in the status tree.
> >>
> >> 0 A B N
> >> [UU|WWWWWWWWWW] on-disk extent W: written extent
> >> [UU|WWWWWWWWUU] extent status tree
> >> [--|DDDDDDDDZZ]
> >>
> >> Fix this issue by always remove cached extent status entry before
> >> splitting extent.
> >>
> >> Signed-off-by: Zhang Yi <yi.zhang@...wei.com>
> >> ---
> >> fs/ext4/extents.c | 6 ++++++
> >> 1 file changed, 6 insertions(+)
> >>
> >> diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
> >> index 2b5aec3f8882..9bb80af4b5cf 100644
> >> --- a/fs/ext4/extents.c
> >> +++ b/fs/ext4/extents.c
> >> @@ -3367,6 +3367,12 @@ static struct ext4_ext_path *ext4_split_extent(handle_t *handle,
> >> ee_len = ext4_ext_get_actual_len(ex);
> >> unwritten = ext4_ext_is_unwritten(ex);
> >>
> >> + /*
> >> + * Drop extent cache to prevent stale unwritten extents remaining
> >> + * after zeroing out.
> >> + */
> >> + ext4_es_remove_extent(inode, ee_block, ee_len);
> >> +
Okay this makes sense, there are many different combinations of how the
on disk extents might turn out and if will become complicated to keep
the es in sync to those, so this seems simpler.
There might be a small performance penalty of dropping the es here tho
but idk if it's anything measurable. As a middle ground do you think it
makes more sense to drop the es cache in ext4_split_extent_at() instead,
when we know we are about to go for zeroout. Since the default non
zeroout path seems to be okay?
Regards,
ojaswin
> >> /* Do not cache extents that are in the process of being modified. */
> >> flags |= EXT4_EX_NOCACHE;
> >>
> >> --
> >> 2.46.1
> >>
>
Powered by blists - more mailing lists