linux-kernel - Re: [PATCH -next 2/7] ext4: don't split extent before submitting I/O

Message-ID: <7vuttijv2pqx2lgan5rkcw6ofi4uhrsfbmksg4doyq34rjidte@mnfd6cbehncq>
Date: Fri, 19 Dec 2025 16:17:59 +0100
From: Jan Kara <jack@...e.cz>
To: Zhang Yi <yi.zhang@...weicloud.com>
Cc: linux-ext4@...r.kernel.org, linux-fsdevel@...r.kernel.org, 
	linux-kernel@...r.kernel.org, tytso@....edu, adilger.kernel@...ger.ca, jack@...e.cz, 
	ojaswin@...ux.ibm.com, ritesh.list@...il.com, yi.zhang@...wei.com, yizhang089@...il.com, 
	libaokun1@...wei.com, yangerkun@...wei.com, yukuai@...as.com
Subject: Re: [PATCH -next 2/7] ext4: don't split extent before submitting I/O

On Sat 13-12-25 10:20:03, Zhang Yi wrote:
> From: Zhang Yi <yi.zhang@...wei.com>
> 
> Currently, when writing back dirty pages to the filesystem with the
> dioread_nolock feature enabled and when doing DIO, if the area to be
> written back is part of an unwritten extent, the
> EXT4_GET_BLOCKS_IO_CREATE_EXT flag is set during block allocation before
> submitting I/O. The function ext4_split_convert_extents() then attempts
> to split this extent in advance. This approach is designed to prevent
> extent splitting and conversion to the written type from failing due to
> insufficient disk space at the time of I/O completion, which could
> otherwise result in data loss.
> 
> However, we already have two mechanisms to ensure successful extent
> conversion. The first is the EXT4_GET_BLOCKS_METADATA_NOFAIL flag, which
> is best effort: it permits the use of 2% of the reserved space or
> 4,096 blocks in the file system when splitting extents. This flag covers
> most scenarios where extent splitting might fail. The second is the
> EXT4_EXT_MAY_ZEROOUT flag, which is also set during extent splitting. If
> the reserved space is insufficient and splitting fails, it does not
> retry the allocation. Instead, it directly zeros out the extra part of
> the extent, thereby avoiding splitting and directly converting the
> entire extent to the written type.
> 
> These two mechanisms also exist when I/Os are completed because there is
> a concurrency window between write-back and fallocate, which may still
> require us to split extents upon I/O completion. There is not much
> difference from splitting extents before submitting I/O. Therefore,
> it seems possible to defer the splitting until I/O completion; it won't
> increase the risk of I/O failure and data loss. On the contrary, if some
> I/Os can be merged at I/O completion, it can also reduce unnecessary
> splitting operations, thereby alleviating the pressure on reserved
> space.
> 
> In addition, deferring extent splitting until I/O completion can
> also simplify the I/O submission process and avoid initiating unnecessary
> journal handles when writing unwritten extents.
> 
> Signed-off-by: Zhang Yi <yi.zhang@...wei.com>

Looks good. Feel free to add:

Reviewed-by: Jan Kara <jack@...e.cz>

								Honza

> ---
>  fs/ext4/extents.c | 13 +------------
>  fs/ext4/inode.c   |  4 ++--
>  2 files changed, 3 insertions(+), 14 deletions(-)
> 
> diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
> index e53959120b04..c98f7c5482b4 100644
> --- a/fs/ext4/extents.c
> +++ b/fs/ext4/extents.c
> @@ -3787,21 +3787,10 @@ ext4_convert_unwritten_extents_endio(handle_t *handle, struct inode *inode,
>  	ext_debug(inode, "logical block %llu, max_blocks %u\n",
>  		  (unsigned long long)ee_block, ee_len);
>  
> -	/* If extent is larger than requested it is a clear sign that we still
> -	 * have some extent state machine issues left. So extent_split is still
> -	 * required.
> -	 * TODO: Once all related issues will be fixed this situation should be
> -	 * illegal.
> -	 */
>  	if (ee_block != map->m_lblk || ee_len > map->m_len) {
>  		int flags = EXT4_GET_BLOCKS_CONVERT |
>  			    EXT4_GET_BLOCKS_METADATA_NOFAIL;
> -#ifdef CONFIG_EXT4_DEBUG
> -		ext4_warning(inode->i_sb, "Inode (%ld) finished: extent logical block %llu,"
> -			     " len %u; IO logical block %llu, len %u",
> -			     inode->i_ino, (unsigned long long)ee_block, ee_len,
> -			     (unsigned long long)map->m_lblk, map->m_len);
> -#endif
> +
>  		path = ext4_split_convert_extents(handle, inode, map, path,
>  						  flags, NULL);
>  		if (IS_ERR(path))
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index bb8165582840..ffde24ff7347 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -2376,7 +2376,7 @@ static int mpage_map_one_extent(handle_t *handle, struct mpage_da_data *mpd)
>  
>  	dioread_nolock = ext4_should_dioread_nolock(inode);
>  	if (dioread_nolock)
> -		get_blocks_flags |= EXT4_GET_BLOCKS_IO_CREATE_EXT;
> +		get_blocks_flags |= EXT4_GET_BLOCKS_UNWRIT_EXT;
>  
>  	err = ext4_map_blocks(handle, inode, map, get_blocks_flags);
>  	if (err < 0)
> @@ -3744,7 +3744,7 @@ static int ext4_iomap_alloc(struct inode *inode, struct ext4_map_blocks *map,
>  	else if (EXT4_LBLK_TO_B(inode, map->m_lblk) >= i_size_read(inode))
>  		m_flags = EXT4_GET_BLOCKS_CREATE;
>  	else if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
> -		m_flags = EXT4_GET_BLOCKS_IO_CREATE_EXT;
> +		m_flags = EXT4_GET_BLOCKS_CREATE_UNWRIT_EXT;
>  
>  	if (flags & IOMAP_ATOMIC)
>  		ret = ext4_map_blocks_atomic_write(handle, inode, map, m_flags,
> -- 
> 2.46.1
> 
-- 
Jan Kara <jack@...e.com>
SUSE Labs, CR
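
For readers of the archive, the order of the two fallback mechanisms described in the commit message can be sketched as a small userspace model. This is only an illustration under stated assumptions, not the ext4 implementation: struct toy_fs, enum convert_result and convert_unwritten() are hypothetical names, and the two boolean parameters merely stand in for EXT4_GET_BLOCKS_METADATA_NOFAIL and EXT4_EXT_MAY_ZEROOUT.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical outcome labels for this model; not kernel identifiers. */
enum convert_result {
	CONVERT_SPLIT_NORMAL,   /* split paid for from ordinary free blocks */
	CONVERT_SPLIT_RESERVED, /* split paid for from the reserved pool */
	CONVERT_ZEROOUT,        /* no split: extra part zeroed, whole extent written */
	CONVERT_FAILED          /* neither fallback was available */
};

/* Toy filesystem state: two block counters, nothing else. */
struct toy_fs {
	uint64_t free_blocks;     /* blocks any allocation may use */
	uint64_t reserved_blocks; /* pool only "nofail" requests may use */
};

/*
 * Model of converting part of an unwritten extent to written.
 * meta_blocks_needed is the (assumed) number of metadata blocks the
 * split would consume. metadata_nofail and may_zeroout stand in for
 * the two flags named in the commit message.
 */
static enum convert_result convert_unwritten(struct toy_fs *fs,
					     uint64_t meta_blocks_needed,
					     bool metadata_nofail,
					     bool may_zeroout)
{
	assert(fs != NULL);

	/* Common case: enough ordinary free space for the split. */
	if (fs->free_blocks >= meta_blocks_needed) {
		fs->free_blocks -= meta_blocks_needed;
		return CONVERT_SPLIT_NORMAL;
	}

	/* First mechanism: best-effort use of the reserved pool. */
	if (metadata_nofail && fs->reserved_blocks >= meta_blocks_needed) {
		fs->reserved_blocks -= meta_blocks_needed;
		return CONVERT_SPLIT_RESERVED;
	}

	/*
	 * Second mechanism: do not retry the allocation; zero out the
	 * extra part and convert the entire extent instead of splitting.
	 */
	if (may_zeroout)
		return CONVERT_ZEROOUT;

	return CONVERT_FAILED;
}
```

In this model the same chain applies whether the conversion is attempted before I/O submission or at I/O completion, which is the point the commit message makes when arguing that deferring the split does not add risk.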