Message-ID: <20250515191210.GR25655@frogsfrogsfrogs>
Date: Thu, 15 May 2025 12:12:10 -0700
From: "Darrick J. Wong" <djwong@...nel.org>
To: Ritesh Harjani <ritesh.list@...il.com>
Cc: linux-ext4@...r.kernel.org, Theodore Ts'o <tytso@....edu>,
	Jan Kara <jack@...e.cz>, John Garry <john.g.garry@...cle.com>,
	Ojaswin Mujoo <ojaswin@...ux.ibm.com>, linux-fsdevel@...r.kernel.org
Subject: Re: [PATCH v4 5/7] ext4: Add multi-fsblock atomic write support with bigalloc

On Thu, May 15, 2025 at 11:45:51PM +0530, Ritesh Harjani wrote:
> "Darrick J. Wong" <djwong@...nel.org> writes:
> 
> > On Thu, May 15, 2025 at 08:15:37PM +0530, Ritesh Harjani (IBM) wrote:
> >> EXT4 supports the bigalloc feature, which allows the FS to work in
> >> units of clusters (groups of blocks) rather than individual blocks.
> >> This patch adds atomic write support for bigalloc so that systems
> >> with bs = ps can also create an FS using -
> >>   mkfs.ext4 -F -O bigalloc -b 4096 -C 16384 <dev>
> >>
> >> With bigalloc, ext4 can support multi-fsblock atomic writes. We will
> >> have to adjust ext4's atomic write unit max value to the cluster
> >> size. This can then support atomic writes of any size between
> >> [blocksize, clustersize]. This patch adds the required changes to
> >> enable multi-fsblock atomic write support using bigalloc in the
> >> next patch.
> >>
> >> In this patch, for block allocation, we first query the underlying
> >> region of the requested range via ext4_map_blocks(). Here are the
> >> various cases which we then handle depending upon the underlying
> >> mapping type:
> >> 1. If the underlying region for the entire requested range is a
> >>    mapped extent, then we don't call ext4_map_blocks() to allocate
> >>    anything. We don't even need to start the jbd2 txn in this case.
> >> 2. For an append write case, we create a mapped extent.
> >> 3. If the underlying region is entirely a hole, then we create an
> >>    unwritten extent for the requested range.
> >> 4. If the underlying region is a large unwritten extent, then we
> >>    split the extent into 2 unwritten extents of the required size.
> >> 5. If the underlying region has any type of mixed mapping, then we
> >>    call ext4_map_blocks() in a loop to zero out the unwritten and
> >>    hole regions within the requested range. This then provides a
> >>    single mapped extent type mapping for the requested range.
> >>
> >> Note: We invoke ext4_map_blocks() in a loop with the
> >> EXT4_GET_BLOCKS_ZERO flag only when the underlying extent mapping
> >> of the requested range is not entirely a hole, an unwritten extent,
> >> or a fully mapped extent. That is, if the underlying region contains
> >> a mix of hole(s), unwritten extent(s), and mapped extent(s), we use
> >> this loop to ensure that all the short mappings are zeroed out. This
> >> guarantees that the entire requested range becomes a single,
> >> uniformly mapped extent. It is ok to do so because we know this is
> >> being done on a bigalloc-enabled filesystem where the block bitmap
> >> represents the entire cluster unit.
> >>
> >> Note that having a single contiguous underlying region of type
> >> mapped, unwritten, or hole is not a problem. The reason to avoid
> >> writing on top of a mixed mapping region is that atomic writes
> >> require all or nothing to be written for the userspace pwritev2
> >> request. So if a crash or a sudden poweroff occurs at any point
> >> during the write, the region undergoing the atomic write should
> >> read either complete old data or complete new data, but never a
> >> mix of both.
> >>
> >> So, we first convert any mixed mapping region to a single
> >> contiguous mapped extent before any data gets written to it. This
> >> is because normally the FS will only convert unwritten extents to
> >> written at the end of the write, in the ->end_io() call. If we
> >> allowed writes over a mixed mapping and a sudden power off happened
> >> in between, we would end up reading a mix of new data (over mapped
> >> extents) and old data (over unwritten extents), because the
> >> unwritten-to-written conversion never went through. So to avoid
> >> this, and to avoid writes getting torn due to a mixed mapping, we
> >> first allocate a single contiguous block mapping and then do the
> >> write.
> >>
> >> Co-developed-by: Ojaswin Mujoo <ojaswin@...ux.ibm.com>
> >> Signed-off-by: Ojaswin Mujoo <ojaswin@...ux.ibm.com>
> >> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@...il.com>
> >> ---
> >>  fs/ext4/ext4.h    |   2 +
> >>  fs/ext4/extents.c |  87 ++++++++++++++++++++
> >>  fs/ext4/file.c    |   7 +-
> >>  fs/ext4/inode.c   | 200 +++++++++++++++++++++++++++++++++++++++++++++-
> >>  4 files changed, 291 insertions(+), 5 deletions(-)
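(An aside for anyone following along from userspace: the all-or-nothing
semantics described above are what pwritev2() with RWF_ATOMIC requests.
Here's a minimal sketch, assuming a 6.11+ kernel whose uapi headers
expose RWF_ATOMIC; the path and the 16k size -- chosen to match the
-C 16384 mkfs example above -- are made up, and real code should query
the advertised limits rather than hardcode them.)

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

#ifndef RWF_ATOMIC
#define RWF_ATOMIC 0x00000040	/* from include/uapi/linux/fs.h, for older libc headers */
#endif

int main(void)
{
	/* ext4 atomic writes go through direct I/O, hence O_DIRECT. */
	int fd = open("/mnt/testfile", O_RDWR | O_DIRECT);
	if (fd < 0) { perror("open"); return 1; }

	/* The buffer, offset, and length must be suitably aligned. */
	void *buf;
	if (posix_memalign(&buf, 16384, 16384)) return 1;
	memset(buf, 0xab, 16384);

	struct iovec iov = { .iov_base = buf, .iov_len = 16384 };
	/* RWF_ATOMIC: the 16k lands on disk in full, or not at all. */
	ssize_t ret = pwritev2(fd, &iov, 1, 0, RWF_ATOMIC);
	if (ret < 0)
		perror("pwritev2(RWF_ATOMIC)");
	close(fd);
	return ret == 16384 ? 0 : 1;
}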
> >> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> >> index ef6cac6b4b4c..8eb1f332ee7d 100644
> >> --- a/fs/ext4/ext4.h
> >> +++ b/fs/ext4/ext4.h
> >> @@ -3728,6 +3728,8 @@ extern long ext4_fallocate(struct file *file, int mode, loff_t offset,
> >>  			   loff_t len);
> >>  extern int ext4_convert_unwritten_extents(handle_t *handle, struct inode *inode,
> >>  					  loff_t offset, ssize_t len);
> >> +extern int ext4_convert_unwritten_extents_atomic(handle_t *handle,
> >> +			struct inode *inode, loff_t offset, ssize_t len);
> >>  extern int ext4_convert_unwritten_io_end_vec(handle_t *handle,
> >>  					     ext4_io_end_t *io_end);
> >>  extern int ext4_map_blocks(handle_t *handle, struct inode *inode,
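(Related aside: instead of hardcoding sizes as in the earlier sketch,
userspace can discover the limits this series advertises -- the atomic
write unit max becomes the cluster size -- via statx(). A sketch,
assuming a 6.11+ uapi with STATX_WRITE_ATOMIC and a libc whose
struct statx already carries the atomic-write fields; the path is
hypothetical:)

#define _GNU_SOURCE
#include <fcntl.h>	/* AT_FDCWD */
#include <stdio.h>
#include <sys/stat.h>	/* statx(), glibc >= 2.28 */

#ifndef STATX_WRITE_ATOMIC
#define STATX_WRITE_ATOMIC 0x00010000U	/* from include/uapi/linux/stat.h */
#endif

int main(void)
{
	struct statx stx;

	if (statx(AT_FDCWD, "/mnt/testfile", 0, STATX_WRITE_ATOMIC, &stx)) {
		perror("statx");
		return 1;
	}
	printf("atomic write unit min/max: %u/%u, max segments: %u\n",
	       stx.stx_atomic_write_unit_min,
	       stx.stx_atomic_write_unit_max,
	       stx.stx_atomic_write_segments_max);
	return 0;
}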
> >> diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
> >> index fa850f188d46..2967c74dabaf 100644
> >> --- a/fs/ext4/extents.c
> >> +++ b/fs/ext4/extents.c
> >> @@ -4792,6 +4792,93 @@ long ext4_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
> >>  	return ret;
> >>  }
> >>  
> >> +/*
> >> + * This function converts a range of blocks to written extents. The caller
> >> + * of this function will pass the start offset and the size. All unwritten
> >> + * extents within this range will be converted to written extents.
> >> + *
> >> + * This function is called from the direct IO end io callback function for
> >> + * atomic writes, to convert the unwritten extents after IO is completed.
> >> + *
> >> + * Note that the requirement for atomic writes is that all conversion
> >> + * should happen atomically in a single fs journal transaction. We mainly
> >> + * only allocate unwritten extents either on a hole or on a pre-existing
> >> + * unwritten extent range in ext4_map_blocks_atomic_write(). The only case
> >> + * where we can have multiple unwritten extents in a range
> >> + * [offset, offset+len) is when there is a split unwritten extent between
> >> + * two leaf nodes which was cached in the extent status cache during
> >> + * ext4_iomap_alloc() time. That will allow
> >> + * ext4_map_blocks_atomic_write() to return the unwritten extent range w/o
> >> + * going into the slow path. That means we might need a loop for
> >> + * conversion of this unwritten extent split across leaf blocks within a
> >> + * single journal transaction.
> >> + * Extents split across leaf nodes are a rare case, but let's still handle
> >> + * that to meet the requirements of multi-fsblock atomic writes.
> >> + *
> >> + * Returns 0 on success.
> >> + */
> >> +int ext4_convert_unwritten_extents_atomic(handle_t *handle, struct inode *inode,
> >> +					  loff_t offset, ssize_t len)
> >> +{
> >> +	unsigned int max_blocks;
> >> +	int ret = 0, ret2 = 0, ret3 = 0;
> >> +	struct ext4_map_blocks map;
> >> +	unsigned int blkbits = inode->i_blkbits;
> >> +	unsigned int credits = 0;
> >> +	int flags = EXT4_GET_BLOCKS_IO_CONVERT_EXT;
> >> +
> >> +	map.m_lblk = offset >> blkbits;
> >> +	max_blocks = EXT4_MAX_BLOCKS(len, offset, blkbits);
> >> +
> >> +	if (!handle) {
> >> +		/*
> >> +		 * TODO: An optimization can be added later by having an extent
> >> +		 * status flag e.g. EXTENT_STATUS_SPLIT_LEAF. If we query that
> >> +		 * it can tell if the extent in the cache is a split extent.
> >> +		 * But for now let's assume pextents as 2 always.
> >> +		 */
> >> +		credits = ext4_meta_trans_blocks(inode, max_blocks, 2);
> >> +	}
> >> +
> >> +	if (credits) {
> >> +		handle = ext4_journal_start(inode, EXT4_HT_MAP_BLOCKS, credits);
> >> +		if (IS_ERR(handle)) {
> >> +			ret = PTR_ERR(handle);
> >> +			return ret;
> >> +		}
> >> +	}
> >> +
> >> +	while (ret >= 0 && ret < max_blocks) {
> >> +		map.m_lblk += ret;
> >> +		map.m_len = (max_blocks -= ret);
> >> +		ret = ext4_map_blocks(handle, inode, &map, flags);
> >> +		if (ret != max_blocks)
> >> +			ext4_msg(inode->i_sb, KERN_INFO,
> >> +				 "inode #%lu: block %u: len %u: "
> >> +				 "split block mapping found for atomic write, "
> >> +				 "ret = %d",
> >> +				 inode->i_ino, map.m_lblk,
> >> +				 map.m_len, ret);
> >> +		if (ret <= 0)
> >> +			break;
> >> +	}
> >> +
> >> +	ret2 = ext4_mark_inode_dirty(handle, inode);
> >> +
> >> +	if (credits) {
> >> +		ret3 = ext4_journal_stop(handle);
> >> +		if (unlikely(ret3))
> >> +			ret2 = ret3;
> >> +	}
> >> +
> >> +	if (ret <= 0 || ret2)
> >> +		ext4_warning(inode->i_sb,
> >> +			     "inode #%lu: block %u: len %u: "
> >> +			     "returned %d or %d",
> >> +			     inode->i_ino, map.m_lblk,
> >> +			     map.m_len, ret, ret2);
> >> +
> >> +	return ret > 0 ? ret2 : ret;
> >> +}
> >> +
> >>  /*
> >>   * This function convert a range of blocks to written extents
> >>   * The caller of this function will pass the start offset and the size.
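(Context for the file.c hunk below: ext4's direct I/O completion is
driven by iomap, which calls back into ext4_dio_write_end_io() through
a wiring that looks roughly like the following -- quoted from memory of
fs/ext4/file.c, so treat it as approximate:

static const struct iomap_dio_ops ext4_dio_write_ops = {
	.end_io = ext4_dio_write_end_io,
};

iomap sets IOMAP_DIO_UNWRITTEN in @flags when the write targeted
unwritten extents, which is what the new branch keys off together with
IOCB_ATOMIC.)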
> >> diff --git a/fs/ext4/file.c b/fs/ext4/file.c
> >> index beb078ee4811..959328072c15 100644
> >> --- a/fs/ext4/file.c
> >> +++ b/fs/ext4/file.c
> >> @@ -377,7 +377,12 @@ static int ext4_dio_write_end_io(struct kiocb *iocb, ssize_t size,
> >>  	loff_t pos = iocb->ki_pos;
> >>  	struct inode *inode = file_inode(iocb->ki_filp);
> >>  
> >> -	if (!error && size && flags & IOMAP_DIO_UNWRITTEN)
> >> +
> >> +	if (!error && size && (flags & IOMAP_DIO_UNWRITTEN) &&
> >> +	    (iocb->ki_flags & IOCB_ATOMIC))
> >> +		error = ext4_convert_unwritten_extents_atomic(NULL, inode, pos,
> >> +							      size);
> >> +	else if (!error && size && flags & IOMAP_DIO_UNWRITTEN)
> >>  		error = ext4_convert_unwritten_extents(NULL, inode, pos, size);
> >>  	if (error)
> >>  		return error;
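(Before the allocation changes below, a compact standalone model of the
five commit-message cases may help. This is illustration only -- a
userspace toy, not the kernel code, and every name in it is invented:)

#include <stdbool.h>
#include <stdio.h>

enum blk_state { HOLE, UNWRITTEN, MAPPED };

/* Decide how the atomic-write allocation path treats a range. */
static const char *atomic_alloc_action(const enum blk_state *blk, int n)
{
	bool uniform = true;
	int i;

	for (i = 1; i < n; i++)
		if (blk[i] != blk[0])
			uniform = false;

	if (!uniform)		/* case 5: mixed mapping */
		return "mixed: loop ext4_map_blocks() w/ ZERO to build one mapped extent (slow path)";
	if (blk[0] == MAPPED)	/* case 1: fully mapped */
		return "mapped: no allocation, no jbd2 txn needed";
	if (blk[0] == HOLE)	/* case 3 (case 2, append, also allocates) */
		return "hole: allocate one unwritten extent for the range";
	/* case 4: one large unwritten extent */
	return "unwritten: split out an unwritten extent of the required size";
}

int main(void)
{
	enum blk_state mixed[] = { MAPPED, HOLE, UNWRITTEN, MAPPED };
	enum blk_state hole[]  = { HOLE, HOLE, HOLE, HOLE };

	printf("%s\n", atomic_alloc_action(mixed, 4));
	printf("%s\n", atomic_alloc_action(hole, 4));
	return 0;
}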
> >> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> >> index 8b86b1a29bdc..13bc9f07ae7f 100644
> >> --- a/fs/ext4/inode.c
> >> +++ b/fs/ext4/inode.c
> >> @@ -3412,12 +3412,149 @@ static void ext4_set_iomap(struct inode *inode, struct iomap *iomap,
> >>  	}
> >>  }
> >>  
> >> +static int ext4_map_blocks_atomic_write_slow(handle_t *handle,
> >> +			struct inode *inode, struct ext4_map_blocks *map)
> >> +{
> >> +	ext4_lblk_t m_lblk = map->m_lblk;
> >> +	unsigned int m_len = map->m_len;
> >> +	unsigned int mapped_len = 0, m_flags = 0;
> >> +	ext4_fsblk_t next_pblk;
> >> +	bool check_next_pblk = false;
> >> +	int ret = 0;
> >> +
> >> +	WARN_ON_ONCE(!ext4_has_feature_bigalloc(inode->i_sb));
> >> +
> >> +	/*
> >> +	 * This is a slow path in case of mixed mapping. We use the
> >> +	 * EXT4_GET_BLOCKS_CREATE_ZERO flag here to make sure we get a single
> >> +	 * contiguous mapped mapping. This will ensure any unwritten or hole
> >> +	 * regions within the requested range are zeroed out and we return
> >> +	 * a single contiguous mapped extent.
> >> +	 */
> >> +	m_flags = EXT4_GET_BLOCKS_CREATE_ZERO;
> >> +
> >> +	do {
> >> +		ret = ext4_map_blocks(handle, inode, map, m_flags);
> >> +		if (ret < 0 && ret != -ENOSPC)
> >> +			goto out_err;
> >> +		/*
> >> +		 * This should never happen, but let's return an error code to
> >> +		 * avoid an infinite loop in here.
> >> +		 */
> >> +		if (ret == 0) {
> >> +			ret = -EFSCORRUPTED;
> >> +			ext4_warning_inode(inode,
> >> +				"ext4_map_blocks() couldn't allocate blocks m_flags: 0x%x, ret:%d",
> >> +				m_flags, ret);
> >> +			goto out_err;
> >> +		}
> >> +		/*
> >> +		 * With bigalloc we should never get ENOSPC nor discontiguous
> >> +		 * physical extents.
> >> +		 */
> >> +		if ((check_next_pblk && next_pblk != map->m_pblk) ||
> >> +		    ret == -ENOSPC) {
> >> +			ext4_warning_inode(inode,
> >> +				"Non-contiguous allocation detected: expected %llu, got %llu, "
> >> +				"or ext4_map_blocks() returned out of space ret: %d",
> >> +				next_pblk, map->m_pblk, ret);
> >> +			ret = -EFSCORRUPTED;
> >> +			goto out_err;
> >> +		}
> >> +		next_pblk = map->m_pblk + map->m_len;
> >> +		check_next_pblk = true;
> >> +
> >> +		mapped_len += map->m_len;
> >> +		map->m_lblk += map->m_len;
> >> +		map->m_len = m_len - mapped_len;
> >> +	} while (mapped_len < m_len);
> >> +
> >> +	/*
> >> +	 * We might have done some work in the above loop, so we need to query
> >> +	 * the start of the physical extent, based on the original m_lblk and
> >> +	 * m_len. Let's also ensure we were able to allocate the required
> >> +	 * range for the mixed mapping case.
> >> +	 */
> >> +	map->m_lblk = m_lblk;
> >> +	map->m_len = m_len;
> >> +	map->m_flags = 0;
> >> +
> >> +	ret = ext4_map_blocks(handle, inode, map,
> >> +			      EXT4_GET_BLOCKS_QUERY_LAST_IN_LEAF);
> >> +	if (ret != m_len) {
> >> +		ext4_warning_inode(inode,
> >> +			"allocation failed for atomic write request m_lblk:%u, m_len:%u, ret:%d\n",
> >> +			m_lblk, m_len, ret);
> >> +		ret = -EINVAL;
> > 
> > When does this produce a short mapping?  In theory the cluster's
> > already allocated, right?  So this is (AFAICT) a handler for a "should
> > never happen" corner case, right?
> 
> Yes, you are right. This will ideally never happen.
> 
> >> +	}
> >> +	return ret;
> >> +
> >> +out_err:
> >> +	/* reset map before returning an error */
> >> +	map->m_lblk = m_lblk;
> >> +	map->m_len = m_len;
> >> +	map->m_flags = 0;
> >> +	return ret;
> >> +}
> >> +
> >> +/*
> >> + * ext4_map_blocks_atomic: Helper routine to ensure the entire requested
> >> + * range in @map [lblk, lblk + len) is one single contiguous extent with
> >> + * no mixed mappings.
> >> + *
> >> + * We first use the m_flags passed to us by our caller (ext4_iomap_alloc()).
> >> + * We only call EXT4_GET_BLOCKS_ZERO in the slow path, when the underlying
> >> + * physical extent for the requested range does not have a single
> >> + * contiguous mapping type i.e. (Hole, Mapped, or Unwritten) throughout.
> >> + * In that case we will loop over the requested range to allocate and zero
> >> + * out the unwritten / holes in between, to get a single mapped extent from
> >> + * [m_lblk, m_lblk + m_len). Note that this is only possible because we
> >> + * know this can be called only on a bigalloc-enabled filesystem where the
> >> + * underlying cluster is already allocated. This avoids allocating
> >> + * discontiguous extents in the slow path due to multiple calls to
> >> + * ext4_map_blocks().
> >> + * The slow path is mostly a non-performance-critical path, so it should
> >> + * be ok to loop using ext4_map_blocks() with appropriate flags to
> >> + * allocate & zero the underlying short holes/unwritten extents within the
> >> + * requested range.
> >> + */
> >> +static int ext4_map_blocks_atomic_write(handle_t *handle, struct inode *inode,
> >> +			struct ext4_map_blocks *map, int m_flags,
> >> +			bool *force_commit)
> >> +{
> >> +	ext4_lblk_t m_lblk = map->m_lblk;
> >> +	unsigned int m_len = map->m_len;
> >> +	int ret = 0;
> >> +
> >> +	WARN_ON_ONCE(m_len > 1 && !ext4_has_feature_bigalloc(inode->i_sb));
> >> +
> >> +	ret = ext4_map_blocks(handle, inode, map, m_flags);
> >> +	if (ret < 0 || ret == m_len)
> >> +		goto out;
> >> +	/*
> >> +	 * This is a mixed mapping case where we were not able to allocate
> >> +	 * a single contiguous extent. In that case let's reset the requested
> >> +	 * mapping and call the slow path.
> >> +	 */
> >> +	map->m_lblk = m_lblk;
> >> +	map->m_len = m_len;
> >> +	map->m_flags = 0;
> >> +
> >> +	/*
> >> +	 * The slow path means we have a mixed mapping, which means we will
> >> +	 * need to force a txn commit.
> >> +	 */
> >> +	*force_commit = true;
> >> +	return ext4_map_blocks_atomic_write_slow(handle, inode, map);
> >> +out:
> >> +	return ret;
> >> +}
> >> +
> >>  static int ext4_iomap_alloc(struct inode *inode, struct ext4_map_blocks *map,
> >>  			    unsigned int flags)
> >>  {
> >>  	handle_t *handle;
> >>  	u8 blkbits = inode->i_blkbits;
> >>  	int ret, dio_credits, m_flags = 0, retries = 0;
> >> +	bool force_commit = false;
> >>  
> >>  	/*
> >>  	 * Trim the mapping request to the maximum value that we can map at
> >> @@ -3425,7 +3562,30 @@ static int ext4_iomap_alloc(struct inode *inode, struct ext4_map_blocks *map,
> >>  	 */
> >>  	if (map->m_len > DIO_MAX_BLOCKS)
> >>  		map->m_len = DIO_MAX_BLOCKS;
> >> -	dio_credits = ext4_chunk_trans_blocks(inode, map->m_len);
> >> +
> >> +	/*
> >> +	 * Journal credits estimation for atomic writes. We call
> >> +	 * ext4_map_blocks() to find if there could be a mixed mapping. If
> >> +	 * yes, then let's assume the no. of pextents required can be m_len,
> >> +	 * i.e. every alternate block can be unwritten and hole.
> >> +	 */
> >> +	if (flags & IOMAP_ATOMIC) {
> >> +		unsigned int orig_mlen = map->m_len;
> >> +
> >> +		ret = ext4_map_blocks(NULL, inode, map, 0);
> >> +		if (ret < 0)
> >> +			return ret;
> >> +		if (map->m_len < orig_mlen) {
> >> +			map->m_len = orig_mlen;
> >> +			dio_credits = ext4_meta_trans_blocks(inode, orig_mlen,
> >> +							     map->m_len);
> >> +		} else {
> >> +			dio_credits = ext4_chunk_trans_blocks(inode,
> >> +							      map->m_len);
> >> +		}
> >> +	} else {
> >> +		dio_credits = ext4_chunk_trans_blocks(inode, map->m_len);
> >> +	}
> >>  
> >>  retry:
> >>  	/*
> >> @@ -3456,7 +3616,11 @@ static int ext4_iomap_alloc(struct inode *inode, struct ext4_map_blocks *map,
> >>  	else if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
> >>  		m_flags = EXT4_GET_BLOCKS_IO_CREATE_EXT;
> >>  
> >> -	ret = ext4_map_blocks(handle, inode, map, m_flags);
> >> +	if (flags & IOMAP_ATOMIC)
> >> +		ret = ext4_map_blocks_atomic_write(handle, inode, map, m_flags,
> >> +						   &force_commit);
> >> +	else
> >> +		ret = ext4_map_blocks(handle, inode, map, m_flags);
> >>  
> >>  	/*
> >>  	 * We cannot fill holes in indirect tree based inodes as that could
> >> @@ -3470,6 +3634,14 @@ static int ext4_iomap_alloc(struct inode *inode, struct ext4_map_blocks *map,
> >>  	if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries))
> >>  		goto retry;
> >>  
> >> +	if (ret > 0 && force_commit) {
> >> +		int ret2;
> >> +
> >> +		ret2 = ext4_force_commit(inode->i_sb);
> >> +		if (ret2)
> >> +			ret = ret2;
> > 
> > Nit: This could return ret2 directly instead of assigning it to ret and
> > letting it fall out.
> 
> Ok. Will return ret2 directly here.
> 
> > But my bigger complaint is that you ought to leave a comment here along
> > the lines of:
> 
> Yes, I should have added a comment.
> 
> > /*
> >  * Someone forced us to commit the journal ahead of an
> >  * IO operation so that the ondisk mapping state is
> >  * consistent with the contents of the file data blocks.
> >  * The commit failed, so we abort the whole IO.
> >  */
> > 
> > so it's obvious why we got a mapping but are erroring out anyway.
> 
> /*
>  * Force commit the current transaction if the allocation spans a mixed
>  * mapping range. This ensures any pending metadata updates (like
>  * unwritten to written extents conversion) in this range are in a
>  * consistent state with the file data blocks, before performing the
>  * actual write I/O. If the commit fails, the whole I/O must be aborted
>  * to prevent any possible torn writes.
>  */
> 
> I am thinking I will add the above ^^^ and something similar to the
> section in Documentation where we talk about how mixed mappings are
> handled.
> 
> > If the answers to my questions are all 'yes' and the extra comment
> > gets added, then
> > 
> > Acked-by: "Darrick J. Wong" <djwong@...nel.org>
> > 
> > --D
> 
> Thanks for the review! Since the above answers are all yes, and I will
> add the comment I just mentioned above, I will take your Acked-by too
> and soon send a v5.
> 
> Please let me know otherwise.

That sounds ok to me.  I'm glad this is finally getting done. :)

--D

> Thanks a lot for reviewing!
> 
> -ritesh
> 
> >> +	}
> >> +
> >>  	return ret;
> >>  }
> >>  
> >> @@ -3480,6 +3652,7 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
> >>  	int ret;
> >>  	struct ext4_map_blocks map;
> >>  	u8 blkbits = inode->i_blkbits;
> >> +	unsigned int orig_mlen;
> >>  
> >>  	if ((offset >> blkbits) > EXT4_MAX_LOGICAL_BLOCK)
> >>  		return -EINVAL;
> >> @@ -3493,6 +3666,7 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
> >>  	map.m_lblk = offset >> blkbits;
> >>  	map.m_len = min_t(loff_t, (offset + length - 1) >> blkbits,
> >>  			  EXT4_MAX_LOGICAL_BLOCK) - map.m_lblk + 1;
> >> +	orig_mlen = map.m_len;
> >>  
> >>  	if (flags & IOMAP_WRITE) {
> >>  		/*
> >> @@ -3503,8 +3677,16 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
> >>  		 */
> >>  		if (offset + length <= i_size_read(inode)) {
> >>  			ret = ext4_map_blocks(NULL, inode, &map, 0);
> >> -			if (ret > 0 && (map.m_flags & EXT4_MAP_MAPPED))
> >> -				goto out;
> >> +			/*
> >> +			 * For atomic writes the entire requested length
> >> +			 * should be mapped.
> >> +			 */
> >> +			if (map.m_flags & EXT4_MAP_MAPPED) {
> >> +				if ((!(flags & IOMAP_ATOMIC) && ret > 0) ||
> >> +				    (flags & IOMAP_ATOMIC && ret >= orig_mlen))
> >> +					goto out;
> >> +			}
> >> +			map.m_len = orig_mlen;
> >>  		}
> >>  		ret = ext4_iomap_alloc(inode, &map, flags);
> >>  	} else {
> >> @@ -3525,6 +3707,16 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
> >>  	 */
> >>  	map.m_len = fscrypt_limit_io_blocks(inode, map.m_lblk, map.m_len);
> >>  
> >> +	/*
> >> +	 * Before returning to iomap, let's ensure the allocated mapping
> >> +	 * covers the entire requested length for atomic writes.
> >> +	 */
> >> +	if (flags & IOMAP_ATOMIC) {
> >> +		if (map.m_len < (length >> blkbits)) {
> >> +			WARN_ON(1);
> >> +			return -EINVAL;
> >> +		}
> >> +	}
> >>  	ext4_set_iomap(inode, iomap, &map, offset, length, flags);
> >>  
> >>  	return 0;
> >> -- 
> >> 2.49.0
> >> 
> >> 