Message-ID: <87ldqzypx0.fsf@gmail.com>
Date: Thu, 15 May 2025 00:34:43 +0530
From: Ritesh Harjani (IBM) <ritesh.list@...il.com>
To: "Darrick J. Wong" <djwong@...nel.org>
Cc: linux-ext4@...r.kernel.org, Theodore Ts'o <tytso@....edu>, Jan Kara <jack@...e.cz>, John Garry <john.g.garry@...cle.com>, Ojaswin Mujoo <ojaswin@...ux.ibm.com>, linux-fsdevel@...r.kernel.org
Subject: Re: [PATCH v3 5/7] ext4: Add multi-fsblock atomic write support with bigalloc

"Darrick J. Wong" <djwong@...nel.org> writes:

> On Fri, May 09, 2025 at 02:20:35AM +0530, Ritesh Harjani (IBM) wrote:
>> EXT4 supports the bigalloc feature, which allows the FS to work in units
>> of clusters (groups of blocks) rather than individual blocks. This patch
>> adds atomic write support for bigalloc so that systems with bs = ps can
>> also create an FS using -
>>     mkfs.ext4 -F -O bigalloc -b 4096 -C 16384 <dev>
>> 
>> With bigalloc, ext4 can support multi-fsblock atomic writes. We will have
>> to adjust ext4's atomic write unit max value to the cluster size. This can
>> then support atomic writes of any size between [blocksize, clustersize].
>> This patch adds the changes required for multi-fsblock atomic write
>> support using bigalloc; the feature is then enabled in the next patch.
>> 
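Context for the [blocksize, clustersize] range above: userspace discovers
the advertised atomic write limits via statx(2) with STATX_WRITE_ATOMIC.
A minimal sketch, assuming a kernel and headers new enough to provide
STATX_WRITE_ATOMIC and the stx_atomic_write_unit_* fields; with this
series applied to the mkfs example above, awu_max would be the 16K
cluster size:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/stat.h>   /* statx(), struct statx, STATX_WRITE_ATOMIC */

    int main(int argc, char **argv)
    {
            struct statx stx;
            int fd;

            if (argc < 2 || (fd = open(argv[1], O_RDONLY)) < 0)
                    return 1;
            /* ask the FS for its advertised atomic write limits */
            if (statx(fd, "", AT_EMPTY_PATH, STATX_WRITE_ATOMIC, &stx))
                    return 1;
            printf("awu_min: %u awu_max: %u\n",
                   stx.stx_atomic_write_unit_min,
                   stx.stx_atomic_write_unit_max);
            return 0;
    }
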
>> In this patch, for block allocation we first query the underlying region
>> of the requested range by calling ext4_map_blocks(). Here are the various
>> cases we then handle, depending upon the underlying mapping type:
>> 1. If the underlying region for the entire requested range is a mapped extent,
>>    then we don't call ext4_map_blocks() to allocate anything. We don't even
>>    need to start the jbd2 txn in this case.
>> 2. For an append write case, we create a mapped extent.
>> 3. If the underlying region is entirely a hole, then we create an unwritten
>>    extent for the requested range.
>> 4. If the underlying region is a large unwritten extent, then we split the
>>    extent into two unwritten extents of the required sizes.
>> 5. If the underlying region has any type of mixed mapping, then we call
>>    ext4_map_blocks() in a loop to zero out the unwritten and hole regions
>>    within the requested range. This then provides a single mapped extent
>>    covering the entire requested range.
>> 
>> Note: We invoke ext4_map_blocks() in a loop with the EXT4_GET_BLOCKS_ZERO
>> flag only when the underlying extent mapping of the requested range is
>> not entirely a hole, an unwritten extent, or a fully mapped extent. That
>> is, if the underlying region contains a mix of hole(s), unwritten
>> extent(s), and mapped extent(s), we use this loop to ensure that all the
>> short mappings are zeroed out. This guarantees that the entire requested
>> range becomes a single, uniformly mapped extent. It is OK to do so
>> because we know this is being done on a bigalloc-enabled filesystem,
>> where the block bitmap represents the entire cluster unit.
>> 
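To make the note above concrete: assuming bs=4K and C=16K (4 blocks per
cluster) as in the mkfs example, a 16K atomic write over a mixed range
such as

    lblk:    0         1            2         3
    before:  [mapped]  [unwritten]  [hole]    [mapped]

is first flattened by the zeroing loop into

    after:   [mapped]  [mapped]     [mapped]  [mapped]

by allocating and zeroing out blocks 1 and 2, so that the whole range
reads back as one uniformly mapped extent before any user data is
written to it.
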
>> Note that having a single contiguous underlying region of type mapped,
>> unwritten, or hole is not a problem. The reason to avoid writing on top
>> of a mixed mapping region is that atomic writes require that all or
>> nothing of the userspace pwritev2 request gets written. So if a crash
>> or a sudden poweroff occurs at any point during the write, the region
>> undergoing the atomic write should read back as either complete old
>> data or complete new data, but never a mix of both.
>> So, we first convert any mixed mapping region to a single contiguous
>> mapped extent before any data gets written to it. Normally the FS will
>> only convert unwritten extents to written at the end of the write, in
>> the ->end_io() call. If we allowed writes over a mixed mapping and a
>> sudden power off happened in between, we would end up reading a mix of
>> new data (over the mapped extents) and old data (over the unwritten
>> extents), because the unwritten-to-written conversion never went
>> through.
>> So to avoid this, and to avoid writes getting torn due to mixed
>> mappings, we first allocate a single contiguous block mapping and then
>> do the write.
>> 
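The userspace side of this all-or-nothing contract is pwritev2(2) with
RWF_ATOMIC. A minimal sketch, assuming a kernel that supports RWF_ATOMIC
on ext4, an fd opened with O_DIRECT, and a write size that is a power of
two within the advertised [awu_min, awu_max] range and naturally aligned
at the target offset:

    #define _GNU_SOURCE
    #include <stdlib.h>
    #include <string.h>
    #include <sys/types.h>
    #include <sys/uio.h>

    #ifndef RWF_ATOMIC              /* provided by recent kernel headers */
    #define RWF_ATOMIC 0x00000040
    #endif

    /* Atomically write 16K (one cluster in the mkfs example above). */
    static int atomic_write_16k(int fd, off_t offset)
    {
            size_t len = 16384;
            struct iovec iov;
            void *buf;

            /* O_DIRECT wants an aligned buffer; align to the write size */
            if (posix_memalign(&buf, len, len))
                    return -1;
            memset(buf, 0xab, len);

            iov.iov_base = buf;
            iov.iov_len = len;
            /* all-or-nothing: either all 16K reach disk, or none do */
            ssize_t ret = pwritev2(fd, &iov, 1, offset, RWF_ATOMIC);

            free(buf);
            return ret == (ssize_t)len ? 0 : -1;
    }
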
>> Co-developed-by: Ojaswin Mujoo <ojaswin@...ux.ibm.com>
>> Signed-off-by: Ojaswin Mujoo <ojaswin@...ux.ibm.com>
>> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@...il.com>
>
> Looks fine (I don't like the pre-zeroing but options are limited on
> ext4) except for one thing...
>
>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
>> index 8b86b1a29bdc..2642e1ef128f 100644
>> --- a/fs/ext4/inode.c
>> +++ b/fs/ext4/inode.c
>> @@ -3412,6 +3412,136 @@ static void ext4_set_iomap(struct inode *inode, struct iomap *iomap,
>>  	}
>>  }
>>  
>> +static int ext4_map_blocks_atomic_write_slow(handle_t *handle,
>> +			struct inode *inode, struct ext4_map_blocks *map)
>> +{
>> +	ext4_lblk_t m_lblk = map->m_lblk;
>> +	unsigned int m_len = map->m_len;
>> +	unsigned int mapped_len = 0, m_flags = 0;
>> +	ext4_fsblk_t next_pblk;
>> +	bool check_next_pblk = false;
>> +	int ret = 0;
>> +
>> +	WARN_ON_ONCE(!ext4_has_feature_bigalloc(inode->i_sb));
>> +
>> +	/*
>> +	 * This is a slow path in case of mixed mapping. We use
>> +	 * EXT4_GET_BLOCKS_CREATE_ZERO flag here to make sure we get a single
>> +	 * contiguous mapped mapping. This will ensure any unwritten or hole
>> +	 * regions within the requested range is zeroed out and we return
>> +	 * a single contiguous mapped extent.
>> +	 */
>> +	m_flags = EXT4_GET_BLOCKS_CREATE_ZERO;
>> +
>> +	do {
>> +		ret = ext4_map_blocks(handle, inode, map, m_flags);
>> +		if (ret < 0 && ret != -ENOSPC)
>> +			goto out_err;
>> +		/*
>> +		 * This should never happen, but let's return an error code to
>> +		 * avoid an infinite loop in here.
>> +		 */
>> +		if (ret == 0) {
>> +			ret = -EFSCORRUPTED;
>> +			ext4_warning_inode(inode,
>> +				"ext4_map_blocks() couldn't allocate blocks m_flags: 0x%x, ret:%d",
>> +				m_flags, ret);
>> +			goto out_err;
>> +		}
>> +		/*
>> +		 * With bigalloc we should never get ENOSPC nor discontiguous
>> +		 * physical extents.
>> +		 */
>> +		if ((check_next_pblk && next_pblk != map->m_pblk) ||
>> +				ret == -ENOSPC) {
>> +			ext4_warning_inode(inode,
>> +				"Non-contiguous allocation detected: expected %llu, got %llu, "
>> +				"or ext4_map_blocks() returned out of space ret: %d",
>> +				next_pblk, map->m_pblk, ret);
>> +			ret = -ENOSPC;
>> +			goto out_err;
>
> If you get physically discontiguous mappings within a cluster, the
> extent tree is corrupt.
>

Yes, I guess I was just being hesitant to do that. But you are right,
we should return -EFSCORRUPTED here then.

I will change the error code along with the other forcecommit change in v4.
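I.e., roughly this for v4 (sketch only, not the posted patch):

    -			ret = -ENOSPC;
    +			ret = -EFSCORRUPTED;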

> --D
>

Thanks for the review!

-ritesh


>> +		next_pblk = map->m_pblk + map->m_len;
>> +		check_next_pblk = true;
>> +
>> +		mapped_len += map->m_len;
>> +		map->m_lblk += map->m_len;
>> +		map->m_len = m_len - mapped_len;
>> +	} while (mapped_len < m_len);
>> +
>> +	/*
>> +	 * We might have done some work in above loop, so we need to query the
>> +	 * start of the physical extent, based on the origin m_lblk and m_len.
>> +	 * Let's also ensure we were able to allocate the required range for
>> +	 * mixed mapping case.
>> +	 */
>> +	map->m_lblk = m_lblk;
>> +	map->m_len = m_len;
>> +	map->m_flags = 0;
>> +
>> +	ret = ext4_map_blocks(handle, inode, map,
>> +			      EXT4_GET_BLOCKS_QUERY_LAST_IN_LEAF);
>> +	if (ret != m_len) {
>> +		ext4_warning_inode(inode,
>> +			"allocation failed for atomic write request m_lblk:%u, m_len:%u, ret:%d\n",
>> +			m_lblk, m_len, ret);
>> +		ret = -EINVAL;
>> +	}
>> +	return ret;
>> +
>> +out_err:
>> +	/* reset map before returning an error */
>> +	map->m_lblk = m_lblk;
>> +	map->m_len = m_len;
>> +	map->m_flags = 0;
>> +	return ret;
>> +}
>> +
>> +/*
>> + * ext4_map_blocks_atomic: Helper routine to ensure the entire requested
>> + * range in @map [lblk, lblk + len) is one single contiguous extent with no
>> + * mixed mappings.
>> + *
>> + * We first use m_flags passed to us by our caller (ext4_iomap_alloc()).
>> + * We only call EXT4_GET_BLOCKS_ZERO in the slow path, when the underlying
>> + * physical extent for the requested range does not have a single contiguous
>> + * mapping type i.e. (Hole, Mapped, or Unwritten) throughout.
>> + * In that case we will loop over the requested range to allocate and zero out
>> + * the unwritten / holes in between, to get a single mapped extent from
>> + * [m_lblk, m_lblk +  m_len). Note that this is only possible because we know
>> + * this can be called only with bigalloc enabled filesystem where the underlying
>> + * cluster is already allocated. This avoids allocating discontiguous extents
>> + * in the slow path due to multiple calls to ext4_map_blocks().
>> + * The slow path is mostly non-performance critical path, so it should be ok to
>> + * loop using ext4_map_blocks() with appropriate flags to allocate & zero the
>> + * underlying short holes/unwritten extents within the requested range.
>> + */
>> +static int ext4_map_blocks_atomic_write(handle_t *handle, struct inode *inode,
>> +				struct ext4_map_blocks *map, int m_flags)
>> +{
>> +	ext4_lblk_t m_lblk = map->m_lblk;
>> +	unsigned int m_len = map->m_len;
>> +	int ret = 0;
>> +
>> +	WARN_ON_ONCE(m_len > 1 && !ext4_has_feature_bigalloc(inode->i_sb));
>> +
>> +	ret = ext4_map_blocks(handle, inode, map, m_flags);
>> +	if (ret < 0 || ret == m_len)
>> +		goto out;
>> +	/*
>> +	 * This is a mixed mapping case where we were not able to allocate
>> +	 * a single contiguous extent. In that case let's reset requested
>> +	 * mapping and call the slow path.
>> +	 */
>> +	map->m_lblk = m_lblk;
>> +	map->m_len = m_len;
>> +	map->m_flags = 0;
>> +
>> +	return ext4_map_blocks_atomic_write_slow(handle, inode, map);
>> +out:
>> +	return ret;
>> +}
>> +
>>  static int ext4_iomap_alloc(struct inode *inode, struct ext4_map_blocks *map,
>>  			    unsigned int flags)
>>  {
>> @@ -3425,7 +3555,30 @@ static int ext4_iomap_alloc(struct inode *inode, struct ext4_map_blocks *map,
>>  	 */
>>  	if (map->m_len > DIO_MAX_BLOCKS)
>>  		map->m_len = DIO_MAX_BLOCKS;
>> -	dio_credits = ext4_chunk_trans_blocks(inode, map->m_len);
>> +
>> +	/*
>> +	 * journal credits estimation for atomic writes. We call
>> +	 * ext4_map_blocks(), to find if there could be a mixed mapping. If yes,
>> +	 * then let's assume the no. of pextents required can be m_len i.e.
>> +	 * every alternate block can be unwritten and hole.
>> +	 */
>> +	if (flags & IOMAP_ATOMIC) {
>> +		unsigned int orig_mlen = map->m_len;
>> +
>> +		ret = ext4_map_blocks(NULL, inode, map, 0);
>> +		if (ret < 0)
>> +			return ret;
>> +		if (map->m_len < orig_mlen) {
>> +			map->m_len = orig_mlen;
>> +			dio_credits = ext4_meta_trans_blocks(inode, orig_mlen,
>> +							     map->m_len);
>> +		} else {
>> +			dio_credits = ext4_chunk_trans_blocks(inode,
>> +							      map->m_len);
>> +		}
>> +	} else {
>> +		dio_credits = ext4_chunk_trans_blocks(inode, map->m_len);
>> +	}
>>  
>>  retry:
>>  	/*
>> @@ -3456,7 +3609,10 @@ static int ext4_iomap_alloc(struct inode *inode, struct ext4_map_blocks *map,
>>  	else if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
>>  		m_flags = EXT4_GET_BLOCKS_IO_CREATE_EXT;
>>  
>> -	ret = ext4_map_blocks(handle, inode, map, m_flags);
>> +	if (flags & IOMAP_ATOMIC)
>> +		ret = ext4_map_blocks_atomic_write(handle, inode, map, m_flags);
>> +	else
>> +		ret = ext4_map_blocks(handle, inode, map, m_flags);
>>  
>>  	/*
>>  	 * We cannot fill holes in indirect tree based inodes as that could
>> @@ -3480,6 +3636,7 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
>>  	int ret;
>>  	struct ext4_map_blocks map;
>>  	u8 blkbits = inode->i_blkbits;
>> +	unsigned int orig_mlen;
>>  
>>  	if ((offset >> blkbits) > EXT4_MAX_LOGICAL_BLOCK)
>>  		return -EINVAL;
>> @@ -3493,6 +3650,7 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
>>  	map.m_lblk = offset >> blkbits;
>>  	map.m_len = min_t(loff_t, (offset + length - 1) >> blkbits,
>>  			  EXT4_MAX_LOGICAL_BLOCK) - map.m_lblk + 1;
>> +	orig_mlen = map.m_len;
>>  
>>  	if (flags & IOMAP_WRITE) {
>>  		/*
>> @@ -3503,8 +3661,16 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
>>  		 */
>>  		if (offset + length <= i_size_read(inode)) {
>>  			ret = ext4_map_blocks(NULL, inode, &map, 0);
>> -			if (ret > 0 && (map.m_flags & EXT4_MAP_MAPPED))
>> -				goto out;
>> +			/*
>> +			 * For atomic writes the entire requested length should
>> +			 * be mapped.
>> +			 */
>> +			if (map.m_flags & EXT4_MAP_MAPPED) {
>> +				if ((!(flags & IOMAP_ATOMIC) && ret > 0) ||
>> +				   (flags & IOMAP_ATOMIC && ret >= orig_mlen))
>> +					goto out;
>> +			}
>> +			map.m_len = orig_mlen;
>>  		}
>>  		ret = ext4_iomap_alloc(inode, &map, flags);
>>  	} else {
>> @@ -3525,6 +3691,16 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
>>  	 */
>>  	map.m_len = fscrypt_limit_io_blocks(inode, map.m_lblk, map.m_len);
>>  
>> +	/*
>> +	 * Before returning to iomap, let's ensure the allocated mapping
>> +	 * covers the entire requested length for atomic writes.
>> +	 */
>> +	if (flags & IOMAP_ATOMIC) {
>> +		if (map.m_len < (length >> blkbits)) {
>> +			WARN_ON(1);
>> +			return -EINVAL;
>> +		}
>> +	}
>>  	ext4_set_iomap(inode, iomap, &map, offset, length, flags);
>>  
>>  	return 0;
>> -- 
>> 2.49.0
>> 
>> 
