Message-ID: <Z-KWsWHOGJnq8pUp@li-dc0c254c-257c-11b2-a85c-98b6c1322444.ibm.com>
Date: Tue, 25 Mar 2025 17:12:41 +0530
From: Ojaswin Mujoo <ojaswin@...ux.ibm.com>
To: "Ritesh Harjani (IBM)" <ritesh.list@...il.com>
Cc: linux-ext4@...r.kernel.org, linux-fsdevel@...r.kernel.org,
John Garry <john.g.garry@...cle.com>, djwong@...nel.org,
linux-xfs@...r.kernel.org, "Theodore Ts'o" <tytso@....edu>
Subject: Re: [RFCv1 1/1] ext4: Add multi-fsblock atomic write support with
bigalloc
On Sun, Mar 23, 2025 at 12:32:18PM +0530, Ritesh Harjani (IBM) wrote:
> EXT4 supports bigalloc feature which allows the FS to work in size of
> clusters (group of blocks) rather than individual blocks. This patch
> adds atomic write support for bigalloc so that systems with bs = ps can
> also create FS using -
> mkfs.ext4 -F -O bigalloc -b 4096 -C 16384 <dev>
>
> With bigalloc ext4 can support multi-fsblock atomic writes. We will have to
> adjust ext4's atomic write unit max value to cluster size. This can then support
> atomic write of size anywhere between [blocksize, clustersize].
>
> We first query the underlying region of the requested range by calling
> ext4_map_blocks() call. Here are the various cases which we then handle
> for block allocation depending upon the underlying mapping type:
> 1. If the underlying region for the entire requested range is a mapped extent,
> then we don't call ext4_map_blocks() to allocate anything. We don't need to
> even start the jbd2 txn in this case.
> 2. For an append write case, we create a mapped extent.
> 3. If the underlying region is entirely a hole, then we create an unwritten
> extent for the requested range.
> 4. If the underlying region is a large unwritten extent, then we split the
>    extent into 2 unwritten extents of the required size.
> 5. If the underlying region has any type of mixed mapping, then we call
> ext4_map_blocks() in a loop to zero out the unwritten and the hole regions
>    within the requested range. This then provides a single mapped extent type
> mapping for the requested range.
>
> Note: We invoke ext4_map_blocks() in a loop with the EXT4_GET_BLOCKS_ZERO
> flag only when the underlying extent mapping of the requested range is
> not entirely a hole, an unwritten extent, or a fully mapped extent. That
> is, if the underlying region contains a mix of hole(s), unwritten
> extent(s), and mapped extent(s), we use this loop to ensure that all the
> short mappings are zeroed out. This guarantees that the entire requested
> range becomes a single, uniformly mapped extent. It is ok to do so
> because we know this is being done on a bigalloc enabled filesystem
> where the block bitmap represents the entire cluster unit.
Hi Ritesh, thanks for the patch. The approach looks good to me, just
adding a few comments below.
>
> Cc: Ojaswin Mujoo <ojaswin@...ux.ibm.com>
> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@...il.com>
> ---
> fs/ext4/inode.c | 90 +++++++++++++++++++++++++++++++++++++++++++++++--
> fs/ext4/super.c | 8 +++--
> 2 files changed, 93 insertions(+), 5 deletions(-)
>
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index d04d8a7f12e7..0096a597ad04 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -3332,6 +3332,67 @@ static void ext4_set_iomap(struct inode *inode, struct iomap *iomap,
> iomap->addr = IOMAP_NULL_ADDR;
> }
> }
> +/*
> + * ext4_map_blocks_atomic: Helper routine to ensure the entire requested mapping
> + * [map.m_lblk, map.m_len] is one single contiguous extent with no mixed
> + * mappings. This function is only called when bigalloc is enabled, so we
> + * know that the allocated physical extent start is always aligned properly.
> + *
> + * We call EXT4_GET_BLOCKS_ZERO only when the underlying physical extent for the
> + * requested range does not have a single mapping type (Hole, Mapped, or
> + * Unwritten) throughout. In that case we will loop over the requested range to
> + * allocate and zero out the unwritten / holes in between, to get a single
> + * mapped extent from [m_lblk, m_len]. This is mostly a non-performance
> + * critical path, so it should be ok to loop using ext4_map_blocks() with
> + * appropriate flags to allocate & zero the underlying short holes/unwritten
> + * extents within the requested range.
> + */
> +static int ext4_map_blocks_atomic(handle_t *handle, struct inode *inode,
> + struct ext4_map_blocks *map)
> +{
> + ext4_lblk_t m_lblk = map->m_lblk;
> + unsigned int m_len = map->m_len;
> + unsigned int mapped_len = 0, flags = 0;
> + u8 blkbits = inode->i_blkbits;
> + int ret;
> +
> + WARN_ON(!ext4_has_feature_bigalloc(inode->i_sb));
> +
> + ret = ext4_map_blocks(handle, inode, map, 0);
> + if (((loff_t)map->m_lblk << blkbits) >= i_size_read(inode))
> + flags = EXT4_GET_BLOCKS_CREATE;
> + else if ((ret == 0 && map->m_len >= m_len) ||
> + (ret >= m_len && map->m_flags & EXT4_MAP_UNWRITTEN))
> + flags = EXT4_GET_BLOCKS_IO_CREATE_EXT;
> + else
> + flags = EXT4_GET_BLOCKS_CREATE_ZERO;
> +
> + do {
> + ret = ext4_map_blocks(handle, inode, map, flags);
With the multiple calls to ext4_map_blocks() for converting the extents, I
don't think the transaction reservation would be enough anymore, since in
the worst case we could be converting at least (max atomic write size /
blocksize) extents. We need to account for that as well.
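
One (completely untested) way to handle this could be to start a fresh
handle per iteration so that each conversion gets its own credit
reservation, instead of relying on the single reservation taken in
ext4_iomap_alloc(). A rough sketch, reusing the loop body from this
patch:

	int credits;

	do {
		/* hypothetical: per-iteration transaction */
		credits = ext4_chunk_trans_blocks(inode, map->m_len);
		handle = ext4_journal_start(inode, EXT4_HT_MAP_BLOCKS,
					    credits);
		if (IS_ERR(handle))
			return PTR_ERR(handle);

		ret = ext4_map_blocks(handle, inode, map, flags);
		ext4_journal_stop(handle);
		if (ret < 0)
			return ret;

		mapped_len += map->m_len;
		map->m_lblk += map->m_len;
		map->m_len = m_len - mapped_len;
	} while (mapped_len < m_len);

(That would mean dropping the handle parameter and managing the
transaction inside this helper, so take it with a grain of salt.)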
> + if (ret < 0)
> + return ret;
> + mapped_len += map->m_len;
> + map->m_lblk += map->m_len;
> + map->m_len = m_len - mapped_len;
> + } while (mapped_len < m_len);
> +
> + map->m_lblk = m_lblk;
> + map->m_len = m_len;
> +
> + /*
> + * We might have done some work in above loop. Let's ensure we query the
> + * start of the physical extent, based on the origin m_lblk and m_len
> + * and also ensure we were able to allocate the required range for doing
> + * atomic write.
> + */
> + ret = ext4_map_blocks(handle, inode, map, 0);
Here, we are calling ext4_map_blocks() 3 times unnecessarily even if a
single complete mapping is found. I think a better approach would be to
just go for the map_blocks call and then decide if we want to split.
Also, factor out a function to do the zero out. So, something like:
	if (((loff_t)map->m_lblk << blkbits) >= i_size_read(inode))
		flags = EXT4_GET_BLOCKS_CREATE;
	else
		flags = EXT4_GET_BLOCKS_IO_CREATE_EXT;

	ret = ext4_map_blocks(handle, inode, map, flags);
	if (map->m_len < m_len) {
		map->m_len = m_len;
		/* do the zero out */
		ext4_zero_mixed_mappings(handle, inode, map);
		ext4_map_blocks(handle, inode, map, 0);
		WARN_ON(!(map->m_flags & EXT4_MAP_MAPPED) ||
			map->m_len < m_len);
	}
I think this covers the 5 cases you mentioned in the commit message, if
I'm not missing anything. Also, this way we avoid the duplication for
non zero-out cases, and the zero-out function can then be reused in case
we want to do the same for forcealign atomic writes in the future.
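
In case it helps, here's a rough sketch of what the factored-out helper
could look like (untested, just the zero-out loop from this patch moved
out; the ext4_zero_mixed_mappings name is only a placeholder):

static int ext4_zero_mixed_mappings(handle_t *handle, struct inode *inode,
				    struct ext4_map_blocks *map)
{
	ext4_lblk_t m_lblk = map->m_lblk;
	unsigned int m_len = map->m_len;
	unsigned int mapped_len = 0;
	int ret;

	do {
		/* allocate and zero out holes / unwritten extents */
		ret = ext4_map_blocks(handle, inode, map,
				      EXT4_GET_BLOCKS_CREATE_ZERO);
		if (ret < 0)
			return ret;
		mapped_len += map->m_len;
		map->m_lblk += map->m_len;
		map->m_len = m_len - mapped_len;
	} while (mapped_len < m_len);

	map->m_lblk = m_lblk;
	map->m_len = m_len;
	return 0;
}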
Regards,
ojaswin
> + if (ret != m_len) {
> + ext4_warning_inode(inode, "allocation failed for atomic write request pos:%u, len:%u\n",
> + m_lblk, m_len);
> + return -EINVAL;
> + }
> + return mapped_len;
> +}
>
> static int ext4_iomap_alloc(struct inode *inode, struct ext4_map_blocks *map,
> unsigned int flags)
> @@ -3377,7 +3438,10 @@ static int ext4_iomap_alloc(struct inode *inode, struct ext4_map_blocks *map,
> else if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
> m_flags = EXT4_GET_BLOCKS_IO_CREATE_EXT;
>
> - ret = ext4_map_blocks(handle, inode, map, m_flags);
> + if (flags & IOMAP_ATOMIC && ext4_has_feature_bigalloc(inode->i_sb))
> + ret = ext4_map_blocks_atomic(handle, inode, map);
> + else
> + ret = ext4_map_blocks(handle, inode, map, m_flags);
>
> /*
> * We cannot fill holes in indirect tree based inodes as that could
> @@ -3401,6 +3465,7 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
> int ret;
> struct ext4_map_blocks map;
> u8 blkbits = inode->i_blkbits;
> + unsigned int m_len_orig;
>
> if ((offset >> blkbits) > EXT4_MAX_LOGICAL_BLOCK)
> return -EINVAL;
> @@ -3414,6 +3479,7 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
> map.m_lblk = offset >> blkbits;
> map.m_len = min_t(loff_t, (offset + length - 1) >> blkbits,
> EXT4_MAX_LOGICAL_BLOCK) - map.m_lblk + 1;
> + m_len_orig = map.m_len;
>
> if (flags & IOMAP_WRITE) {
> /*
> @@ -3424,8 +3490,16 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
> */
> if (offset + length <= i_size_read(inode)) {
> ret = ext4_map_blocks(NULL, inode, &map, 0);
> - if (ret > 0 && (map.m_flags & EXT4_MAP_MAPPED))
> - goto out;
> + /*
> + * For atomic writes the entire requested length should
> + * be mapped.
> + */
> + if (map.m_flags & EXT4_MAP_MAPPED) {
> + if ((!(flags & IOMAP_ATOMIC) && ret > 0) ||
> + (flags & IOMAP_ATOMIC && ret >= m_len_orig))
> + goto out;
> + }
> + map.m_len = m_len_orig;
> }
> ret = ext4_iomap_alloc(inode, &map, flags);
> } else {
> @@ -3442,6 +3516,16 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
> */
> map.m_len = fscrypt_limit_io_blocks(inode, map.m_lblk, map.m_len);
>
> + /*
> + * Before returning to iomap, let's ensure the allocated mapping
> + * covers the entire requested length for atomic writes.
> + */
> + if (flags & IOMAP_ATOMIC) {
> + if (map.m_len < (length >> blkbits)) {
> + WARN_ON(1);
> + return -EINVAL;
> + }
> + }
> ext4_set_iomap(inode, iomap, &map, offset, length, flags);
>
> return 0;
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index a50e5c31b937..cbb24d535d59 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -4442,12 +4442,13 @@ static int ext4_handle_clustersize(struct super_block *sb)
> /*
> * ext4_atomic_write_init: Initializes filesystem min & max atomic write units.
> * @sb: super block
> - * TODO: Later add support for bigalloc
> */
> static void ext4_atomic_write_init(struct super_block *sb)
> {
> struct ext4_sb_info *sbi = EXT4_SB(sb);
> struct block_device *bdev = sb->s_bdev;
> + unsigned int blkbits = sb->s_blocksize_bits;
> + unsigned int clustersize = sb->s_blocksize;
>
> if (!bdev_can_atomic_write(bdev))
> return;
> @@ -4455,9 +4456,12 @@ static void ext4_atomic_write_init(struct super_block *sb)
> if (!ext4_has_feature_extents(sb))
> return;
>
> + if (ext4_has_feature_bigalloc(sb))
> + clustersize = 1U << (sbi->s_cluster_bits + blkbits);
> +
> sbi->s_awu_min = max(sb->s_blocksize,
> bdev_atomic_write_unit_min_bytes(bdev));
> - sbi->s_awu_max = min(sb->s_blocksize,
> + sbi->s_awu_max = min(clustersize,
> bdev_atomic_write_unit_max_bytes(bdev));
> if (sbi->s_awu_min && sbi->s_awu_max &&
> sbi->s_awu_min <= sbi->s_awu_max) {
> --
> 2.48.1
>