linux-ext4 - Re: [RFC 2/3] ext2: Convert ext2 regular file buffered I/O to use iomap

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <ZWpntZXm32kyfX7M@dread.disaster.area>
Date: Sat, 2 Dec 2023 10:09:41 +1100
From: Dave Chinner <david@...morbit.com>
To: Ritesh Harjani <ritesh.list@...il.com>
Cc: Christoph Hellwig <hch@...radead.org>, Jan Kara <jack@...e.cz>,
	linux-ext4@...r.kernel.org, linux-fsdevel@...r.kernel.org
Subject: Re: [RFC 2/3] ext2: Convert ext2 regular file buffered I/O to use
 iomap

On Thu, Nov 30, 2023 at 08:54:31AM +0530, Ritesh Harjani wrote:
> Ritesh Harjani (IBM) <ritesh.list@...il.com> writes:
> 
> > Christoph Hellwig <hch@...radead.org> writes:
> >
> >> On Wed, Nov 22, 2023 at 01:29:46PM +0100, Jan Kara wrote:
> >>> writeback bit set. XFS plays the revalidation sequence counter games
> >>> because of this so we'd have to do something similar for ext2. Not that I'd
> >>> care as much about ext2 writeback performance but it should not be that
> >>> hard and we'll definitely need some similar solution for ext4 anyway. Can
> >>> you give that a try (as a followup "performance improvement" patch).
> >>
> >> Darrick has mentioned that he is looking into lifting more of the
> >> validation sequence counter validation into iomap.
> >>
> >> In the meantime I have a series here that at least maps multiple blocks
> >> inside a folio in a single go, which might be worth trying with ext2 as
> >> well:
> >>
> >> http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/iomap-map-multiple-blocks
> >
> > Sure, thanks for providing details. I will check this.
> >
> 
> So I took a look at this. Here is what I think -
> So this is useful of-course when we have a large folio. Because
> otherwise it's just one block at a time for each folio. This is not a
> problem once FS buffered-io handling code moves to iomap (because we
> can then enable large folio support to it).
> 
> However, this would still require us to pass a folio to ->map_blocks
> call to determine the size of the folio (which I am not saying can't be
> done but just stating my observations here).
> 
> Now coming to implementing validation seq check. I am hoping, it should
> be straight forward at least for ext2, because it mostly just have to
> protect against truncate operation (which can change the block mapping)...
> 
> ...ok so here is the PoC for seq counter check for ext2. Next let me
> try to see if we can lift this up from the FS side to iomap - 
> 
> 
> From 86b7bdf79107c3d0a4f37583121d7c136db1bc5c Mon Sep 17 00:00:00 2001
> From: "Ritesh Harjani (IBM)" <ritesh.list@...il.com>
> Subject: [PATCH] ext2: Implement seq check for mapping multiblocks in ->map_blocks
> 
> This implements inode block seq counter (ib_seq) which helps in validating 
> whether the cached wpc (struct iomap_writepage_ctx) still has the valid 
> entries or not. Block mapping can get changed say due to a racing truncate. 
> 
> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@...il.com>
> ---
>  fs/ext2/balloc.c |  1 +
>  fs/ext2/ext2.h   |  6 ++++++
>  fs/ext2/inode.c  | 51 +++++++++++++++++++++++++++++++++++++++++++-----
>  fs/ext2/super.c  |  2 +-
>  4 files changed, 54 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/ext2/balloc.c b/fs/ext2/balloc.c
> index e124f3d709b2..e97040194da4 100644
> --- a/fs/ext2/balloc.c
> +++ b/fs/ext2/balloc.c
> @@ -495,6 +495,7 @@ void ext2_free_blocks(struct inode * inode, ext2_fsblk_t block,
>  	}
>  
>  	ext2_debug ("freeing block(s) %lu-%lu\n", block, block + count - 1);
> +	ext2_inc_ib_seq(EXT2_I(inode));
>  
>  do_more:
>  	overflow = 0;
> diff --git a/fs/ext2/ext2.h b/fs/ext2/ext2.h
> index 677a9ad45dcb..882c14d20183 100644
> --- a/fs/ext2/ext2.h
> +++ b/fs/ext2/ext2.h
> @@ -663,6 +663,7 @@ struct ext2_inode_info {
>  	struct rw_semaphore xattr_sem;
>  #endif
>  	rwlock_t i_meta_lock;
> +	unsigned int ib_seq;
>  
>  	/*
>  	 * truncate_mutex is for serialising ext2_truncate() against
> @@ -698,6 +699,11 @@ static inline struct ext2_inode_info *EXT2_I(struct inode *inode)
>  	return container_of(inode, struct ext2_inode_info, vfs_inode);
>  }
>  
> +static inline void ext2_inc_ib_seq(struct ext2_inode_info *ei)
> +{
> +	WRITE_ONCE(ei->ib_seq, READ_ONCE(ei->ib_seq) + 1);
> +}
> +
>  /* balloc.c */
>  extern int ext2_bg_has_super(struct super_block *sb, int group);
>  extern unsigned long ext2_bg_num_gdb(struct super_block *sb, int group);
> diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
> index f4e0b9a93095..a5490d641c26 100644
> --- a/fs/ext2/inode.c
> +++ b/fs/ext2/inode.c
> @@ -406,6 +406,8 @@ static int ext2_alloc_blocks(struct inode *inode,
>  	ext2_fsblk_t current_block = 0;
>  	int ret = 0;
>  
> +	ext2_inc_ib_seq(EXT2_I(inode));
> +
>  	/*
>  	 * Here we try to allocate the requested multiple blocks at once,
>  	 * on a best-effort basis.
> @@ -966,14 +968,53 @@ ext2_writepages(struct address_space *mapping, struct writeback_control *wbc)
>  	return mpage_writepages(mapping, wbc, ext2_get_block);
>  }
>  
> +struct ext2_writepage_ctx {
> +	struct iomap_writepage_ctx ctx;
> +	unsigned int		ib_seq;
> +};

Why copy this structure from XFS? The iomap held in ctx->iomap now
has a 'u64 validity_cookie;' which is expressly intended to be used
for holding this information.

And then this becomes:

> +static inline
> +struct ext2_writepage_ctx *EXT2_WPC(struct iomap_writepage_ctx *ctx)
> +{
> +	return container_of(ctx, struct ext2_writepage_ctx, ctx);
> +}
> +
> +static bool ext2_imap_valid(struct iomap_writepage_ctx *wpc, struct inode *inode,
> +			    loff_t offset)
> +{
> +	if (offset < wpc->iomap.offset ||
> +	    offset >= wpc->iomap.offset + wpc->iomap.length)
> +		return false;
> +
> +	if (EXT2_WPC(wpc)->ib_seq != READ_ONCE(EXT2_I(inode)->ib_seq))
> +		return false;

	if (wpc->iomap->validity_cookie != EXT2_I(inode)->ib_seq)
		return false;

> +
> +	return true;
> +}
> +
>  static int ext2_write_map_blocks(struct iomap_writepage_ctx *wpc,
>  				 struct inode *inode, loff_t offset)
>  {
> -	if (offset >= wpc->iomap.offset &&
> -	    offset < wpc->iomap.offset + wpc->iomap.length)
> +	loff_t maxblocks = (loff_t)INT_MAX;
> +	u8 blkbits = inode->i_blkbits;
> +	u32 bno;
> +	bool new, boundary;
> +	int ret;
> +
> +	if (ext2_imap_valid(wpc, inode, offset))
>  		return 0;
>  
> -	return ext2_iomap_begin(inode, offset, inode->i_sb->s_blocksize,
> +	EXT2_WPC(wpc)->ib_seq = READ_ONCE(EXT2_I(inode)->ib_seq);
> +
> +	ret = ext2_get_blocks(inode, offset >> blkbits, maxblocks << blkbits,
> +			      &bno, &new, &boundary, 0);

This is incorrectly ordered. ext2_get_blocks() bumps ib_seq when it
does allocation, so the newly stored EXT2_WPC(wpc)->ib_seq is
immediately staled and the very next call to ext2_imap_valid() will
fail, causing a new iomap to be mapped even when not necessary.

Indeed, we also don't hold the ei->truncate_mutex here, so the
sampling of ib_seq could race with other allocation or truncation
calls on this inode. That's a landmine that will eventually bite
hard, because the sequence number returned with the iomap does not
reflect the state of the extent tree when the iomap is created.

If you look at the XFS code, we set iomap->validity_cookie whenever
we call xfs_bmbt_to_iomap() to format the found/allocated extent
into an iomap to return to the caller. The sequence number is passed
into that function alongside the extent map by the caller, because
when we format the extent into an iomap the caller has already dropped the
allocation lock. We must sample the sequence number after the allocation
but before we drop the allocation lock so that the sequence number
-exactly- matches the extent tree and the extent map we are going to
format into the iomap, otherwise the extent we map into the iomap
may already be stale and the validity cookie won't tell us that.

i.e. the cohrenecy relationship between the validity cookie and the
iomap must be set up from a synchronised state.  If the sequence
number is allowed to change between mapping the extent and sampling
the sequence number, then the extent we map and cache may already be
invalid and this introduces the possibility that the validity cookie
won't always catch that...

> +	if (ret < 0)
> +		return ret;
> +	/*
> +	 * ret can be 0 in case of a hole which is possible for mmaped writes.
> +	 */
> +	ret = ret ? ret : 1;
> +	return ext2_iomap_begin(inode, offset, (loff_t)ret << blkbits,
>  				IOMAP_WRITE, &wpc->iomap, NULL);

So calling ext2_iomap_begin() here might have to change. Indeed,
given that we just allocated the blocks and we know where they are
located, calling ext2_iomap_begin()->ext2_get_blocks() to look them
up again just to format them into the iomap seems .... convoluted.

It would be better to factor ext2_iomap_begin() into a call to
ext2_get_blocks() and then a "ext2_blocks_to_iomap()" function to
format the returned blocks into the iomap that is returned. Then
ext2_write_map_blocks() can just call ext2_blocks_to_iomap() here
and it skips the unnecessary block mapping lookup....

It also means that the ib_seq returned by the allocation can be fed
directly into the iomap->validity_cookie...

-Dave.
-- 
Dave Chinner
david@...morbit.com