linux-ext4 - Re: [RFCv3 4/7] ext2: Implement seq counter for validating cached iomap

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <20240512132003.b6o4ghqk5yjsvkpy@quack3>
Date: Sun, 12 May 2024 15:20:03 +0200
From: Jan Kara <jack@...e.cz>
To: "Ritesh Harjani (IBM)" <ritesh.list@...il.com>
Cc: linux-ext4@...r.kernel.org, linux-xfs@...r.kernel.org,
	linux-fsdevel@...r.kernel.org, Matthew Wilcox <willy@...radead.org>,
	"Darrick J . Wong" <djwong@...nel.org>,
	Ojaswin Mujoo <ojaswin@...ux.ibm.com>, Jan Kara <jack@...e.cz>
Subject: Re: [RFCv3 4/7] ext2: Implement seq counter for validating cached
 iomap

On Thu 25-04-24 18:58:48, Ritesh Harjani (IBM) wrote:
> There is a possibility of following race with iomap during
> writebck -
> 
> write_cache_pages()
>   cache extent covering 0..1MB range
>   write page at offset 0k
> 					truncate(file, 4k)
> 					  drops all relevant pages
> 					  frees fs blocks
> 					pwrite(file, 4k, 4k)
> 					  creates dirty page in the page cache
>   writes page at offset 4k to a stale block
> 
> This race can happen because iomap_writepages() keeps a cached extent mapping
> within struct iomap. While write_cache_pages() is going over each folio,
> (can cache a large extent range), if a truncate happens in parallel on the
> next folio followed by a buffered write to the same offset within the file,
> this can change logical to physical offset of the cached iomap mapping.
> That means, the cached iomap has now become stale.
> 
> This patch implements the seq counter approach for revalidation of stale
> iomap mappings. i_blkseq will get incremented for every block
> allocation/free. Here is what we do -
> 
> For ext2 buffered-writes, the block allocation happens at the
> ->write_iter time itself. So at writeback time,
> 1. We first cache the i_blkseq.
> 2. Call ext2_get_blocks(, create = 0) to get the no. of blocks
>    already allocated.
> 3. Call ext2_get_blocks() the second time with length to be same as
>    the no. of blocks we know were already allocated.
> 4. Till now it means, the cached i_blkseq remains valid as no block
>    allocation has happened yet.
> This means the next call to ->map_blocks(), we can verify whether the
> i_blkseq has raced with truncate or not. If not, then i_blkseq will
> remain valid.
> 
> In case of a hole (could happen with mmaped writes), we only allocate
> 1 block at a time anyways. So even if the i_blkseq value changes right
> after, we anyway need to allocate the next block in subsequent
> ->map_blocks() call.
> 
> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@...il.com>

A few small comments below.

> @@ -698,6 +699,11 @@ static inline struct ext2_inode_info *EXT2_I(struct inode *inode)
>  	return container_of(inode, struct ext2_inode_info, vfs_inode);
>  }
>  
> +static inline void ext2_inc_i_blkseq(struct ext2_inode_info *ei)
> +{
> +	WRITE_ONCE(ei->i_blkseq, READ_ONCE(ei->i_blkseq) + 1);
> +}
> +

Please add a comment here (and assertion as well) that updates of i_blkseq
are protected by ei->i_truncate_mutex. Reads can race at any moment to
that's the reason why WRITE_ONCE() is used. You can remove READ_ONCE() here
as it is pointless (we are locked here).

>  static int ext2_write_map_blocks(struct iomap_writepage_ctx *wpc,
>  				 struct inode *inode, loff_t offset,
>  				 unsigned len)
>  {
> -	if (offset >= wpc->iomap.offset &&
> -	    offset < wpc->iomap.offset + wpc->iomap.length)
> +	loff_t maxblocks = (loff_t)INT_MAX;
> +	u8 blkbits = inode->i_blkbits;
> +	u32 bno;
> +	bool new, boundary;
> +	int ret;
> +
> +	if (ext2_imap_valid(wpc, inode, offset))
>  		return 0;
>  
> -	return ext2_iomap_begin(inode, offset, inode->i_sb->s_blocksize,
> +	/*
> +	 * For ext2 buffered-writes, the block allocation happens at the
> +	 * ->write_iter time itself. So at writeback time -
> +	 * 1. We first cache the i_blkseq.
> +	 * 2. Call ext2_get_blocks(, create = 0) to get the no. of blocks
> +	 *    already allocated.
> +	 * 3. Call ext2_get_blocks() the second time with length to be same as
> +	 *    the no. of blocks we know were already allocated.
> +	 * 4. Till now it means, the cached i_blkseq remains valid as no block
> +	 *    allocation has happened yet.
> +	 * This means the next call to ->map_blocks(), we can verify whether the
> +	 * i_blkseq has raced with truncate or not. If not, then i_blkseq will
> +	 * remain valid.
> +	 *
> +	 * In case of a hole (could happen with mmaped writes), we only allocate
> +	 * 1 block at a time anyways. So even if the i_blkseq value changes, we
> +	 * anyway need to allocate the next block in subsequent ->map_blocks()
> +	 * call.
> +	 */

I suspect it would be tidier to move this logic into ext2_get_blocks()
itself but I guess it is ok as is for now.

> +	wpc->iomap.validity_cookie = READ_ONCE(EXT2_I(inode)->i_blkseq);
> +
> +	ret = ext2_get_blocks(inode, offset >> blkbits, maxblocks << blkbits,
> +			      &bno, &new, &boundary, 0);
> +	if (ret < 0)
> +		return ret;
> +	/*
> +	 * ret can be 0 in case of a hole which is possible for mmaped writes.
> +	 */
> +	ret = ret ? ret : 1;
> +	return ext2_iomap_begin(inode, offset, (loff_t)ret << blkbits,
>  				IOMAP_WRITE, &wpc->iomap, NULL);
>  }
>  
> diff --git a/fs/ext2/super.c b/fs/ext2/super.c
> index 37f7ce56adce..32f5386284d6 100644
> --- a/fs/ext2/super.c
> +++ b/fs/ext2/super.c
> @@ -188,7 +188,7 @@ static struct inode *ext2_alloc_inode(struct super_block *sb)
>  #ifdef CONFIG_QUOTA
>  	memset(&ei->i_dquot, 0, sizeof(ei->i_dquot));
>  #endif
> -
> +	WRITE_ONCE(ei->i_blkseq, 0);
>  	return &ei->vfs_inode;

No need for write once here. This cannot race with anything...

									Honza
-- 
Jan Kara <jack@...e.com>
SUSE Labs, CR