[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <20240512132003.b6o4ghqk5yjsvkpy@quack3>
Date: Sun, 12 May 2024 15:20:03 +0200
From: Jan Kara <jack@...e.cz>
To: "Ritesh Harjani (IBM)" <ritesh.list@...il.com>
Cc: linux-ext4@...r.kernel.org, linux-xfs@...r.kernel.org,
linux-fsdevel@...r.kernel.org, Matthew Wilcox <willy@...radead.org>,
"Darrick J . Wong" <djwong@...nel.org>,
Ojaswin Mujoo <ojaswin@...ux.ibm.com>, Jan Kara <jack@...e.cz>
Subject: Re: [RFCv3 4/7] ext2: Implement seq counter for validating cached
iomap
On Thu 25-04-24 18:58:48, Ritesh Harjani (IBM) wrote:
> There is a possibility of following race with iomap during
> writebck -
>
> write_cache_pages()
> cache extent covering 0..1MB range
> write page at offset 0k
> truncate(file, 4k)
> drops all relevant pages
> frees fs blocks
> pwrite(file, 4k, 4k)
> creates dirty page in the page cache
> writes page at offset 4k to a stale block
>
> This race can happen because iomap_writepages() keeps a cached extent mapping
> within struct iomap. While write_cache_pages() is going over each folio,
> (can cache a large extent range), if a truncate happens in parallel on the
> next folio followed by a buffered write to the same offset within the file,
> this can change logical to physical offset of the cached iomap mapping.
> That means, the cached iomap has now become stale.
>
> This patch implements the seq counter approach for revalidation of stale
> iomap mappings. i_blkseq will get incremented for every block
> allocation/free. Here is what we do -
>
> For ext2 buffered-writes, the block allocation happens at the
> ->write_iter time itself. So at writeback time,
> 1. We first cache the i_blkseq.
> 2. Call ext2_get_blocks(, create = 0) to get the no. of blocks
> already allocated.
> 3. Call ext2_get_blocks() the second time with length to be same as
> the no. of blocks we know were already allocated.
> 4. Till now it means, the cached i_blkseq remains valid as no block
> allocation has happened yet.
> This means the next call to ->map_blocks(), we can verify whether the
> i_blkseq has raced with truncate or not. If not, then i_blkseq will
> remain valid.
>
> In case of a hole (could happen with mmaped writes), we only allocate
> 1 block at a time anyways. So even if the i_blkseq value changes right
> after, we anyway need to allocate the next block in subsequent
> ->map_blocks() call.
>
> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@...il.com>
A few small comments below.
> @@ -698,6 +699,11 @@ static inline struct ext2_inode_info *EXT2_I(struct inode *inode)
> return container_of(inode, struct ext2_inode_info, vfs_inode);
> }
>
> +static inline void ext2_inc_i_blkseq(struct ext2_inode_info *ei)
> +{
> + WRITE_ONCE(ei->i_blkseq, READ_ONCE(ei->i_blkseq) + 1);
> +}
> +
Please add a comment here (and assertion as well) that updates of i_blkseq
are protected by ei->i_truncate_mutex. Reads can race at any moment to
that's the reason why WRITE_ONCE() is used. You can remove READ_ONCE() here
as it is pointless (we are locked here).
> static int ext2_write_map_blocks(struct iomap_writepage_ctx *wpc,
> struct inode *inode, loff_t offset,
> unsigned len)
> {
> - if (offset >= wpc->iomap.offset &&
> - offset < wpc->iomap.offset + wpc->iomap.length)
> + loff_t maxblocks = (loff_t)INT_MAX;
> + u8 blkbits = inode->i_blkbits;
> + u32 bno;
> + bool new, boundary;
> + int ret;
> +
> + if (ext2_imap_valid(wpc, inode, offset))
> return 0;
>
> - return ext2_iomap_begin(inode, offset, inode->i_sb->s_blocksize,
> + /*
> + * For ext2 buffered-writes, the block allocation happens at the
> + * ->write_iter time itself. So at writeback time -
> + * 1. We first cache the i_blkseq.
> + * 2. Call ext2_get_blocks(, create = 0) to get the no. of blocks
> + * already allocated.
> + * 3. Call ext2_get_blocks() the second time with length to be same as
> + * the no. of blocks we know were already allocated.
> + * 4. Till now it means, the cached i_blkseq remains valid as no block
> + * allocation has happened yet.
> + * This means the next call to ->map_blocks(), we can verify whether the
> + * i_blkseq has raced with truncate or not. If not, then i_blkseq will
> + * remain valid.
> + *
> + * In case of a hole (could happen with mmaped writes), we only allocate
> + * 1 block at a time anyways. So even if the i_blkseq value changes, we
> + * anyway need to allocate the next block in subsequent ->map_blocks()
> + * call.
> + */
I suspect it would be tidier to move this logic into ext2_get_blocks()
itself but I guess it is ok as is for now.
> + wpc->iomap.validity_cookie = READ_ONCE(EXT2_I(inode)->i_blkseq);
> +
> + ret = ext2_get_blocks(inode, offset >> blkbits, maxblocks << blkbits,
> + &bno, &new, &boundary, 0);
> + if (ret < 0)
> + return ret;
> + /*
> + * ret can be 0 in case of a hole which is possible for mmaped writes.
> + */
> + ret = ret ? ret : 1;
> + return ext2_iomap_begin(inode, offset, (loff_t)ret << blkbits,
> IOMAP_WRITE, &wpc->iomap, NULL);
> }
>
> diff --git a/fs/ext2/super.c b/fs/ext2/super.c
> index 37f7ce56adce..32f5386284d6 100644
> --- a/fs/ext2/super.c
> +++ b/fs/ext2/super.c
> @@ -188,7 +188,7 @@ static struct inode *ext2_alloc_inode(struct super_block *sb)
> #ifdef CONFIG_QUOTA
> memset(&ei->i_dquot, 0, sizeof(ei->i_dquot));
> #endif
> -
> + WRITE_ONCE(ei->i_blkseq, 0);
> return &ei->vfs_inode;
No need for write once here. This cannot race with anything...
Honza
--
Jan Kara <jack@...e.com>
SUSE Labs, CR
Powered by blists - more mailing lists