[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <aNWRjekDBHRelmbS@bfoster>
Date: Thu, 25 Sep 2025 15:01:33 -0400
From: Brian Foster <bfoster@...hat.com>
To: alexjlzheng@...il.com
Cc: brauner@...nel.org, djwong@...nel.org, hch@...radead.org,
kernel@...kajraghav.com, linux-xfs@...r.kernel.org,
linux-fsdevel@...r.kernel.org, linux-kernel@...r.kernel.org,
yi.zhang@...wei.com, Jinliang Zheng <alexjlzheng@...cent.com>
Subject: Re: [PATCH v5 4/4] iomap: don't abandon the whole copy when we have
iomap_folio_state
On Tue, Sep 23, 2025 at 12:21:58PM +0800, alexjlzheng@...il.com wrote:
> From: Jinliang Zheng <alexjlzheng@...cent.com>
>
> Currently, if a partial write occurs in a buffer write, the entire write will
> be discarded. While this is an uncommon case, it's still a bit wasteful and
> we can do better.
>
> With iomap_folio_state, we can identify uptodate states at the block
> level, and a read_folio reading can correctly handle partially
> uptodate folios.
>
> Therefore, when a partial write occurs, accept the block-aligned
> partial write instead of rejecting the entire write.
>
> For example, suppose a folio is 2MB, blocksize is 4kB, and the copied
> bytes are 2MB-3kB.
>
> Without this patchset, we'd need to recopy from the beginning of the
> folio in the next iteration, which means 2MB-3kB of bytes is copy
> duplicately.
>
> |<-------------------- 2MB -------------------->|
> +-------+-------+-------+-------+-------+-------+
> | block | ... | block | block | ... | block | folio
> +-------+-------+-------+-------+-------+-------+
> |<-4kB->|
>
> |<--------------- copied 2MB-3kB --------->| first time copied
> |<-------- 1MB -------->| next time we need copy (chunk /= 2)
> |<-------- 1MB -------->| next next time we need copy.
>
> |<------ 2MB-3kB bytes duplicate copy ---->|
>
> With this patchset, we can accept 2MB-4kB of bytes, which is block-aligned.
> This means we only need to process the remaining 4kB in the next iteration,
> which means there's only 1kB we need to copy duplicately.
>
> |<-------------------- 2MB -------------------->|
> +-------+-------+-------+-------+-------+-------+
> | block | ... | block | block | ... | block | folio
> +-------+-------+-------+-------+-------+-------+
> |<-4kB->|
>
> |<--------------- copied 2MB-3kB --------->| first time copied
> |<-4kB->| next time we need copy
>
> |<>|
> only 1kB bytes duplicate copy
>
> Although partial writes are inherently a relatively unusual situation and do
> not account for a large proportion of performance testing, the optimization
> here still makes sense in large-scale data centers.
>
Thanks for the nice writeup and diagrams.
> Signed-off-by: Jinliang Zheng <alexjlzheng@...cent.com>
> ---
> fs/iomap/buffered-io.c | 44 +++++++++++++++++++++++++++++++++---------
> 1 file changed, 35 insertions(+), 9 deletions(-)
>
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index 6e516c7d9f04..3304028ce64f 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -873,6 +873,25 @@ static int iomap_write_begin(struct iomap_iter *iter,
> return status;
> }
>
> +static int iomap_trim_tail_partial(struct inode *inode, loff_t pos,
> + size_t copied, struct folio *folio)
> +{
> + struct iomap_folio_state *ifs = folio->private;
> + unsigned block_size, last_blk, last_blk_bytes;
> +
> + if (!ifs || !copied)
> + return 0;
> +
> + block_size = 1 << inode->i_blkbits;
I'd move this assignment to declaration time.
> + last_blk = offset_in_folio(folio, pos + copied - 1) >> inode->i_blkbits;
> + last_blk_bytes = (pos + copied) & (block_size - 1);
> +
> + if (!ifs_block_is_uptodate(ifs, last_blk))
> + copied -= min(copied, last_blk_bytes);
So I think I follow the idea here and it seems reasonable at first
glance. IIUC, the high level issue is that for certain writes we don't
read blocks up front if the write is expected to fully overwrite
blocks/folios, as we can just mark things uptodate on write completion.
If the write is short however, we now have a partial write to a
!uptodate block, so have to toss the write.
A few initial thoughts..
1. I don't really love the function name here. Maybe something like
iomap_write_end_short() or something would be more clear, but maybe
there are other opinions.
2. It might be helpful to move some of the comment below up to around
here where we actually trim the copied value.
3. I see that in __iomap_write_begin() we don't necessarily always
attach ifs if a write is expected to fully overwrite the entire folio.
It looks like that is handled with the !ifs check above, but it also
makes me wonder how effective this change is.
For example, the example in the commit log description appears to be a
short write of an attempted overwrite of a 2MB folio, right? Would we
expect to have ifs in that situation?
I don't really object to having the logic even if it is a real corner
case, but it would be good to have some test coverage to verify
behavior. Do you have a test case or anything (even if contrived) along
those lines? Perhaps we could play some games with badly formed
syscalls. A quick test to call pwritev() with a bad iov_base pointer
seems to produce a short write, but I haven't confirmed that's
sufficient for testing here..
Brian
> +
> + return copied;
> +}
> +
> static int __iomap_write_end(struct inode *inode, loff_t pos, size_t len,
> size_t copied, struct folio *folio)
> {
> @@ -881,17 +900,24 @@ static int __iomap_write_end(struct inode *inode, loff_t pos, size_t len,
> /*
> * The blocks that were entirely written will now be uptodate, so we
> * don't have to worry about a read_folio reading them and overwriting a
> - * partial write. However, if we've encountered a short write and only
> - * partially written into a block, it will not be marked uptodate, so a
> - * read_folio might come in and destroy our partial write.
> + * partial write.
> *
> - * Do the simplest thing and just treat any short write to a
> - * non-uptodate page as a zero-length write, and force the caller to
> - * redo the whole thing.
> + * However, if we've encountered a short write and only partially
> + * written into a block, we must discard the short-written _tail_ block
> + * and not mark it uptodate in the ifs, to ensure a read_folio reading
> + * can handle it correctly via iomap_adjust_read_range(). It's safe to
> + * keep the non-tail block writes because we know that for a non-tail
> + * block:
> + * - is either fully written, since copy_from_user() is sequential
> + * - or is a partially written head block that has already been read in
> + * and marked uptodate in the ifs by iomap_write_begin().
> */
> - if (unlikely(copied < len && !folio_test_uptodate(folio)))
> - return 0;
> - iomap_set_range_uptodate(folio, offset_in_folio(folio, pos), len);
> + if (unlikely(copied < len && !folio_test_uptodate(folio))) {
> + copied = iomap_trim_tail_partial(inode, pos, copied, folio);
> + if (!copied)
> + return 0;
> + }
> + iomap_set_range_uptodate(folio, offset_in_folio(folio, pos), copied);
> iomap_set_range_dirty(folio, offset_in_folio(folio, pos), copied);
> filemap_dirty_folio(inode->i_mapping, folio);
> return copied;
> --
> 2.49.0
>
>
Powered by blists - more mailing lists