linux-kernel - Re: [PATCH] fuse: clarify extending writes handling

Message-ID: <20250821062535.1498-1-luochunsheng@ustc.edu>
Date: Thu, 21 Aug 2025 14:25:35 +0800
From: Chunsheng Luo <luochunsheng@...c.edu>
To: djwong@...nel.org
Cc: linux-fsdevel@...r.kernel.org,
	linux-kernel@...r.kernel.org,
	luochunsheng@...c.edu,
	miklos@...redi.hu
Subject: Re: [PATCH] fuse: clarify extending writes handling

On Wed, 20 Aug 2025 09:27:24 Darrick J. Wong wrote:

> On Wed, Aug 20, 2025 at 08:52:35AM +0200, Miklos Szeredi wrote:
> > On Wed, 20 Aug 2025 at 07:20, Darrick J. Wong <djwong@...nel.org> wrote:
> > 
> > > I don't understand the current behavior at all -- why do the callers of
> > > fuse_writeback_range pass an @end parameter when it ignores @end in
> > > favor of LLONG_MAX?  And why is it necessary to flush to EOF at all?
> > > fallocate and copy_file_range both take i_rwsem, so what could they be
> > > racing with?  Or am I missing something here?
> > 
> > commit 59bda8ecee2f ("fuse: flush extending writes")
> > 
> > The issue AFAICS is that if writes beyond the range end are not
> > flushed, then EOF on backing file could be below range end (if pending
> > writes create a hole), hence copy_file_range() will stop copying at
> > the start of that hole.
> > 
> > So this patch is incorrect, since not flushing copy_file_range input
> > file could result in a short copy.
> 

Thanks to Miklos for the review and explanation.

> <nod> As far as Mr. Luo's patch is concerned, I agree that a strict "no
> behavior changes" patch should have changed the inode_in writeback_range
> call to:
> 
> 	err = fuse_writeback_range(inode_in, pos_in, LLONG_MAX);
> 
> Though if all callsites are going to pass LLONG_MAX in as @end, then
> why not eliminate the parameter entirely?
> 

Thanks for your reply.

OK, understood. Until we fully understand why the flush has to extend to the
end of the file, let's first make sure the patch does not change behavior.

Rather than removing the end parameter from fuse_writeback_range() and
hard-coding LLONG_MAX inside the function, I suggest keeping the parameter,
passing LLONG_MAX at the call sites, and adding a comment that explains why.
That keeps the flushed range visible at each call site, and we cannot rule out
future callers that need to pass the real end of the range.

> What I'm (still) wondering is why was it necessary to flush the source
> and destination ranges between (pos + len - 1) and LLONG_MAX?  But let's
> see, what did 59bda8ecee2f have to say?
> 
> | fuse: flush extending writes
> |
> | Callers of fuse_writeback_range() assume that the file is ready for
> | modification by the server in the supplied byte range after the call
> | returns.
> 
> Ok, so far so good.
> 
> | If there's a write that extends the file beyond the end of the supplied
> | range, then the file needs to be extended to at least the end of the range,
> | but currently that's not done.
> |
> | There are at least two cases where this can cause problems:
> |
> |  - copy_file_range() will return short count if the file is not extended
> |    up to end of the source range.
> 
> That suggests to me
> 
> filemap_write_and_wait_range(inode_in, pos_in, pos_in + pos_len - 1)
> 
> but I don't see why we need to flush more bytes than that?  The server's
> CFR implementation has all the bytes it needs to read the source data.
> 
> Hum.  But what if CFR is actually reflink?  I guess you'd want to
> buffer-copy the unaligned head and tail regions, and reflink the
> allocation units in the middle, but I still don't see why the fuse
> server needs more of the source file than (pos, pos + len - 1)?
> 
> |  - FALLOC_FL_ZERO_RANGE | FALLOC_FL_KEEP_SIZE will not extend the file,
> |    hence the region may not be fully allocated.
> 
> Hrm, ZERO | KEEP_SIZE is supposed to allow preallocation of blocks
> beyond EOF, or at least that's what XFS does:
> 
> $ truncate -s 10m /mnt/test
> $ xfs_io -c 'fzero -k 100m 64k' /mnt/test
> $ filefrag -v /mnt/test
> Filesystem type is: 58465342
> File size of /mnt/test is 10485760 (2560 blocks of 4096 bytes)
>  ext:     logical_offset:        physical_offset: length:   expected: flags:
>    0:    25600..   25615:         24..        39:     16:      25600: last,unwritten,eof
> /mnt/test: 1 extent found
> 
> as does ext4:
> 
> $ truncate -s 10m /mnt/test
> $ xfs_io -c 'fzero -k 100m 64k' /mnt/test
> $ filefrag -v /mnt/test
> Filesystem type is: ef53
> File size of /mnt/test is 10485760 (2560 blocks of 4096 bytes)
>  ext:     logical_offset:        physical_offset: length:   expected: flags:
>    0:    25600..   25615:      33808..     33823:     16:      25600: last,unwritten,eof
> /mnt/test: 1 extent found
> 
> (Notice that the 10M file has one extent starting at 100M)
> 
> I can see why you'd want to flush the target range in case the fuse
> server has a better trick up its sleeve to zero the already-written
> region that isn't the punch-and-realloc behavior that xfs and ext4 have.
> But here too I don't see why the fuse server would need more than the
> target region.
> 
> Though I think for both cases we end up flushing more than the target
> region, because the page cache rounds start down and end up to PAGE_SIZE
> boundaries.
> 
> | Fix by flushing writes from the start of the range up to the end of the
> | file.  This could be optimized if the writes are non-extending, etc, but
> | it's probably not worth the trouble.
> 
> <shrug> Was there a bug report associated with this commit?  I couldn't
> find any hits on the subject line in lore.  Was this simply a big
> hammer that solved whatever corruption problems were occurring?  Or
> something found in code inspection?
> 
> <confused>
> 
> --D
> 
> > Thanks,
> > Miklos
> > 

Regarding Miklos's explanation:

"The issue AFAICS is that if writes beyond the range end are not flushed,
then EOF on backing file could be below range end (if pending writes create a hole),
hence copy_file_range() will stop copying at the start of that hole."

I looked at the man page and the code.

1. The copy_file_range(2) man page says:

"If fd_in is a sparse file, then copy_file_range() may expand any holes existing 
in the requested range. Users may benefit from calling copy_file_range() in a loop, 
and using the lseek(2) SEEK_DATA and SEEK_HOLE operations to find the locations of
data segments."

"If fd_in is a sparse file" clearly refers to the source file containing holes.
In that case copy_file_range() may expand holes in the source (ranges that read
as zeros but have no blocks allocated) into zero bytes that are actually written
to the destination and occupy disk space, so the destination loses its sparseness.
This applies to holes that lie within the requested copy range of fd_in.
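The man page's advice quoted above (call copy_file_range() in a loop and use
lseek(2) SEEK_DATA / SEEK_HOLE to locate data segments) can be sketched in
userspace roughly as follows. This is only a minimal illustration, not code from
the patch or from any FUSE server: sparse_copy() and demo_sparse_copy() are
hypothetical names, the /tmp paths are assumptions, and it assumes Linux with a
glibc that provides the copy_file_range() wrapper (2.27 or later).

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: copy fd_in to fd_out one data segment at a time,
 * skipping holes so the destination stays sparse.
 * Returns 0 on success, -1 on error (errno is set by the failing call).
 */
static int sparse_copy(int fd_in, int fd_out)
{
    struct stat st;
    off_t pos = 0;

    if (fstat(fd_in, &st) < 0)
        return -1;

    while (pos < st.st_size) {
        /* Find the next data segment at or after pos. */
        off_t data = lseek(fd_in, pos, SEEK_DATA);
        if (data < 0) {
            if (errno == ENXIO)
                break;              /* only a trailing hole remains */
            return -1;
        }
        /* Find where that data segment ends. */
        off_t hole = lseek(fd_in, data, SEEK_HOLE);
        if (hole < 0)
            return -1;
        assert(hole > data);

        off_t off_in = data, off_out = data;
        while (off_in < hole) {
            ssize_t n = copy_file_range(fd_in, &off_in, fd_out, &off_out,
                                        (size_t)(hole - off_in), 0);
            if (n < 0)
                return -1;
            if (n == 0)
                break;              /* source ended early: short copy */
        }
        pos = hole;
    }
    /* Restore the logical size, including any trailing hole. */
    return ftruncate(fd_out, st.st_size);
}

/* Demo: data, a hole, then data again; returns the copy's size or -1. */
static off_t demo_sparse_copy(void)
{
    char src_path[] = "/tmp/sparse_src_XXXXXX";
    char dst_path[] = "/tmp/sparse_dst_XXXXXX";
    int src = mkstemp(src_path);
    int dst = mkstemp(dst_path);
    struct stat st;
    char tail[3] = { 0 };
    off_t result = -1;

    if (src >= 0 && dst >= 0 &&
        pwrite(src, "abc", 3, 0) == 3 &&
        pwrite(src, "xyz", 3, 1 << 20) == 3 &&
        sparse_copy(src, dst) == 0 &&
        fstat(dst, &st) == 0 &&
        pread(dst, tail, 3, 1 << 20) == 3 &&
        memcmp(tail, "xyz", 3) == 0)
        result = st.st_size;

    if (src >= 0) { close(src); unlink(src_path); }
    if (dst >= 0) { close(dst); unlink(dst_path); }
    return result;
}
```

Because each copy_file_range() call is limited to a single data segment, the
holes in the source are never handed to the kernel, so they cannot be expanded
into written zeros in the destination.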

2. Looking at the corresponding code:
copy_file_range() -> do_splice_direct() -> splice_direct_to_actor() -> do_splice_read()
-> f_op->splice_read() (filemap_splice_read)

filemap_splice_read() (abridged):
do {
    if (*ppos >= i_size_read(in->f_mapping->host))
        break;  // Hit end of file, exit

    // When filemap_get_pages() encounters a file hole, the pages are filled with zeros.
    // Or is there a case where the filesystem returns failure when it encounters a hole?
    error = filemap_get_pages(&iocb, len, &fbatch, true);
    if (error < 0)
        break;

    // Process each folio, copy to pipe
    for (i = 0; i < folio_batch_count(&fbatch); i++) {
        struct folio *folio = fbatch.folios[i];
        ...
        n = splice_folio_into_pipe(pipe, folio, *ppos, n);
        if (!n)
            goto out;
        ...
    }
} while (len);

I can understand that the [pos, pos+len) range needs to be flushed to the backing file.
Otherwise the FUSE userspace program could see holes in the backing file (file_in), or a
file size that is too small, and its own copy_file_range(back_file_in, back_file_out)
call would return a short copy or copy the holes as zeros.
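The short-copy half of that can be reproduced in plain userspace, without FUSE.
The sketch below only models the situation: the source file stands in for a
backing file whose EOF is still below the end of the requested range because the
extending write has not been flushed yet. copy_range() and demo_short_copy() are
hypothetical names, and the 4096-byte size and /tmp paths are arbitrary
assumptions.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: try to copy len bytes at offset pos from fd_in to
 * the same offset in fd_out, retrying until the kernel reports no more data.
 * Returns the number of bytes copied (less than len if the source ends
 * before pos + len), or -1 on error.
 */
static ssize_t copy_range(int fd_in, int fd_out, off_t pos, size_t len)
{
    off_t off_in = pos, off_out = pos;
    size_t done = 0;

    while (done < len) {
        ssize_t n = copy_file_range(fd_in, &off_in, fd_out, &off_out,
                                    len - done, 0);
        if (n < 0)
            return -1;
        if (n == 0)
            break;                  /* source EOF is below pos + len */
        done += (size_t)n;
    }
    assert(done <= len);
    return (ssize_t)done;
}

/* Demo: the source holds 4096 bytes but the caller asks for 8192. */
static ssize_t demo_short_copy(void)
{
    char src_path[] = "/tmp/cfr_src_XXXXXX";
    char dst_path[] = "/tmp/cfr_dst_XXXXXX";
    int src = mkstemp(src_path);
    int dst = mkstemp(dst_path);
    char block[4096];
    ssize_t copied = -1;

    memset(block, 'A', sizeof(block));
    if (src >= 0 && dst >= 0 &&
        pwrite(src, block, sizeof(block), 0) == (ssize_t)sizeof(block))
        copied = copy_range(src, dst, 0, 2 * sizeof(block));

    if (src >= 0) { close(src); unlink(src_path); }
    if (dst >= 0) { close(dst); unlink(dst_path); }
    return copied;
}
```

Here the copy stops at the source's EOF, so the caller gets back fewer bytes
than it asked for. That is the short count a server would pass back if the
kernel had not written out the data up to pos+len before sending the request.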

But I'm also confused about why we need to flush beyond the [pos, pos+len) range.

Are there any test cases or earlier mailing list discussions of the problem that would
make the reason easier to understand?

I'll keep studying the code and run some tests later.

Thanks
Chunsheng Luo