lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <ZeERAob9Imwh01bG@dread.disaster.area>
Date: Fri, 1 Mar 2024 10:19:30 +1100
From: Dave Chinner <david@...morbit.com>
To: Zhang Yi <yi.zhang@...weicloud.com>
Cc: Christoph Hellwig <hch@...radead.org>, djwong@...nel.org,
	linux-ext4@...r.kernel.org, linux-fsdevel@...r.kernel.org,
	linux-mm@...ck.org, linux-kernel@...r.kernel.org,
	linux-xfs@...r.kernel.org, tytso@....edu, adilger.kernel@...ger.ca,
	jack@...e.cz, ritesh.list@...il.com, willy@...radead.org,
	zokeefe@...gle.com, yi.zhang@...wei.com, chengzhihao1@...wei.com,
	yukuai3@...wei.com, wangkefeng.wang@...wei.com
Subject: Re: [RFC PATCH v3 07/26] iomap: don't increase i_size if it's not a
 write operation

On Thu, Feb 29, 2024 at 04:59:34PM +0800, Zhang Yi wrote:
> Hello, Dave!
> 
> On 2024/2/29 6:25, Dave Chinner wrote:
> > On Wed, Feb 28, 2024 at 04:53:32PM +0800, Zhang Yi wrote:
> >> On 2024/2/13 13:46, Christoph Hellwig wrote:
> >>> Wouldn't it make more sense to just move the size manipulation to the
> >>> write-only code?  An untested version of that is below.  With this
> >>> the naming of the status variable becomes even more confusing than
> >>> it already is, maybe we need to do a cleanup of the *_write_end
> >>> calling conventions as it always returns the passed in copied value
> >>> or 0.
> >>>
> >>> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> >>> index 3dab060aed6d7b..8401a9ca702fc0 100644
> >>> --- a/fs/iomap/buffered-io.c
> >>> +++ b/fs/iomap/buffered-io.c
> >>> @@ -876,34 +876,13 @@ static size_t iomap_write_end(struct iomap_iter *iter, loff_t pos, size_t len,
> >>>  		size_t copied, struct folio *folio)
> >>>  {
> >>>  	const struct iomap *srcmap = iomap_iter_srcmap(iter);
> >>> -	loff_t old_size = iter->inode->i_size;
> >>> -	size_t ret;
> >>> -
> >>> -	if (srcmap->type == IOMAP_INLINE) {
> >>> -		ret = iomap_write_end_inline(iter, folio, pos, copied);
> >>> -	} else if (srcmap->flags & IOMAP_F_BUFFER_HEAD) {
> >>> -		ret = block_write_end(NULL, iter->inode->i_mapping, pos, len,
> >>> -				copied, &folio->page, NULL);
> >>> -	} else {
> >>> -		ret = __iomap_write_end(iter->inode, pos, len, copied, folio);
> >>> -	}
> >>> -
> >>> -	/*
> >>> -	 * Update the in-memory inode size after copying the data into the page
> >>> -	 * cache.  It's up to the file system to write the updated size to disk,
> >>> -	 * preferably after I/O completion so that no stale data is exposed.
> >>> -	 */
> >>> -	if (pos + ret > old_size) {
> >>> -		i_size_write(iter->inode, pos + ret);
> >>> -		iter->iomap.flags |= IOMAP_F_SIZE_CHANGED;
> >>> -	}
> >>
> >> I've recently discovered that if we don't increase i_size in
> >> iomap_zero_iter(), it would break fstests generic/476 on xfs. xfs
> >> depends on iomap_zero_iter() to increase i_size in some cases.
> >>
> >>  generic/476 75s ... _check_xfs_filesystem: filesystem on /dev/pmem2 is inconsistent (r)
> >>  (see /home/zhangyi/xfstests-dev/results//xfs/generic/476.full for details)
> >>
> >>  _check_xfs_filesystem: filesystem on /dev/pmem2 is inconsistent (r)
> >>  *** xfs_repair -n output ***
> >>  Phase 1 - find and verify superblock...
> >>  Phase 2 - using internal log
> >>          - zero log...
> >>          - scan filesystem freespace and inode maps...
> >>  sb_fdblocks 10916, counted 10923
> >>          - found root inode chunk
> >>  ...
> >>
> >> After debugging and analysis, I found the root cause of the problem is
> >> related to the pre-allocations of xfs. xfs pre-allocates some blocks to
> >> reduce fragmentation during buffer append writing, then if we write new
> >> data or do file copy(reflink) after the end of the pre-allocating range,
> >> xfs would zero-out and write back the pre-allocate space(e.g.
> >> xfs_file_write_checks() -> xfs_zero_range()), so we have to update
> >> i_size before writing back in iomap_zero_iter(), otherwise, it will
> >> result in stale delayed extent.
> > 
> > Ok, so this is long because the example is lacking in clear details
> > so to try to understand it I've laid it out in detail to make sure
> > I've understood it correctly.
> > 
> 
> Thanks for the graph, the added detail makes things clear and easy to
> understand. To be honest, it's not exactly the same as the results I
> captured and described (the position A\B\C\D\E\F I described is
> increased one by one), but the root cause of the problem is the same,
> so it doesn't affect our understanding of the problem.

OK, good :)

....

> > However, if this did actually write zeroes to disk, this would end
> > up with:
> > 
> > 	A          C           B     E       F      D
> > 	+wwwwwwwwww+DDDDDDDDDDD+zzzzz+rrrrrrr+dddddd+
> > 	                      EOF   EOF
> >                       (in memory)   (on disk)
> > 
> > Which is wrong - the file extension and zeros should not get exposed
> > to the user until the entire reflink completes. This would expose
> > zeros at the EOF and a file size that the user never asked for after
> > a crash. Experience tells me that they would report this as
> > "filesystem corrupting data on crash".
> > 
> > If we move where i_size gets updated by iomap_zero_iter(), we get:
> > 
> > 	A          C           B     E       F      D
> > 	+wwwwwwwwww+DDDDDDDDDDD+zzzzz+rrrrrrr+dddddd+
> > 	                            EOF
> >                                 (in memory)
> > 		                 (on disk)
> > 
> > Which is also wrong, because now the user can see the size change
> > and read zeros in the middle of the clone operation, which is also
> > wrong.
> > 
> > IOWs, we do not want to move the in-memory or on-disk EOF as a
> > result of zeroing delalloc extents beyond EOF as it opens up
> > transient, non-atomic on-disk states in the event of a crash.
> > 
> > So, catch-22: we need to move the in-memory EOF to write back zeroes
> > beyond EOF, but that would move the on-disk EOF to E before the
> > clone operation starts. i.e. it makes clone non-atomic.
> 
> Make sense. IIUC, I also notice that xfs_file_write_checks() zero
> out EOF blocks if the later write offset is beyond the size of the
> file.  Think about if we replace the reflink operation to a buffer
> write E to F, although it doesn't call xfs_flush_unmap_range()
> directly, but if it could be raced by another background write
> back, and trigger the same problem (I've not try to reproduce it,
> so please correct me if I understand wrong).

Correct, but the write is about to extend the file size when it
writes into the cache beyond the zeroed region. There is no cache
invalidate possible in this path, so the write of data moves the
in-memory EOF past the zeroes in cache and everything works just
fine.

If it races with concurrent background writeback, the writeback will
skip the zeroed range beyond EOF until they are exposed by the first
data write beyond the zeroed post-eof region which moves the
in-memory EOF.

truncate(to a larger size) also does this same zeroing - the page
cache is zeroed before we move the EOF in memory, and so the
writeback will only occur once the in-memory EOF is moved. i.e. it
effectively does:

	xfs_zero_range(oldsize to newsize)
	truncate_setsize(newsize)
	filemap_write_and_wait_range(old size to new size)

> > What should acutally result from the iomap_zero_range() call from
> > xfs_reflink_remap_prep() is a state like this:
> > 
> > 	A          C           B     E       F      D
> > 	+wwwwwwwwww+DDDDDDDDDDD+uuuuu+rrrrrrr+dddddd+
> > 	          EOF         EOF
> >                (on disk)  (in memory)
> > 
> > where 'u' are unwritten extent blocks.
> > 
> 
> Yeah, this is a good solution.
> 
> In xfs_file_write_checks(), I don't fully understand why we need
> the xfs_zero_range().

The EOF block may only be partially written. Hence on extension, we
have to guarantee the part of that block beyond the current EOF is
zero if the write leaves a hole between the current EOF and the
start of the new extending write.

> Theoretically, iomap have already handled
> partial block zeroing for both buffered IO and DIO, so I guess
> the only reason we still need it is to handle pre-allocated blocks
> (no?).

Historically speaking, Linux is able to leak data beyond EOF on
writeback of partial EOF blocks (e.g. mmap() can write to the EOF
page beyond EOF without failing. We try to mitigate these cases
where we can, but we have to consider that at any time the data in
the cache beyond EOF can be non-zero thanks to mmap() and so any
file extension *must* zero any region beyond EOF cached in the page
cache.

> If so,would it be better to call xfs_free_eofblocks() to
> release all the preallocated extents in range? If not, maybe we
> could only zero out mapped partial blocks and also release
> preallocated extents?

No, that will cause all sorts of other performance problems,
especially for reflinked files that triggering COW
operations...

> 
> In xfs_reflink_remap_prep(), I read the commit 410fdc72b05a ("xfs:
> zero posteof blocks when cloning above eof"), xfs used to release
> preallocations, the change log said it didn't work because of the
> PREALLOC flag, but the 'force' parameter is 'true' when calling
> xfs_can_free_eofblocks(), so I don't get the problem met. Could we
> fall back to use xfs_free_eofblocks() and make a state like this?
> 
>  	A          C           B     E       F      D
>  	+wwwwwwwwww+DDDDDDDDDDD+hhhhh+rrrrrrr+dddddd+
>  	          EOF         EOF
>                 (on disk)  (in memory)

It could, but that then requires every place that may call
xfs_zero_range() to be aware of this need to trim EOF blocks to do
the right thing in all cases. We don't want to remove speculative
delalloc in the write() path nor in the truncate(up) case, and so it
doesn't fix the general problem of zeroing specualtive delalloc
beyond EOF requiring writeback to push page caceh pages to disk
before the inode size has been updated.

The general solution is to have zeroing of speculative prealloc
extents beyond EOF simply convert the range to unwritten and then
invalidate any cached pages over that range. At this point, we are
guaranteed to have zeroes across that range, all without needing to
do any IO at all...

-Dave.
-- 
Dave Chinner
david@...morbit.com

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ