[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <d20fadedac56247b3a92c651737b4e8bda05a879.camel@gmail.com>
Date: Wed, 04 Sep 2024 15:51:31 +0800
From: Julian Sun <sunjunchao2870@...il.com>
To: Dave Chinner <david@...morbit.com>
Cc: linux-xfs@...r.kernel.org, linux-fsdevel@...r.kernel.org,
linux-ext4@...r.kernel.org, tytso@....edu, adilger.kernel@...ger.ca,
jack@...e.cz, brauner@...nel.org, djwong@...nel.org, hch@....de,
syzbot+296b1c84b9cbf306e5a0@...kaller.appspotmail.com
Subject: Re: [PATCH] iomap: clean preallocated blocks in iomap_end() when 0
bytes was written.
On Wed, 2024-09-04 at 10:35 +1000, Dave Chinner wrote:
Hi, Dave. Thanks for your review and comments.
> On Tue, Sep 03, 2024 at 01:48:08PM +0800, Julian Sun wrote:
> > Hi, all.
> >
> > Recently, syzbot reported a issue as following:
> >
> > WARNING: CPU: 1 PID: 5222 at fs/iomap/buffered-io.c:727 __iomap_write_begin fs/iomap/buffered-io.c:727 [inline]
> > WARNING: CPU: 1 PID: 5222 at fs/iomap/buffered-io.c:727 iomap_write_begin+0x13f0/0x16f0 fs/iomap/buffered-io.c:830
> > CPU: 1 UID: 0 PID: 5222 Comm: syz-executor247 Not tainted 6.11.0-rc2-syzkaller-00111-gee9a43b7cfe2 #0
> > RIP: 0010:__iomap_write_begin fs/iomap/buffered-io.c:727 [inline]
> > RIP: 0010:iomap_write_begin+0x13f0/0x16f0 fs/iomap/buffered-io.c:830
> > Call Trace:
> > <TASK>
> > iomap_unshare_iter fs/iomap/buffered-io.c:1351 [inline]
> > iomap_file_unshare+0x460/0x780 fs/iomap/buffered-io.c:1391
> > xfs_reflink_unshare+0x173/0x5f0 fs/xfs/xfs_reflink.c:1681
> > xfs_file_fallocate+0x6be/0xa50 fs/xfs/xfs_file.c:997
> > vfs_fallocate+0x553/0x6c0 fs/open.c:334
> > ksys_fallocate fs/open.c:357 [inline]
> > __do_sys_fallocate fs/open.c:365 [inline]
> > __se_sys_fallocate fs/open.c:363 [inline]
> > __x64_sys_fallocate+0xbd/0x110 fs/open.c:363
> > do_syscall_x64 arch/x86/entry/common.c:52 [inline]
> > do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
> > entry_SYSCALL_64_after_hwframe+0x77/0x7f
> > RIP: 0033:0x7f2d716a6899
> >
> > syzbot constructed the following scenario: syzbot called the write()
> > system call, passed an illegal pointer, and attempted to write 0x1017
> > bytes, resulting in 0 bytes written and returning EFAULT to user
> > space. Then, it called the write() system call again, passed another
> > illegal pointer, and attempted to write 0xfea7 bytes, resulting in
> > 0xe00 bytes written. Finally called copy_file_range() sys call and
> > fallocate() sys call with FALLOC_FL_UNSHARE_RANGE flag.
> >
> > What happened here is: during the first write, xfs_buffered_write_iomap_begin()
> > used preallocated 512 blocks, inserted an extent with a length of 512 and
> > reserved 512 blocks in the quota, with the iomap length being 1M.
>
> Why did XFS preallocate 512 blocks? The file was opened O_TRUNC, so
> it should be zero length and a write of just over 4100 bytes will
> not trigger speculative prealloc on a zero length file (threshold is
> 64kB). Indeed, the first speculative prealloc will only be 64kB in
> size...
Perhaps my wording was misleading. This might not be actual preallocation,
but rather it reached the following statement.
if (xfs_has_allocsize(mp))
prealloc_blocks = mp->m_allocsize_blocks;
And mp->m_allocsize_blocks is 512.
>
> Hence it's not immediately obvious what precondition is causing this
> behaviour to occur.
>
> > However, when the write failed(0 byte was written), only 0x1017 bytes were
> > passed to iomap_end() instead of the preallocated 1M bytes/512 blocks.
> > This caused only 3 blocks to be unreserved for the quota in iomap_end(),
> > instead of 512, and the corresponding extent information also only removed
> > 3 blocks instead of 512.
>
> Why 3 blocks? what was the filesystem block size? 2kB?
Exactly. GDB observed that sb_blocksize was 2048 and sb_blocklog was 11.
>
> This also smells of delayed allocation, because if it were
> -preallocation- it would be unwritten extents and we don't punch
> them out on write failures. See my first question....
>
> Regardless, the behaviour is perfectly fine. The remainder of the
> blocks are beyond EOF, and so have no impact on anything at this
> point in time.
>
> > As a result, during the second write, the iomap length was 3 blocks
> > instead of the expected 512 blocks,
>
> What was the mapping? A new delalloc extent from 0 to 6kB? If so,
> that is exactly as expected, because there's a hole in the file
> there.
Yeah. There's a delalloc extent from 0 to 6KB.
And below is the log of extent changes during the reproduction
process before the panic.
[ 24.466778][ T2491] write begin for ino 9292 offset 0 count 4119 offset_fsb 0 end_fsb 3
[ 24.467325][ T2491] insert extent pos:0 off:0 count:512 start:4503599627239434
[ 24.467909][ T2491] iomap->offset is 0 iomap->length is 1048576
[ 24.468603][ T2491] update extent before pos:0 off:0 count:512 start:4503599627239434
[ 24.469089][ T2491] update extent after pos:0 off:3 count:509 start:4503599627239434
write: Bad address
[ 24.469734][ T2491] write begin for ino 9292 offset 0 count 65191 offset_fsb 0 end_fsb 32
[ 24.470206][ T2491] update extent before pos:0 off:3 count:509 start:4503599627239434
[ 24.470673][ T2491] update extent after pos:0 off:0 count:512 start:4503599627239434
[ 24.471110][ T2491] iomap->offset is 0 iomap->length is 6144
[ 24.471916][ T2491] update extent before pos:0 off:0 count:512 start:4503599627239434
[ 24.472382][ T2491] update extent after pos:0 off:0 count:2 start:4503599627239429
[ 24.472838][ T2491] insert extent pos:1 off:3 count:509 start:4503599627239430
[ 24.473270][ T2491] write begin for ino 9292 offset 3584 count 61607 offset_fsb 1 end_fsb 32
write ret is 0xe00
[ 24.474517][ T2491] update extent before pos:0 off:0 count:2 start:4503599627239429
[ 24.474994][ T2491] update extent after pos:0 off:0 count:2 start:13
[ 24.476451][ T8] update extent before pos:0 off:0 count:2 start:13
[ 24.476902][ T8] update extent after pos:0 off:0 count:2 start:13
[ 24.478880][ T2491] insert extent pos:0 off:0 count:2 start:13
copy_file_range ret is 0xe00
[ 24.481463][ T2491] write begin for ino 9292 offset 0 count 8192 offset_fsb 0 end_fsb 4
[ 24.481968][ T2491] insert extent pos:0 off:0 count:32 start:4503599627239430
[ 24.482422][ T2491] write begin for ino 9292 offset 4096 count 4096 offset_fsb 2 end_fsb 4
[ 24.482882][ T2491] write begin for ino 9292 offset 6144 count 2048 offset_fsb 3 end_fsb 4
Note the last two logs: fallocate() successfully unshared 4k bytes,
then skipped the last 2k bytes because they were not being shared,
and then attempted to unshare 2k bytes beyond EOF.
>
> > which ultimately triggered the
> > issue reported by syzbot in the fallocate() system call.
>
> How? We wrote 3584 bytes, which means the first 2 blocks were
> written and marked dirty, and the third block should have been
> punched out by the same process that punched the delalloc blocks in
> the first write. We end up with a file size of 3584 bytes, and
> a delalloc mapping for those first two blocks followed by a hole,
> followed by another delalloc extent beyond EOF.
>
> There is absolutely nothing incorrect about this state - this is
> exactly how we want delalloc beyond EOF to be handled. i.e. it
> remains in place until it is explicitly removed. An normal
> application would fix the write buffer and rewrite the data, thereby
> using the space beyond EOF that we've already preallocated.
Got it. Thanks for your explanation.
>
> IOWs, this change in this patch is papering over the issue by
> preventing short writes from leaving extents beyond EOF, rather than
> working out why either CFR or UNSHARE is going wrong when there are
> extents beyond EOF.
Yeah, totally agreed.
>
> So, onto the copy_file_range() bit.
>
> Then CFR was called with a length of 0xffffffffa003e45bul, which is
> almost 16EB in size. This should result in -EOVERFLOW, because that
> is out of the range that an loff_t can represent.
>
> i.e. generic_copy_file_checks() does this:
>
> if (pos_in + count < pos_in || pos_out + count < pos_out)
> return -EOVERFLOW;
>
> and pos_in is a loff_t which is signed. Hence (0 + 15EB) should
> overflow to a large negative offset and be less than 0. Implicit
> type casting rules always break my brain - this looks like it is
> casting everything to unsigned, thereby not actually checking if
> we're overflowing the max offset the kernel can operate on.
Yes! The check was supposed to fail here, but it didn't. Maybe what
we should do is like this:
if (pos_in + (loff_t)count < pos_in || pos_out + (loff_t)count < pos_out)
return -EOVERFLOW;
My tests showed that CFR returned -EOVERFLOW with the change.
A search for the keyword "Ensure offsets don't wrap" shows that there is
also this kind of implicit conversion issue in generic_remap_checks().
And xfs_exchange_range_checks() uses check_add_overflow() to check for
overflow. Clearly, the latter is a more elegant way of checking.
I'm wondering if we have any tools or can we create a tool to check
for such implicit conversions. GCC provides -Wconversion, but it reports
too many issues, and people might get overwhelmed by such information.
Or can we find a method to filter the warning that we really need to fix?
>
> This oversize length is not caught by a check against max file size
> supported by the superblock, either,, because the count gets
> truncated to EOF before the generic checks against supported maximum
> file sizes are done.
>
> That seems ... wrong. Look at fallocate() - after checking for
> overflow, it checks offset + len against inode->i_sb->s_maxbytes and
> returns -EFBIG at that point.
In this case, offset is 0 and len is 8192, so the check in fallocate()
should passed.
>
> IOWs, I think CFR should be doing nothing but returning either
> -EOVERFLOW of -EFBIG here because the length is way longer than
> maximum supported file offset for any file operation in Linux. Then
> unshare should do nothing because the file is not shared and should
> not have the reflink flag set on it.....
>
> However, the strace indicates that:
>
> copy_file_range(6, [0], 5, NULL, 18446744072099193947, 0) = 3584
>
> That it is copying 3584 bytes. i.e. the overflow check is broken.
> It indicates, however, that the file size is as expected from the
> the short writes that occurred.
>
> Hence the unshare operation should only be operating on the range of
> 0-3584 bytes because that is the only possible range of the file
> that is shared.
>
> However, the fallocate() call is for offset 0, length 0x2000 bytes
> (8kB), and these ranges are passed directly from XFS to
> iomap_file_unshare(). iomap_file_unshare() never checks the range
> against EOF, either, and so we've passed a range beyond EOF to
> iomap_file_unshare() for it to process. That seems ... wrong.
>
> Hence the unshare does:
>
> first map: 0 - 4k - unshare successfully.
> second map: 4k - 6k - hole, skip. Beyond EOF.
> third map: 6k - 8k - delalloc, beyond EOF so needs zeroing.
> Fires warnings because UNSHARE.
Yes. A slight difference: the second mapping was skipped because it wasn't
marked with IOMAP_F_SHARED.
>
> IOWs, iomap_file_unshare() will walk beyond EOF blissfully unaware
> it is trying to unshare blocks that cannot ever be shared first
> place because reflink will not share blocks beyond EOF. And because
> those blocks are beyond EOF, they will always trigger the "need
> zeroing" case in __iomap_write_begin() and fire warnings.
>
> So, yeah, either xfs_file_fallocate() or iomap_file_unshare() need
> to clamp the range being unshared to EOF.
>
> Hence this looks like a couple of contributing issues that need to
> be fixed:
>
> - The CFR overflow checks look broken.
> - The unshare range is never trimmed to EOF.
Exactly. The first can be fixed with using check_add_overflow() macro.
The latter one maybe can be fixed with the following patch:
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index f420c53d86ac..8898d5ec606f 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -1340,6 +1340,9 @@ static loff_t iomap_unshare_iter(struct iomap_iter *iter)
/* don't bother with holes or unwritten extents */
if (srcmap->type == IOMAP_HOLE || srcmap->type == IOMAP_UNWRITTEN)
return length;
+ /* don't try to unshare any extents that beyond EOF. */
+ if (pos > i_size_read(iter->inode))
+ return length;
do {
struct folio *folio;
>
> but it's definitely not a bug caused by short writes leaving
> delalloc extents beyond EOF. There are many, many ways that
> we can create delalloc extents beyond EOF before running an UNSHARE
> operation that can trip over them like this.
>
I see. Thanks again for your detailed and patient explanation.
> -Dave.
Thanks,
--
Julian Sun <sunjunchao2870@...il.com>
Powered by blists - more mailing lists