[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20201125215247.GD2842436@dread.disaster.area>
Date: Thu, 26 Nov 2020 08:52:47 +1100
From: Dave Chinner <david@...morbit.com>
To: Sasha Levin <sashal@...nel.org>
Cc: linux-kernel@...r.kernel.org, stable@...r.kernel.org,
Dave Chinner <dchinner@...hat.com>,
Jens Axboe <axboe@...nel.dk>,
"Darrick J . Wong" <darrick.wong@...cle.com>,
linux-xfs@...r.kernel.org
Subject: Re: [PATCH AUTOSEL 5.9 33/33] xfs: don't allow NOWAIT DIO across
extent boundaries
On Wed, Nov 25, 2020 at 10:35:50AM -0500, Sasha Levin wrote:
> From: Dave Chinner <dchinner@...hat.com>
>
> [ Upstream commit 883a790a84401f6f55992887fd7263d808d4d05d ]
>
> Jens has reported a situation where partial direct IOs can be issued
> and completed yet still return -EAGAIN. We don't want this to report
> a short IO as we want XFS to complete user DIO entirely or not at
> all.
>
> This partial IO situation can occur on a write IO that is split
> across an allocated extent and a hole, and the second mapping is
> returning EAGAIN because allocation would be required.
>
> The trivial reproducer:
>
> $ sudo xfs_io -fdt -c "pwrite 0 4k" -c "pwrite -V 1 -b 8k -N 0 8k" /mnt/scr/foo
> wrote 4096/4096 bytes at offset 0
> 4 KiB, 1 ops; 0.0001 sec (27.509 MiB/sec and 7042.2535 ops/sec)
> pwrite: Resource temporarily unavailable
> $
>
> The pwritev2(0, 8kB, RWF_NOWAIT) call returns EAGAIN having done
> the first 4kB write:
>
> xfs_file_direct_write: dev 259:1 ino 0x83 size 0x1000 offset 0x0 count 0x2000
> iomap_apply: dev 259:1 ino 0x83 pos 0 length 8192 flags WRITE|DIRECT|NOWAIT (0x31) ops xfs_direct_write_iomap_ops caller iomap_dio_rw actor iomap_dio_actor
> xfs_ilock_nowait: dev 259:1 ino 0x83 flags ILOCK_SHARED caller xfs_ilock_for_iomap
> xfs_iunlock: dev 259:1 ino 0x83 flags ILOCK_SHARED caller xfs_direct_write_iomap_begin
> xfs_iomap_found: dev 259:1 ino 0x83 size 0x1000 offset 0x0 count 8192 fork data startoff 0x0 startblock 24 blockcount 0x1
> iomap_apply_dstmap: dev 259:1 ino 0x83 bdev 259:1 addr 102400 offset 0 length 4096 type MAPPED flags DIRTY
>
> Here the first iomap loop has mapped the first 4kB of the file and
> issued the IO, and we enter the second iomap_apply loop:
>
> iomap_apply: dev 259:1 ino 0x83 pos 4096 length 4096 flags WRITE|DIRECT|NOWAIT (0x31) ops xfs_direct_write_iomap_ops caller iomap_dio_rw actor iomap_dio_actor
> xfs_ilock_nowait: dev 259:1 ino 0x83 flags ILOCK_SHARED caller xfs_ilock_for_iomap
> xfs_iunlock: dev 259:1 ino 0x83 flags ILOCK_SHARED caller xfs_direct_write_iomap_begin
>
> And we exit with -EAGAIN out because we hit the allocate case trying
> to make the second 4kB block.
>
> Then IO completes on the first 4kB and the original IO context
> completes and unlocks the inode, returning -EAGAIN to userspace:
>
> xfs_end_io_direct_write: dev 259:1 ino 0x83 isize 0x1000 disize 0x1000 offset 0x0 count 4096
> xfs_iunlock: dev 259:1 ino 0x83 flags IOLOCK_SHARED caller xfs_file_dio_aio_write
>
> There are other vectors to the same problem when we re-enter the
> mapping code if we have to make multiple mappinfs under NOWAIT
> conditions. e.g. failing trylocks, COW extents being found,
> allocation being required, and so on.
>
> Avoid all these potential problems by only allowing IOMAP_NOWAIT IO
> to go ahead if the mapping we retrieve for the IO spans an entire
> allocated extent. This avoids the possibility of subsequent mappings
> to complete the IO from triggering NOWAIT semantics by any means as
> NOWAIT IO will now only enter the mapping code once per NOWAIT IO.
>
> Reported-and-tested-by: Jens Axboe <axboe@...nel.dk>
> Signed-off-by: Dave Chinner <dchinner@...hat.com>
> Reviewed-by: Darrick J. Wong <darrick.wong@...cle.com>
> Signed-off-by: Darrick J. Wong <darrick.wong@...cle.com>
> Signed-off-by: Sasha Levin <sashal@...nel.org>
> ---
> fs/xfs/xfs_iomap.c | 29 +++++++++++++++++++++++++++++
> 1 file changed, 29 insertions(+)
No, please don't pick this up for stable kernels until at least 5.10
is released. This still needs some time integration testing time to
ensure that we haven't introduced any other performance regressions
with this change.
We've already had one XFS upstream kernel regression in this -rc
cycle propagated to the stable kernels in 5.9.9 because the stable
process picked up a bunch of random XFS fixes within hours of them
being merged by Linus. One of those commits was a result of a
thinko, and despite the fact we found it and reverted it within a
few days, users of stable kernels have been exposed to it for a
couple of weeks. That *should never have happened*.
This has happened before, and *again* we were lucky this wasn't
worse than it was. We were saved by the flaw being caught by own
internal pre-write corruption verifiers (which exist because we
don't trust our code to be bug-free, let alone the collections of
random, poorly tested backports) so that it only resulted in
corruption shutdowns rather than permanent on-disk damage and data
loss.
Put simply: the stable process is flawed because it shortcuts the
necessary stabilisation testing for new code. It doesn't matter if
the merged commits have a "fixes" tag in them, that tag doesn't mean
the change is ready to be exposed to production systems. We need the
*-rc stabilisation process* to weed out thinkos, brown paper bag
bugs, etc, because we all make mistakes, and bugs in filesystem code
can *lose user data permanently*.
Hence I ask that the stable maintainers only do automated pulls of
iomap and XFS changes from upstream kernels when Linus officially
releases them rather than at random points in time in the -rc cycle.
If there is a critical fix we need to go back to stable kernels
immediately, we will let stable@...nel.org know directly that we
want this done.
Cheers,
Dave.
--
Dave Chinner
david@...morbit.com
Powered by blists - more mailing lists