[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <20210415141450.GA12301@localhost.localdomain>
Date: Thu, 15 Apr 2021 10:14:50 -0400
From: Eric Whitney <enwlinux@...il.com>
To: Jan Kara <jack@...e.cz>
Cc: Ted Tso <tytso@....edu>, linux-ext4@...r.kernel.org,
Eric Whitney <enwlinux@...il.com>, stable@...r.kernel.org
Subject: Re: [PATCH v2] ext4: Fix occasional generic/418 failure
* Jan Kara <jack@...e.cz>:
> Eric has noticed that after pagecache read rework, generic/418 is
> occasionally failing for ext4 when blocksize < pagesize. In fact, the
> pagecache rework just made hard to hit race in ext4 more likely. The
> problem is that since ext4 conversion of direct IO writes to iomap
> framework (commit 378f32bab371), we update inode size after direct IO
> write only after invalidating page cache. Thus if buffered read sneaks
> at unfortunate moment like:
>
> CPU1 - write at offset 1k CPU2 - read from offset 0
> iomap_dio_rw(..., IOMAP_DIO_FORCE_WAIT);
> ext4_readpage();
> ext4_handle_inode_extension()
>
> the read will zero out tail of the page as it still sees smaller inode
> size and thus page cache becomes inconsistent with on-disk contents with
> all the consequences.
>
> Fix the problem by moving inode size update into end_io handler which
> gets called before the page cache is invalidated.
>
> Reported-by: Eric Whitney <enwlinux@...il.com>
> Fixes: 378f32bab371 ("ext4: introduce direct I/O write using iomap infrastructure")
> CC: stable@...r.kernel.org
> Signed-off-by: Jan Kara <jack@...e.cz>
> ---
> fs/ext4/file.c | 20 ++++++++++++++++----
> 1 file changed, 16 insertions(+), 4 deletions(-)
>
> Eric, can you please try whether this patch fixes the failures you are
> occasionally seeing?
>
I applied this patch to 5.12-rc7 and ran tests on two different sets of
hardware using kvm-xfstests.
5000 runs of generic/418 passed on the 1k test configuration without
incident. On an unmodified -rc7 kernel, I'm typically getting 100-200 failures
for that many runs. So, that looks good.
After running through all the appliance's test configurations with the
patched kernel, I did see one unexpected failure of generic/068 on
data=journal due to a burst of kernel warnings. Unexpected means I haven't
seen this failure in recent memory on upstream -rc* regression runs. A
10 run attempt to reproduce the failure failed to do so. An instance of the
warning follows.
Thanks,
Eric
[35456.178585] WARNING: CPU: 1 PID: 22225 at fs/ext4/ext4_jbd2.c:75 ext4_journal_check_start+0x4c/0x90
[35456.180802] Modules linked in:
[35456.181554] CPU: 1 PID: 22225 Comm: kworker/u4:0 Not tainted 5.12.0-rc7+ #2
[35456.183169] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
[35456.185030] Workqueue: writeback wb_workfn (flush-254:32)
[35456.186290] RIP: 0010:ext4_journal_check_start+0x4c/0x90
[35456.187540] Code: e1 02 75 24 f6 43 50 01 75 54 83 bb 20 03 00 00 04 74 17 48 8b 92 70 03 00 00 31 c0 48 85 d2 74 07 8b 02 83 e0 02 75 06 5b c3 <0f> 0b eb e5 44 8b 42 08 68 11 23 0f 82 48 89 df 45 31 c9 b9 01 00
[35456.190939] RSP: 0018:ffffc90004167850 EFLAGS: 00010246
[35456.191845] RAX: 00000000fffffffb RBX: ffff888009f76000 RCX: 0000000000000000
[35456.193668] RDX: ffff888009f73000 RSI: ffffffff8235dd60 RDI: ffff888009f76000
[35456.195461] RBP: 0000000000000009 R08: 0000000000000000 R09: ffffffff813c9fd8
[35456.197365] R10: 0000000000000000 R11: 000000000002f6b8 R12: 0000000000000000
[35456.199276] R13: 0000000000000008 R14: 0000000000000002 R15: 0000000000000771
[35456.201080] FS: 0000000000000000(0000) GS:ffff88807dd00000(0000) knlGS:0000000000000000
[35456.202513] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[35456.203521] CR2: 00007f30cf413000 CR3: 0000000001c04006 CR4: 0000000000370ee0
[35456.205554] Call Trace:
[35456.206275] __ext4_journal_start_sb+0x118/0x180
[35456.207568] ext4_writepage+0x368/0x740
[35456.209077] ? _raw_spin_unlock_irqrestore+0x30/0x40
[35456.210638] ? page_mkclean+0x6e/0xc0
[35456.211899] ? lock_is_held_type+0x99/0x100
[35456.213341] __writepage+0x19/0x70
[35456.214614] write_cache_pages+0x1a6/0x470
[35456.216026] ? __wb_calc_thresh+0xd0/0xd0
[35456.217617] generic_writepages+0x59/0x90
[35456.219119] ? lock_is_held_type+0x99/0x100
[35456.220673] ? do_writepages+0x41/0xd0
[35456.222038] ? do_writepages+0x41/0xd0
[35456.223341] ext4_writepages+0xc40/0x1080
[35456.224750] ? __lock_acquire+0x3d5/0x2380
[35456.226175] ? __lock_acquire+0x3d5/0x2380
[35456.227586] ? do_writepages+0x41/0xd0
[35456.228563] ? __ext4_mark_inode_dirty+0x280/0x280
[35456.230308] do_writepages+0x41/0xd0
[35456.231540] ? lock_is_held_type+0x99/0x100
[35456.233032] __writeback_single_inode+0x58/0x530
[35456.234542] writeback_sb_inodes+0x204/0x4d0
[35456.235995] wb_writeback+0x107/0x430
[35456.237489] wb_workfn+0xb5/0x5f0
[35456.238645] ? lock_acquire+0xb5/0x380
[35456.240026] ? process_one_work+0x223/0x5d0
[35456.241522] ? lock_is_held_type+0x99/0x100
[35456.243139] process_one_work+0x2a5/0x5d0
[35456.244615] ? process_one_work+0x5d0/0x5d0
[35456.246108] worker_thread+0x2d/0x3c0
[35456.247429] ? process_one_work+0x5d0/0x5d0
[35456.248926] kthread+0x127/0x140
[35456.250090] ? kthread_park+0x80/0x80
[35456.251379] ret_from_fork+0x22/0x30
[35456.252559] irq event stamp: 4858465
[35456.253785] hardirqs last enabled at (4858473): [<ffffffff8115ddee>] console_unlock+0x36e/0x550
[35456.256235] hardirqs last disabled at (4858482): [<ffffffff8115de77>] console_unlock+0x3f7/0x550
[35456.259117] softirqs last enabled at (4858496): [<ffffffff81c00303>] __do_softirq+0x303/0x41a
[35456.261485] softirqs last disabled at (4858491): [<ffffffff810ecade>] irq_exit_rcu+0xae/0xb0
[35456.263742] ---[ end trace ea82423dcfaf8730 ]---
> Changes since v1:
> * Rewritten the fix to avoid the need for separate transaction handle for
> orphan list update
>
> diff --git a/fs/ext4/file.c b/fs/ext4/file.c
> index 194f5d00fa32..be1e80af61be 100644
> --- a/fs/ext4/file.c
> +++ b/fs/ext4/file.c
> @@ -371,15 +371,27 @@ static ssize_t ext4_handle_inode_extension(struct inode *inode, loff_t offset,
> static int ext4_dio_write_end_io(struct kiocb *iocb, ssize_t size,
> int error, unsigned int flags)
> {
> - loff_t offset = iocb->ki_pos;
> + loff_t pos = iocb->ki_pos;
> struct inode *inode = file_inode(iocb->ki_filp);
>
> if (error)
> return error;
>
> - if (size && flags & IOMAP_DIO_UNWRITTEN)
> - return ext4_convert_unwritten_extents(NULL, inode,
> - offset, size);
> + if (size && flags & IOMAP_DIO_UNWRITTEN) {
> + error = ext4_convert_unwritten_extents(NULL, inode, pos, size);
> + if (error < 0)
> + return error;
> + }
> + /*
> + * If we are extending the file, we have to update i_size here before
> + * page cache gets invalidated in iomap_dio_rw(). Otherwise racing
> + * buffered reads could zero out too much from page cache pages. Update
> + * of on-disk size will happen later in ext4_dio_write_iter() where
> + * we have enough information to also perform orphan list handling etc.
> + */
> + pos += size;
> + if (pos > i_size_read(inode))
> + i_size_write(inode, pos);
>
> return 0;
> }
> --
> 2.26.2
>
Powered by blists - more mailing lists