Message-ID: <388f4a26-c689-3c0b-e1b4-45e68c245c6d@huaweicloud.com>
Date: Thu, 15 Jun 2023 16:22:50 +0800
From: Zhang Yi <yi.zhang@...weicloud.com>
To: Theodore Ts'o <tytso@....edu>
Cc: linux-ext4@...r.kernel.org, adilger.kernel@...ger.ca, jack@...e.cz,
yi.zhang@...wei.com, chengzhihao1@...wei.com, yukuai3@...wei.com
Subject: Re: [PATCH] jbd2: skip reading super block if it has been verified
On 2023/6/15 13:26, Theodore Ts'o wrote:
> On Thu, Jun 15, 2023 at 11:49:41AM +0800, Zhang Yi wrote:
>> From: Zhang Yi <yi.zhang@...wei.com>
>>
>> We got a NULL pointer dereference issue below while running generic/475
>> I/O failure pressure test.
>
> Have you been able to reproduce this failure without the "recheck
> checkpoint" series applied? I have not, so like with the e2fsck bug
> fix, I can understand how the bug fix worked, but I still don't
> understand why I wasn't seeing until I tried to apply the "recheck
> chekcpoint" and the following patches in that patch series.

Yes, I can reproduce this failure without the "recheck checkpoint"
series applied. It took anywhere from about 5 minutes to 1 hour to
reproduce on your dev branch (reset to the parent commit 5404e4738054
"ext4: refactoring to use the unified helper ext4_quotas_off()") with
the fstests config below.

# ext4 regression fstests config
[ext4]
export FSTYP=ext4
export TEST_DEV=/dev/pmem0p1
export TEST_DIR=/mnt/test
export SCRATCH_DEV=/dev/pmem0p2
export SCRATCH_MNT=/mnt/scratch
export LOGWRITES_DEV=/dev/vdc1
export SCRATCH_LOGDEV=/dev/vdc2
export MKFS_OPTIONS="-O ^extents,^flex_bg,^uninit_bg,^64bit,^metadata_csum,^huge_file,^dir_nlink,^extra_isize"
[ 315.435845] EXT4-fs (dm-0): previous I/O error to superblock detected
[ 315.435877] EXT4-fs (dm-0): I/O error while writing superblock
[ 315.435885] EXT4-fs (dm-0): Remounting filesystem read-only
[ 315.438261] BUG: kernel NULL pointer dereference, address: 0000000000000000
[ 315.453689] #PF: supervisor write access in kernel mode
[ 315.454884] #PF: error_code(0x0002) - not-present page
[ 315.456048] PGD 139b3b067 P4D 139b3b067 PUD 1538ea067 PMD 0
[ 315.456201] EXT4-fs error (device dm-0): __ext4_find_entry:1678: inode #131073: comm fsstress: reading directory lblock 0
[ 315.457403] Oops: 0002 [#1] PREEMPT SMP
[ 315.457411] CPU: 14 PID: 10107 Comm: fsstress Not tainted 6.4.0-rc5-00054-g5404e4738054 #214
[ 315.457416] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.1-2.fc37 04/01/2014
[ 315.457418] RIP: 0010:jbd2_journal_set_features+0xf4/0x500
[ 315.461326] EXT4-fs (dm-0): I/O error while writing superblock
[ 315.462073] Code: 48 83 05 5e 32 90 0c 01 48 83 05 f6 05 90 0c 01 4d 8b 74 24 38 e8 dc 6c bc 00 48 83 05 ec 05 90 0c 01 48 83 05 bc 05 90 0c 01 <f0> 49 0f ba 2e 02 0f 92 c0 48 83 05 b3 05 90 0c 01 48 83 05 d5
[ 315.462086] RSP: 0018:ffffc900116cbad8 EFLAGS: 00010212
[ 315.462103] RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000000001
[ 315.462107] RDX: 0000000080000000 RSI: ffffffffafd25d54 RDI: 0000000000000001
[ 315.462115] RBP: 0000000000000000 R08: ffffffffafd256f0 R09: 0000000000000000
[ 315.468526] R10: 642820726f727265 R11: 2073662d34545845 R12: ffff88817e85e800
[ 315.468535] R13: 0000000000000000 R14: 0000000000000000 R15: ffff888126d93000
[ 315.468548] FS: 00007fda46982b80(0000) GS:ffff888237980000(0000) knlGS:0000000000000000
[ 315.468560] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 315.468568] CR2: 0000000000000000 CR3: 00000001398d0000 CR4: 00000000000006e0
[ 315.487792] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 315.487798] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 315.491048] Call Trace:
[ 315.491560] <TASK>
[ 315.494065] ? show_regs+0x84/0x90
[ 315.494089] ? __die_body+0x22/0x90
[ 315.494104] ? __die+0x35/0x50
[ 315.494121] ? page_fault_oops+0x1d3/0x5e0
[ 315.515211] ? search_bpf_extables+0x85/0xc0
[ 315.523168] ? jbd2_journal_set_features+0xf4/0x500
[ 315.524214] ? search_exception_tables+0x7c/0x90
[ 315.525211] ? kernelmode_fixup_or_oops+0x140/0x1a0
[ 315.526370] ? __bad_area_nosemaphore+0x208/0x350
[ 315.527475] ? mt_find+0x2ab/0x3c0
[ 315.528718] ? __bad_area+0x88/0xc0
[ 315.529936] ? bad_area+0x1a/0x30
[ 315.530696] ? do_user_addr_fault+0xa6d/0xd00
[ 315.531550] ? exc_page_fault+0xe7/0x3b0
[ 315.532339] ? asm_exc_page_fault+0x22/0x30
[ 315.533153] ? jbd2_journal_set_features+0xf4/0x500
[ 315.533922] ? jbd2_journal_set_features+0xe4/0x500
[ 315.534636] jbd2_journal_revoke+0x43/0x330
[ 315.535272] __ext4_forget+0x112/0x2c0
[ 315.535804] ? __find_get_block+0x155/0x5a0
[ 315.536443] ext4_free_blocks+0xbd2/0xf20
[ 315.537058] ? ext4_free_data+0x140/0x210
[ 315.538420] ? ext4_free_branches+0x2d4/0x3a0
[ 315.540534] ext4_free_branches+0x1c9/0x3a0
[ 315.542064] ext4_ind_truncate+0x361/0x3f0
[ 315.543304] ? ext4_discard_preallocations+0x3c1/0x740
[ 315.546111] ext4_truncate+0x4a0/0x710
[ 315.547623] ext4_file_write_iter+0xb8d/0xe90
[ 315.548940] vfs_write+0x20e/0x590
[ 315.549986] ksys_write+0x77/0x160
[ 315.552027] __x64_sys_write+0x1d/0x30
[ 315.553492] do_syscall_64+0x68/0xf0
[ 315.554711] entry_SYSCALL_64_after_hwframe+0x63/0xcd

I can also speed up reproduction to about 2 minutes by adding a delay
in jbd2_write_superblock(), with or without the "recheck checkpoint"
patch series applied.

diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
index b5e57735ab3f..90d78fe0fb33 100644
--- a/fs/jbd2/journal.c
+++ b/fs/jbd2/journal.c
@@ -1623,6 +1623,7 @@ static int journal_reset(journal_t *journal)
  * This function expects that the caller will have locked the journal
  * buffer head, and will return with it unlocked
  */
+#include <linux/delay.h>
 static int jbd2_write_superblock(journal_t *journal, blk_opf_t write_flags)
 {
 	struct buffer_head *bh = journal->j_sb_buffer;
@@ -1659,6 +1660,7 @@ static int jbd2_write_superblock(journal_t *journal, blk_opf_t write_flags)
 	bh->b_end_io = end_buffer_write_sync;
 	submit_bh(REQ_OP_WRITE | write_flags, bh);
 	wait_on_buffer(bh);
+	msleep(10);
 	if (buffer_write_io_error(bh)) {
 		clear_buffer_write_io_error(bh);
 		set_buffer_uptodate(bh);
>
>> If the journal super block had been read and verified, there is no need
>> to call bh_read() read it again even if it has been failed to written
>> out. So the fix could be simply move buffer_verified(bh) in front of
>> bh_read().
>>
>> Fixes: d9eafe0afafa ("jbd2: factor out journal initialization from journal_get_superblock()")
>
> That works, but it's worth noting that commit d9eafe0afafa caused the
> failure by removing the check on j_journal_version to determine
> whether the superblock was read or not. If the journal superblock had
> been previously read, j_journal_version would be either 1 or 2. If it
> had been zero, then superblock was not read. So from commit
> d9eafe0afafa:
>
> /* Load journal superblock if it is not loaded yet. */
> - if (journal->j_format_version == 0 &&
> - journal_get_superblock(journal) != 0)
> + if (journal_get_superblock(journal))
> return 0;
> if (!jbd2_format_support_feature(journal))
> return 0;
>
>
> The comment "Load journal superblock if it is not loaded yet." should
> be removed, since it no longer makes sense once the
> "journal->j_format_version == 0" check was removed.
Yes.
>
> I'll also note that a problem with d9eafe0afafa is that by removing
> the j_format_version check, every time we add a revoke header, and we
> call jbd2_journal_set_features(), this was causing an unconditional
> read of the journal superblock and that unnecessary I/O could slow
> down certain workloads.
>
Yes, fortunately it is innocuous in general because the journal super
block buffer is always in memory and uptodate, so bh_read() does not
submit I/O. It only affects the fault case: the window in
jbd2_write_superblock() in which the journal super block has failed to
be written out and has not yet been restored to uptodate.
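
To make the ordering concrete, here is a minimal user-space sketch of
that window. It is only a toy model, not the kernel code: every toy_*
name is hypothetical, the struct stands in for the buffer_head state
bits, and the read failure is assumed to come from the same injected
I/O fault. It contrasts calling the read before the verified check with
checking the verified bit first.

```c
#include <errno.h>
#include <stdbool.h>

/* Toy stand-in for the journal super block buffer state (hypothetical). */
struct toy_bh {
	bool uptodate;     /* models buffer_uptodate() */
	bool verified;     /* models buffer_verified() */
	bool read_fails;   /* assumption: injected I/O fault also hits reads */
	int reads_issued;  /* number of simulated read I/Os submitted */
};

/* Models bh_read(): no I/O is submitted if the buffer is uptodate. */
static int toy_bh_read(struct toy_bh *bh)
{
	if (bh->uptodate)
		return 0;
	bh->reads_issued++;
	if (bh->read_fails)
		return -EIO;
	bh->uptodate = true;
	return 0;
}

/* Ordering before the fix: read first, then look at the verified bit. */
static int toy_get_superblock_read_first(struct toy_bh *bh)
{
	int err = toy_bh_read(bh);

	if (err < 0)
		return err;
	if (bh->verified)
		return 0;
	bh->verified = true;	/* super block checks would run here */
	return 0;
}

/* Ordering after the fix: an already verified buffer is never re-read. */
static int toy_get_superblock_verified_first(struct toy_bh *bh)
{
	int err;

	if (bh->verified)
		return 0;
	err = toy_bh_read(bh);
	if (err < 0)
		return err;
	bh->verified = true;	/* super block checks would run here */
	return 0;
}

/* Buffer state inside the window: write failed, uptodate not restored. */
static struct toy_bh toy_bh_in_fault_window(void)
{
	struct toy_bh bh = {
		.uptodate = false,
		.verified = true,
		.read_fails = true,
		.reads_issued = 0,
	};

	return bh;
}
```

In this model, a buffer in the fault window makes the read-first
variant issue one read and return -EIO, while the verified-first
variant returns 0 without issuing any I/O.
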
Thanks,
Yi.