[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <4e82b97c-137f-52d0-da11-555184fe01f5@huaweicloud.com>
Date: Thu, 15 Jun 2023 11:56:13 +0800
From: Zhang Yi <yi.zhang@...weicloud.com>
To: Theodore Ts'o <tytso@....edu>
Cc: Zhihao Cheng <chengzhihao1@...wei.com>, linux-ext4@...r.kernel.org,
adilger.kernel@...ger.ca, jack@...e.cz, yi.zhang@...wei.com,
yukuai3@...wei.com
Subject: Re: [PATCH v3 4/6] jbd2: Fix wrongly judgement for buffer head
removing while doing checkpoint
On 2023/6/15 4:37, Theodore Ts'o wrote:
> On Wed, Jun 14, 2023 at 09:25:28PM +0800, Zhang Yi wrote:
>>
>> Sorry about the regression, I found that this issue is not introduced
>> by the first patch in this patch series ("jbd2: recheck chechpointing
>> non-dirty buffer"), is d9eafe0afafa ("jbd2: factor out journal
>> initialization from journal_get_superblock()") [1] on your dev branch.
>>
>> The problem is the journal super block had been failed to write out
>> due to IO fault, it's uptodate bit was cleared by
>> end_buffer_write_syn() and didn't reset yet in jbd2_write_superblock().
>> And it raced by jbd2_journal_revoke()->jbd2_journal_set_features()->
>> jbd2_journal_check_used_features()->journal_get_superblock()->bh_read(),
>> unfortunately, the read IO is also fail, so the error handling in
>> journal_fail_superblock() clear the journal->j_sb_buffer, finally lead
>> to above NULL pointer dereference issue.
>
> Thanks for looking into this. What I believe you are saying is that
> the root cause is that earlier patch, but it is still something about
> the patch "jbd2: recheck chechpointing non-dirty buffer" which is
> changing the timing enough that we're hitting this buffer (because
> without the "recheck checkpointing" patch, I'm not seeing the NULL
> pointer dereference.
I have send out a separate patch names "jbd2: skip reading super block
if it has been verified" to fix above NULL pointer dereference issue, I
have been runing ext3 generic/475 about 12hours and have not reproduced
the problem again (I will also do more tests later). Please take a look
at it.
>
> As far as the e2fsck bug that was causing it to hang in the ext4/adv
> test scenario, the patch was a simple one, and I have also checked in
> a test case which was a reliable reproducer of the problem. (See
> attached for the patches and more detail.)
>
> It is really interesting that "recheck checkpointing" patch is making
> enough of a difference that it is unmasking these bugs. If you could
> take a look at these changes and perhaps think about how this patch
> series could be changing the nature of the corruption (specifically,
> how symlink inodes referenced from inline directories could be
> corupted with "rechecking checkpointing", thus unmasking the
> e2fsprogs, I'd really appreciate it.
>
Sure, we will take a look at it for details.
Thanks,
Yi.
Powered by blists - more mailing lists