[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20230614203740.GE51259@mit.edu>
Date: Wed, 14 Jun 2023 16:37:40 -0400
From: "Theodore Ts'o" <tytso@....edu>
To: Zhang Yi <yi.zhang@...weicloud.com>
Cc: Zhihao Cheng <chengzhihao1@...wei.com>, linux-ext4@...r.kernel.org,
adilger.kernel@...ger.ca, jack@...e.cz, yi.zhang@...wei.com,
yukuai3@...wei.com
Subject: Re: [PATCH v3 4/6] jbd2: Fix wrongly judgement for buffer head
removing while doing checkpoint
On Wed, Jun 14, 2023 at 09:25:28PM +0800, Zhang Yi wrote:
>
> Sorry about the regression, I found that this issue is not introduced
> by the first patch in this patch series ("jbd2: recheck chechpointing
> non-dirty buffer"), is d9eafe0afafa ("jbd2: factor out journal
> initialization from journal_get_superblock()") [1] on your dev branch.
>
> The problem is the journal super block had been failed to write out
> due to IO fault, it's uptodate bit was cleared by
> end_buffer_write_syn() and didn't reset yet in jbd2_write_superblock().
> And it raced by jbd2_journal_revoke()->jbd2_journal_set_features()->
> jbd2_journal_check_used_features()->journal_get_superblock()->bh_read(),
> unfortunately, the read IO is also fail, so the error handling in
> journal_fail_superblock() clear the journal->j_sb_buffer, finally lead
> to above NULL pointer dereference issue.
Thanks for looking into this. What I believe you are saying is that
the root cause is that earlier patch, but it is still something about
the patch "jbd2: recheck chechpointing non-dirty buffer" which is
changing the timing enough that we're hitting this buffer (because
without the "recheck checkpointing" patch, I'm not seeing the NULL
pointer dereference.
As far as the e2fsck bug that was causing it to hang in the ext4/adv
test scenario, the patch was a simple one, and I have also checked in
a test case which was a reliable reproducer of the problem. (See
attached for the patches and more detail.)
It is really interesting that "recheck checkpointing" patch is making
enough of a difference that it is unmasking these bugs. If you could
take a look at these changes and perhaps think about how this patch
series could be changing the nature of the corruption (specifically,
how symlink inodes referenced from inline directories could be
corupted with "rechecking checkpointing", thus unmasking the
e2fsprogs, I'd really appreciate it.
Thanks,
- Ted
View attachment "0001-e2fsck-fix-handling-of-a-invalid-symlink-in-an-inlin.patch" of type "text/x-diff" (1609 bytes)
View attachment "0002-tests-add-test-for-handling-an-invalid-symlink-in-an.patch" of type "text/x-diff" (3924 bytes)
Powered by blists - more mailing lists