[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <8fe067d0-6d57-9dd7-2c10-5a2c34037ee1@colorfullife.com>
Date: Thu, 30 Dec 2021 09:16:13 +0100
From: Manfred Spraul <manfred@...orfullife.com>
To: Theodore Ts'o <tytso@....edu>
Cc: adilger.kernel@...ger.ca, linux-ext4@...r.kernel.org, 1vier1@....de
Subject: Re: JBD2: journal transaction 6943 on loop0-8 is corrupt.
Hi Ted,
On 12/30/21 02:37, Theodore Ts'o wrote:
> On Tue, Dec 28, 2021 at 09:36:22PM +0100, Manfred Spraul wrote:
>> Hi,
>>
>> with simulated power failures, I see a corrupted journal
>>
>> [39056.200845] JBD2: journal transaction 6943 on loop0-8 is corrupt.
>> [39056.200851] EXT4-fs (loop0): error loading journal
> This means that the journal replay found a commit which was *not* the
> last commit, and which contained a CRC error. If it's the last commit
> (e.g., there is no valid subsequent commit block), then it's possible
> that the journal commit was never completed before the system crashed
> --- e.g., it was an interrupted commit.
It is the last commit, there are no valid subsequent commit blocks.
The current failure model is simple: all blocks up to block <n> are
written, the blocks starting from <n+1> are discarded.
What I can't rule out, but I think this is not what I see:
The image is small (512 MB) and everything is in memory. Thus I would
not rule out that the whole journal is filled within less than one second.
The commit header contains h_commit_sec and h_commit_nsec, but from what
I see, do_one_pass() evaluates only h_commit_sec.
> Your test is aborting the commit at various points in the write I/O
> stream, so it should be simulating an interrupted commit (assuming
> that it's not corrupting any I/O. So the jbd2 layer should have
> understood it was the last commit in the journal, and been OK with the
> checksum failure.
I think the jbd2 layer understood that it was the last commit - but it
nevertheless failed the recovery.
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/jbd2/recovery.c?h=v5.16-rc7#n809
> |
> if (pass == PASS_SCAN &&
> !jbd2_commit_block_csum_verify(journal,
> bh->b_data)) {
> chksum_error:
> if (commit_time < last_trans_commit_time)
> goto ignore_crc_mismatch;
> info->end_transaction = next_commit_ID;
>
> if (!jbd2_has_feature_async_commit(journal)) {
> journal->j_failed_commit =
> next_commit_ID;
> brelse(bh);
> break;
> }
> }
> |
async_commit() is false.
journal->j_failed_commit is set, and thus after journal playback, a
failure is reported :-(
What I have done:
- I have removed the line journal->j_failed_commit = next_commit_ID,
then the image is mounted.
- I had added pr_info() lines, and this confirms that it starts with a
failure of jbd2_block_csum_verify().
dmesg with JBD2 debug output and additional pr_info lines:
>>>
[ 748.591348] next_commit_id increased: 6943.
[ 748.593397] fs/jbd2/recovery.c: (do_one_pass, 517): Scanning for
sequence ID 6943 at 4544/8192
[ 748.594322] fs/jbd2/recovery.c: (do_one_pass, 526): JBD2: checking
block 4544
[ 748.595879] fs/jbd2/recovery.c: (do_one_pass, 549): Found magic 5,
sequence 6943
[ 748.596800] fs/jbd2/recovery.c: (do_one_pass, 517): Scanning for
sequence ID 6943 at 4545/8192
[ 748.600073] fs/jbd2/recovery.c: (do_one_pass, 526): JBD2: checking
block 4545
[ 748.601398] fs/jbd2/recovery.c: (do_one_pass, 549): Found magic 1,
sequence 6943
[ 748.603481] fs/jbd2/recovery.c: (do_one_pass, 517): Scanning for
sequence ID 6943 at 4574/8192
[ 748.606238] fs/jbd2/recovery.c: (do_one_pass, 526): JBD2: checking
block 4574
[ 748.607241] fs/jbd2/recovery.c: (do_one_pass, 549): Found magic 2,
sequence 6943
[ 748.608248] commit_csum_verify error: provided 81be34bd.
[ 748.610199] direct block verify error.
[ 748.612327] chksum_error.
[ 748.617616] error line 2.
[ 748.622085] error line 3.
[ 748.623291] fs/jbd2/recovery.c: (do_one_pass, 517): Scanning for
sequence ID 6943 at 4575/8192
[ 748.624133] fs/jbd2/recovery.c: (do_one_pass, 526): JBD2: checking
block 4575
[ 748.625134] done, info->end_transaction is 6943.
[ 748.626966] done with update, info->end_transaction is 6943.
[ 748.629521] next_commit_id: initial value 6799.
[ 748.630836] fs/jbd2/recovery.c: (do_one_pass, 491): Starting recovery
pass 1
[ 748.631529] fs/jbd2/recovery.c: (do_one_pass, 517): Scanning for
sequence ID 6799 at 6734/8192
[ 748.633346] fs/jbd2/recovery.c: (do_one_pass, 526): JBD2: checking
block 6734
<<<
What do you think?
Is JBD2 too aggressive in declaring something a journal corruption?
What is definitively correct is that the failure is minor. fsck -f was
able to correct it.
Thus: What is your understanding:
If a mount command fails due to a journal corruption, should user space
first try a fsck -f before giving up?
--
Manfred
View attachment "0001-jbd2-recovery.c-Continue-on-csum-failures-for-commit.patch" of type "text/x-patch" (2744 bytes)
View attachment "0003-DEBUG-patch-add-printk-to-fs-jbd2-recovery.c.patch" of type "text/x-patch" (3320 bytes)
Powered by blists - more mailing lists