lists.openwall.net | lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening PHC | |
Open Source and information security mailing list archives
| ||
|
Date: Thu, 30 Dec 2021 09:16:13 +0100 From: Manfred Spraul <manfred@...orfullife.com> To: Theodore Ts'o <tytso@....edu> Cc: adilger.kernel@...ger.ca, linux-ext4@...r.kernel.org, 1vier1@....de Subject: Re: JBD2: journal transaction 6943 on loop0-8 is corrupt. Hi Ted, On 12/30/21 02:37, Theodore Ts'o wrote: > On Tue, Dec 28, 2021 at 09:36:22PM +0100, Manfred Spraul wrote: >> Hi, >> >> with simulated power failures, I see a corrupted journal >> >> [39056.200845] JBD2: journal transaction 6943 on loop0-8 is corrupt. >> [39056.200851] EXT4-fs (loop0): error loading journal > This means that the journal replay found a commit which was *not* the > last commit, and which contained a CRC error. If it's the last commit > (e.g., there is no valid subsequent commit block), then it's possible > that the journal commit was never completed before the system crashed > --- e.g., it was an interrupted commit. It is the last commit, there are no valid subsequent commit blocks. The current failure model is simple: all blocks up to block <n> are written, the blocks starting from <n+1> are discarded. What I can't rule out, but I think this is not what I see: The image is small (512 MB) and everything is in memory. Thus I would not rule out that the whole journal is filled within less than one second. The commit header contains h_commit_sec and h_commit_nsec, but from what I see, do_one_pass() evaluates only h_commit_sec. > Your test is aborting the commit at various points in the write I/O > stream, so it should be simulating an interrupted commit (assuming > that it's not corrupting any I/O. So the jbd2 layer should have > understood it was the last commit in the journal, and been OK with the > checksum failure. I think the jbd2 layer understood that it was the last commit - but it nevertheless failed the recovery. https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/jbd2/recovery.c?h=v5.16-rc7#n809 > | > if (pass == PASS_SCAN && > !jbd2_commit_block_csum_verify(journal, > bh->b_data)) { > chksum_error: > if (commit_time < last_trans_commit_time) > goto ignore_crc_mismatch; > info->end_transaction = next_commit_ID; > > if (!jbd2_has_feature_async_commit(journal)) { > journal->j_failed_commit = > next_commit_ID; > brelse(bh); > break; > } > } > | async_commit() is false. journal->j_failed_commit is set, and thus after journal playback, a failure is reported :-( What I have done: - I have removed the line journal->j_failed_commit = next_commit_ID, then the image is mounted. - I had added pr_info() lines, and this confirms that it starts with a failure of jbd2_block_csum_verify(). dmesg with JBD2 debug output and additional pr_info lines: >>> [ 748.591348] next_commit_id increased: 6943. [ 748.593397] fs/jbd2/recovery.c: (do_one_pass, 517): Scanning for sequence ID 6943 at 4544/8192 [ 748.594322] fs/jbd2/recovery.c: (do_one_pass, 526): JBD2: checking block 4544 [ 748.595879] fs/jbd2/recovery.c: (do_one_pass, 549): Found magic 5, sequence 6943 [ 748.596800] fs/jbd2/recovery.c: (do_one_pass, 517): Scanning for sequence ID 6943 at 4545/8192 [ 748.600073] fs/jbd2/recovery.c: (do_one_pass, 526): JBD2: checking block 4545 [ 748.601398] fs/jbd2/recovery.c: (do_one_pass, 549): Found magic 1, sequence 6943 [ 748.603481] fs/jbd2/recovery.c: (do_one_pass, 517): Scanning for sequence ID 6943 at 4574/8192 [ 748.606238] fs/jbd2/recovery.c: (do_one_pass, 526): JBD2: checking block 4574 [ 748.607241] fs/jbd2/recovery.c: (do_one_pass, 549): Found magic 2, sequence 6943 [ 748.608248] commit_csum_verify error: provided 81be34bd. [ 748.610199] direct block verify error. [ 748.612327] chksum_error. [ 748.617616] error line 2. [ 748.622085] error line 3. [ 748.623291] fs/jbd2/recovery.c: (do_one_pass, 517): Scanning for sequence ID 6943 at 4575/8192 [ 748.624133] fs/jbd2/recovery.c: (do_one_pass, 526): JBD2: checking block 4575 [ 748.625134] done, info->end_transaction is 6943. [ 748.626966] done with update, info->end_transaction is 6943. [ 748.629521] next_commit_id: initial value 6799. [ 748.630836] fs/jbd2/recovery.c: (do_one_pass, 491): Starting recovery pass 1 [ 748.631529] fs/jbd2/recovery.c: (do_one_pass, 517): Scanning for sequence ID 6799 at 6734/8192 [ 748.633346] fs/jbd2/recovery.c: (do_one_pass, 526): JBD2: checking block 6734 <<< What do you think? Is JBD2 too aggressive in declaring something a journal corruption? What is definitively correct is that the failure is minor. fsck -f was able to correct it. Thus: What is your understanding: If a mount command fails due to a journal corruption, should user space first try a fsck -f before giving up? -- Manfred View attachment "0001-jbd2-recovery.c-Continue-on-csum-failures-for-commit.patch" of type "text/x-patch" (2744 bytes) View attachment "0003-DEBUG-patch-add-printk-to-fs-jbd2-recovery.c.patch" of type "text/x-patch" (3320 bytes)
Powered by blists - more mailing lists