linux-ext4 - Re: JBD2: journal transaction 6943 on loop0-8 is corrupt.

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <8fe067d0-6d57-9dd7-2c10-5a2c34037ee1@colorfullife.com>
Date:   Thu, 30 Dec 2021 09:16:13 +0100
From:   Manfred Spraul <manfred@...orfullife.com>
To:     Theodore Ts'o <tytso@....edu>
Cc:     adilger.kernel@...ger.ca, linux-ext4@...r.kernel.org, 1vier1@....de
Subject: Re: JBD2: journal transaction 6943 on loop0-8 is corrupt.

Hi Ted,

On 12/30/21 02:37, Theodore Ts'o wrote:
> On Tue, Dec 28, 2021 at 09:36:22PM +0100, Manfred Spraul wrote:
>> Hi,
>>
>> with simulated power failures, I see a corrupted journal
>>
>> [39056.200845] JBD2: journal transaction 6943 on loop0-8 is corrupt.
>> [39056.200851] EXT4-fs (loop0): error loading journal
> This means that the journal replay found a commit which was *not* the
> last commit, and which contained a CRC error.  If it's the last commit
> (e.g., there is no valid subsequent commit block), then it's possible
> that the journal commit was never completed before the system crashed
> --- e.g., it was an interrupted commit.

It is the last commit, there are no valid subsequent commit blocks.

The current failure model is simple: all blocks up to block <n> are 
written, the blocks starting from <n+1> are discarded.

What I can't rule out, but I think this is not what I see:
The image is small (512 MB) and everything is in memory. Thus I would 
not rule out that the whole journal is filled within less than one second.
The commit header contains h_commit_sec and h_commit_nsec, but from what 
I see, do_one_pass() evaluates only h_commit_sec.


> Your test is aborting the commit at various points in the write I/O
> stream, so it should be simulating an interrupted commit (assuming
> that it's not corrupting any I/O.  So the jbd2 layer should have
> understood it was the last commit in the journal, and been OK with the
> checksum failure.

I think the jbd2 layer understood that it was the last commit - but it 
nevertheless failed the recovery.

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/jbd2/recovery.c?h=v5.16-rc7#n809

> |
> 		if  (pass  ==  PASS_SCAN  &&
> 			!jbd2_commit_block_csum_verify(journal,
> 							bh->b_data))  {
> 			chksum_error:
> 				if  (commit_time  <  last_trans_commit_time)
> 					goto  ignore_crc_mismatch;
> 				info->end_transaction  =  next_commit_ID;
>
> 				if  (!jbd2_has_feature_async_commit(journal))  {
> 					journal->j_failed_commit  =
> 						next_commit_ID;
> 					brelse(bh);
> 					break;
> 				}
> 			}
> |

async_commit() is false.

journal->j_failed_commit is set, and thus after journal playback, a 
failure is reported :-(


What I have done:

- I have removed the line journal->j_failed_commit = next_commit_ID, 
then the image is mounted.

- I had added pr_info() lines, and this confirms that it starts with a 
failure of jbd2_block_csum_verify().

dmesg with JBD2 debug output and additional pr_info lines:

 >>>

[  748.591348] next_commit_id increased: 6943.
[  748.593397] fs/jbd2/recovery.c: (do_one_pass, 517): Scanning for 
sequence ID 6943 at 4544/8192
[  748.594322] fs/jbd2/recovery.c: (do_one_pass, 526): JBD2: checking 
block 4544
[  748.595879] fs/jbd2/recovery.c: (do_one_pass, 549): Found magic 5, 
sequence 6943
[  748.596800] fs/jbd2/recovery.c: (do_one_pass, 517): Scanning for 
sequence ID 6943 at 4545/8192
[  748.600073] fs/jbd2/recovery.c: (do_one_pass, 526): JBD2: checking 
block 4545
[  748.601398] fs/jbd2/recovery.c: (do_one_pass, 549): Found magic 1, 
sequence 6943
[  748.603481] fs/jbd2/recovery.c: (do_one_pass, 517): Scanning for 
sequence ID 6943 at 4574/8192
[  748.606238] fs/jbd2/recovery.c: (do_one_pass, 526): JBD2: checking 
block 4574
[  748.607241] fs/jbd2/recovery.c: (do_one_pass, 549): Found magic 2, 
sequence 6943
[  748.608248] commit_csum_verify error: provided 81be34bd.
[  748.610199] direct block verify error.
[  748.612327] chksum_error.
[  748.617616] error line 2.
[  748.622085] error line 3.
[  748.623291] fs/jbd2/recovery.c: (do_one_pass, 517): Scanning for 
sequence ID 6943 at 4575/8192
[  748.624133] fs/jbd2/recovery.c: (do_one_pass, 526): JBD2: checking 
block 4575
[  748.625134] done, info->end_transaction is 6943.
[  748.626966] done with update, info->end_transaction is 6943.
[  748.629521] next_commit_id: initial value 6799.
[  748.630836] fs/jbd2/recovery.c: (do_one_pass, 491): Starting recovery 
pass 1
[  748.631529] fs/jbd2/recovery.c: (do_one_pass, 517): Scanning for 
sequence ID 6799 at 6734/8192
[  748.633346] fs/jbd2/recovery.c: (do_one_pass, 526): JBD2: checking 
block 6734

<<<

What do you think?
Is JBD2 too aggressive in declaring something a journal corruption?

What is definitively correct is that the failure is minor. fsck -f was 
able to correct it.
Thus: What is your understanding:
If a mount command fails due to a journal corruption, should user space 
first try a fsck -f before giving up?


--

     Manfred

View attachment "0001-jbd2-recovery.c-Continue-on-csum-failures-for-commit.patch" of type "text/x-patch" (2744 bytes)

View attachment "0003-DEBUG-patch-add-printk-to-fs-jbd2-recovery.c.patch" of type "text/x-patch" (3320 bytes)