linux-ext4 - Re: [PATCH v3 4/6] jbd2: Fix wrongly judgement for buffer head removing while doing checkpoint


Message-ID: <20002902-39c5-914b-75b0-5a21b5cee25c@huawei.com>
Date:   Tue, 13 Jun 2023 16:13:06 +0800
From:   Zhihao Cheng <chengzhihao1@...wei.com>
To:     Theodore Ts'o <tytso@....edu>, Zhang Yi <yi.zhang@...weicloud.com>
CC:     <linux-ext4@...r.kernel.org>, <adilger.kernel@...ger.ca>,
        <jack@...e.cz>, <yi.zhang@...wei.com>, <yukuai3@...wei.com>
Subject: Re: [PATCH v3 4/6] jbd2: Fix wrongly judgement for buffer head
 removing while doing checkpoint

On 2023/6/13 12:31, Theodore Ts'o wrote:
> There is something about this patch which is causing test runs to hang
> when running "gce-xfstests -c ext4/adv -C 10 generic/475" at least
> 60-70% of the time.
> 
> When I took a closer look, the problem seems to be e2fsck is hanging
> after a SEGV when running e2fsck -nf on the block device.  This then
> causes the check script to hang, until the test appliance's safety
> timer triggers and forces a shutdown of the test VM and aborts the
> test run.
> 
> The cause of the hang is clearly an e2fsprogs bug --- no matter how
> corrupted the file system is, e2fsck should never crash or hang.  So
> something is clearly going wrong with e2fsck:
> 
>      ...
>      Symlink /p1/dc/d14/dee/l154 (inode #2898) is invalid.
>      Clear? no
> 
>      Entry 'l154' in /p1/dc/d14/dee (2753) has an incorrect filetype (was 7, should be 0).
>      Fix? no
> 
>      corrupted size vs. prev_size
>      Signal (6) SIGABRT si_code=SI_TKILL
> 
>      (Note: "corrupted size vs prev_size" is issued by glibc when
>      malloc's internal data structures have been corrupted.  So
>      there is definitely something going very wrong with e2fsck.)
>      
> That being said, if I run the same test on the parent commit (patch
> 3/6, jbd2: remove journal_clean_one_cp_list()), e2fsck does *not* hang
> or crash, and the regression tests complete.  So this patch is
> changing the behavior of the kernel in terms of the file system that
> is left behind after a large number of injected I/O errors.
> 
> My plan therefore is to drop patches 4/6 through 6/6 of this patch
> series.  This will allow at least the "long standing metadata
> corruption issue that happens from time to time" to be addressed, and
> it will give us time to study what's going on here in more detail.
> I've captured the compressed file system image which is causing e2fsck
> (version 1.47.0) to corrupt malloc's data structure, and I'll try to
> see what using Address Sanitizer or valgrind show about what's going on.
> 

Hi Ted, I ran './check generic/475' for many rounds (e2fsck 1.47.0,
5-Feb-2023) and could not reproduce the problem with this patch
applied. Could you send me a compressed image that triggers the problem
with 'fsck -fn'?

I agree that we should understand the problem before applying this patch.

> Looking at the patch, it looks pretty innocuous, and I don't
> understand how this could be making a significant enough difference
> that it's causing e2fsck, which had previously been working fine, to
> now start tossing its cookies.  If you could double check the patch
> and see if you see anything that I might have missed in my code review,
> I'd really appreciate it.
> 
> Thanks,
> 
> 					- Ted
> 