Message-ID: <20002902-39c5-914b-75b0-5a21b5cee25c@huawei.com>
Date: Tue, 13 Jun 2023 16:13:06 +0800
From: Zhihao Cheng <chengzhihao1@...wei.com>
To: Theodore Ts'o <tytso@....edu>, Zhang Yi <yi.zhang@...weicloud.com>
CC: <linux-ext4@...r.kernel.org>, <adilger.kernel@...ger.ca>,
<jack@...e.cz>, <yi.zhang@...wei.com>, <yukuai3@...wei.com>
Subject: Re: [PATCH v3 4/6] jbd2: Fix wrongly judgement for buffer head
removing while doing checkpoint
On 2023/6/13 12:31, Theodore Ts'o wrote:
> There is something about this patch which is causing test runs to hang
> when running "gce-xfstests -c ext4/adv -C 10 generic/475" at least
> 60-70% of the time.
>
> When I took a closer look, the problem seems to be e2fsck is hanging
> after a SEGV when running e2fsck -nf on the block device. This then
> causes the check script to hang, until the test appliance's safety
> timer triggers and forces a shutdown of the test VM and aborts the
> test run.
>
> The cause of the hang is clearly an e2fsprogs bug --- no matter how
> corrupted the file system is, e2fsck should never crash or hang. So
> something is clearly going wrong with e2fsck:
>
> ...
> Symlink /p1/dc/d14/dee/l154 (inode #2898) is invalid.
> Clear? no
>
> Entry 'l154' in /p1/dc/d14/dee (2753) has an incorrect filetype (was 7, should be 0).
> Fix? no
>
> corrupted size vs. prev_size
> Signal (6) SIGABRT si_code=SI_TKILL
>
> (Note: "corrupted size vs. prev_size" is issued by glibc when
> malloc's internal data structures have been corrupted. So
> there is definitely something going very wrong with e2fsck.)
>
> That being said, if I run the same test on the parent commit (patch
> 3/6, jbd2: remove journal_clean_one_cp_list()), e2fsck does *not* hang
> or crash, and the regression tests complete. So this patch is
> changing the behavior of the kernel in terms of the file system that
> is left behind after a large number of injected I/O errors.
>
> My plan therefore is to drop patches 4/6 through 6/6 of this patch
> series. This will allow at least the "long standing metadata
> corruption issue that happens from time to time" to be addressed, and
> it will give us time to study what's going on here in more detail.
> I've captured the compressed file system image which is causing e2fsck
> (version 1.47.0) to corrupt malloc's data structures, and I'll try to
> see what Address Sanitizer or valgrind shows about what's going on.
>
Hi Ted, I ran './check generic/475' for many rounds (e2fsck 1.47.0,
5-Feb-2023) and could not reproduce the problem with this patch applied.
Could you send me the compressed image that triggers the problem with
'fsck -fn'?

I agree that we should understand the problem before applying this patch.
> Looking at the patch, it looks pretty innocuous, and I don't
> understand how this could be making a significant enough difference
> that it's causing e2fsck, which had previously been working fine, to
> now start tossing its cookies. If you could double check the patch
> and see if you see anything that I might have missed in my code review,
> I'd really appreciate it.
>
> Thanks,
>
> - Ted
>
> .
>