Message-ID: <CAPhsuW7KMLHHrcyZhKS_m_fwWSKM66VFXaLj9fmY+ab5Mu3pvA@mail.gmail.com>
Date: Tue, 23 Jan 2024 14:21:53 -0800
From: Song Liu <song@...nel.org>
To: Dan Moulding <dan@...m.net>
Cc: gregkh@...uxfoundation.org, junxiao.bi@...cle.com, 
	linux-kernel@...r.kernel.org, linux-raid@...r.kernel.org, 
	regressions@...ts.linux.dev, stable@...r.kernel.org, yukuai1@...weicloud.com
Subject: Re: [REGRESSION] 6.7.1: md: raid5 hang and unresponsive system;
 successfully bisected

Hi Dan,

On Tue, Jan 23, 2024 at 1:53 PM Dan Moulding <dan@...m.net> wrote:
>
> > I think we still want d6e035aad6c0 in 6.7.2. We may need to revert
> > 0de40f76d567 on top of that. Could you please test it out? (6.7.1 +
> > d6e035aad6c0 + revert 0de40f76d567.)
>
> I was operating under the assumption that the two commits were
> intended to exist as a pair (the first reverts the old fix because the
> next commit has what is supposed to be a better fix). But since the
> regression still exists even with both patches applied, the old fix
> must be reapplied to resolve the current regression.
>
> But, as you've requested, I have tested 6.7.1 + d6e035aad6c0 + revert
> 0de40f76d567 and it seems fine. So I have no issue if you think it
> makes sense to accept d6e035aad6c0 on its own, even though it would
> break up the pair of commits.

Thanks for running the test!
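
For reference, that test tree can be assembled with something like the
following (just a sketch; it assumes a local stable clone where the
v6.7.1 tag and commit d6e035aad6c0 are both reachable, and that
0de40f76d567 is already part of v6.7.1, as the bisection indicates):

    git checkout -b md-regression-test v6.7.1
    # apply the new fix on top of 6.7.1
    git cherry-pick d6e035aad6c0
    # restore the old fix by reverting its revert
    git revert 0de40f76d567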

>
> > OTOH, I am not able to reproduce the issue. Could you please help
> > get more information:
> >   cat /proc/mdstat
>
> Here is /proc/mdstat from one of the systems where I can reproduce it:
>
>     $ cat /proc/mdstat
>     Personalities : [raid6] [raid5] [raid4]
>     md0 : active raid5 dm-0[4](J) sdc[3] sda[0] sdb[1]
>           3906764800 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/3] [UUU]
>
>     unused devices: <none>
>
> dm-0 is an LVM logical volume which is backed by an NVMe SSD. The
> others are run-of-the-mill SATA SSDs.
>
> >  profile (perf, etc.) of the md thread
>
> I might need a little more pointing in the direction of what exactly
> to look for and under what conditions (i.e. should I run perf while
> the thread is stuck in the 100% CPU loop? what kind of report should I
> ask perf for?). Also, are there any debug options I could enable in
> the kernel configuration that might help gather more information?
> Maybe something in debugfs? I currently get absolutely no warnings or
> errors in dmesg when the problem occurs.

It appears the md thread has hit an infinite loop, so I would like to
know what it is doing. We can probably get that information with the
perf tool, something like:

perf record -a
perf report
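
If the system-wide capture turns out to be too noisy, a more targeted
variant could look something like the sketch below (untested here; it
assumes the spinning kernel thread is named md0_raid5, matching the md0
array above, and that it stays busy long enough to sample):

    # find the md kernel thread that is stuck at 100% CPU
    pgrep md0_raid5
    # record call graphs for that thread for ~30 seconds while it spins
    perf record -g -p "$(pgrep md0_raid5)" -- sleep 30
    # summarize where the time is going
    perf report --stdio

While the loop is in progress, cat /proc/$(pgrep md0_raid5)/stack may
also show the kernel stack it is sitting in.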

Thanks,
Song
