Message-ID: <20240123215307.8083-1-dan@danm.net>
Date: Tue, 23 Jan 2024 14:53:07 -0700
From: Dan Moulding <dan@...m.net>
To: song@...nel.org
Cc: dan@...m.net,
gregkh@...uxfoundation.org,
junxiao.bi@...cle.com,
linux-kernel@...r.kernel.org,
linux-raid@...r.kernel.org,
regressions@...ts.linux.dev,
stable@...r.kernel.org,
yukuai1@...weicloud.com
Subject: Re: [REGRESSION] 6.7.1: md: raid5 hang and unresponsive system; successfully bisected
> I think we still want d6e035aad6c0 in 6.7.2. We may need to revert
> 0de40f76d567 on top of that. Could you please test it out? (6.7.1 +
> d6e035aad6c0 + revert 0de40f76d567.)
I was operating under the assumption that the two commits were
intended to go in as a pair: the first reverts the old fix because the
second is supposed to provide a better one. But since the regression
still exists even with both patches applied, the old fix must be
reapplied to resolve it.
But, as you requested, I have tested 6.7.1 + d6e035aad6c0 + revert of
0de40f76d567, and with that combination the hang does not occur. So I
have no objection if you think it makes sense to take d6e035aad6c0 on
its own, even though that breaks up the pair of commits.
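
For reference, this is roughly how I built the kernel I tested,
modulo any trivial conflicts (a sketch; the branch name is just
something I made up):

$ git checkout -b raid5-test v6.7.1
$ git cherry-pick d6e035aad6c0   # the new fix you want in 6.7.2
$ git revert 0de40f76d567        # restore the old fix by reverting its revert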
> OTOH, I am not able to reproduce the issue. Could you please help
> get more information:
> cat /proc/mdstat
Here is /proc/mdstat from one of the systems where I can reproduce it:
$ cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md0 : active raid5 dm-0[4](J) sdc[3] sda[0] sdb[1]
      3906764800 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/3] [UUU]

unused devices: <none>
dm-0 is an LVM logical volume backed by an NVMe SSD; the others are
run-of-the-mill SATA SSDs.
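
The (J) flag on dm-0 marks it as the array's write-journal device. If
it helps, the journal configuration can be double-checked with
something like this (a sketch; the device name is from this
particular system):

$ sudo mdadm --detail /dev/md0 | grep -iE 'journal|state'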
> profile (perf, etc.) of the md thread
I might need a little more direction on exactly what to look for and
under what conditions (i.e. should I run perf while the thread is
stuck in the 100% CPU loop? What kind of report should I ask perf
for?). Also, are there any debug options I could enable in the kernel
configuration that might help gather more information? Maybe
something in debugfs? I currently get no warnings or errors at all in
dmesg when the problem occurs. My tentative plan is sketched below.
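
In the meantime, here is roughly what I was planning to try on my own
(a sketch; I am assuming the spinning kthread shows up as md0_raid5
and that the raid456 module has dynamic debug available):

# Sample the md thread for 30 seconds while it is spinning at 100%
# CPU, then look at where the samples land.
$ sudo perf record -g -p "$(pgrep -f md0_raid5)" -- sleep 30
$ sudo perf report --stdio

# Enable any pr_debug() messages in the raid456 module (needs
# CONFIG_DYNAMIC_DEBUG).
$ echo 'module raid456 +p' | sudo tee /sys/kernel/debug/dynamic_debug/control

If there is a more useful perf invocation or a specific tracepoint I
should watch instead, just let me know.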
Cheers,
-- Dan