Date: Wed, 6 Mar 2024 09:38:17 +0100
From: "Linux regression tracking (Thorsten Leemhuis)"
 <regressions@...mhuis.info>
To: Song Liu <song@...nel.org>, Dan Moulding <dan@...m.net>
Cc: junxiao.bi@...cle.com, gregkh@...uxfoundation.org,
 linux-kernel@...r.kernel.org, linux-raid@...r.kernel.org,
 regressions@...ts.linux.dev, stable@...r.kernel.org
Subject: Re: [REGRESSION] 6.7.1: md: raid5 hang and unresponsive system;
 successfully bisected

On 02.03.24 01:05, Song Liu wrote:
> On Fri, Mar 1, 2024 at 3:12 PM Dan Moulding <dan@...m.net> wrote:
>>
>>> 5. It looks like the block layer or the underlying layer
>>> (scsi/virtio-scsi) may have some issue that leads to an io request
>>> from the md layer getting stuck in a partially completed state. I
>>> can't see how this can be related to commit bed9e27baf52 ("Revert
>>> "md/raid5: Wait for MD_SB_CHANGE_PENDING in raid5d"").
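[For reference: the logic that commit bed9e27baf52 backs out amounts to
raid5d parking itself while a superblock write is pending, roughly like
the sketch below. This is a paraphrase rather than the exact upstream
hunk; the mddev/conf locals and the locking context are assumed from
the surrounding raid5d() loop:

	if (test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags)) {
		spin_unlock_irq(&conf->device_lock);
		/* Sleep until the pending superblock update has been
		 * written out, instead of re-queueing stripes that
		 * cannot make progress while the flag is set. */
		wait_event(mddev->sb_wait,
			   !test_bit(MD_SB_CHANGE_PENDING,
				     &mddev->sb_flags));
		spin_lock_irq(&conf->device_lock);
	}

With that hunk reverted, raid5d goes back to re-queueing such stripes,
which is presumably where the hang reported here comes in.]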
>>
>> There is no question that the above-mentioned commit makes this
>> problem appear. While the root cause may ultimately lie outside the
>> md/raid5 code (I'm not able to make such an assessment), I can tell
>> you that this change is what turned it into a runtime regression.
>> Prior to that change, I cannot reproduce the problem. One of my
>> RAID-5 arrays had been running on every kernel version since 4.8
>> without issue. Then, with kernel 6.7.1, the problem appeared within
>> hours of running the new code, and it affected not just one but two
>> different machines with RAID-5 arrays. With that change reverted, the
>> problem is not reproducible. When I recently upgraded to 6.8-rc5, I
>> immediately hit the problem again (because the commit hadn't been
>> reverted in mainline yet). I'm now running 6.8.0-rc5 on one of my
>> affected machines without issue, after reverting that commit on top
>> of it.
> [...]
> I also tried again to reproduce the issue, but haven't had any luck.
> While I continue trying to reproduce it, I will also send the revert
> for the 6.8 kernel.
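[Side note: anyone who wants to test a current tree with the wait
restored before that lands can simply undo the revert on top of a
recent tag, using the commit id quoted above. The base tag here is
only an example:

	$ git checkout v6.8-rc5
	$ git revert bed9e27baf52   # revert the revert: restores the raid5d wait
	$ make olddefconfig && make -j"$(nproc)"
]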

Is that revert on the way in the meantime? I'm asking because Linus
might release 6.8 on Sunday.

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
If I did something stupid, please tell me, as explained on that page.
