Message-ID: <20240302165538.30761-1-dan@danm.net>
Date: Sat, 2 Mar 2024 09:55:38 -0700
From: Dan Moulding <dan@...m.net>
To: junxiao.bi@...cle.com
Cc: dan@...m.net,
gregkh@...uxfoundation.org,
linux-kernel@...r.kernel.org,
linux-raid@...r.kernel.org,
regressions@...ts.linux.dev,
song@...nel.org,
stable@...r.kernel.org,
logang@...tatee.com
Subject: Re: [REGRESSION] 6.7.1: md: raid5 hang and unresponsive system; successfully bisected
> I have not root caused this yet, but I would like to share some
> findings from the vmcore Dan shared. From what I can see, this
> doesn't look like an md issue, but something wrong with the block
> layer or below.

Below is one other thing I found that might be of interest. It comes
from the email thread [1] that was linked from the original 2022
issue, whose fix the change in question reverts:

On 2022-09-02 17:46, Logan Gunthorpe wrote:
> I've made some progress on this nasty bug. I've got far enough to know it's not
> related to the blk-wbt or the block layer.
>
> Turns out a bunch of bios are stuck queued in a blk_plug in the md_raid5
> thread while that thread appears to be stuck in an infinite loop (so it never
> schedules or does anything to flush the plug).
>
> I'm still debugging to try and find out the root cause of that infinite loop,
> but I just wanted to send an update that the previous place I was stuck at
> was not correct.
>
> Logan
This certainly sounds similar to what we are seeing when that change
is reverted. The md0_raid5 thread appears to be stuck in an infinite
loop, consuming 100% CPU but not actually doing any work.
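
For anyone not familiar with block-layer plugging, the pattern Logan
describes looks roughly like the sketch below. blk_start_plug() and
blk_finish_plug() are the real block-layer APIs; the loop condition
and stripe-handling helper are hypothetical stand-ins, not the actual
raid5d code:

#include <linux/blkdev.h>	/* struct blk_plug, blk_{start,finish}_plug() */

/* Hypothetical stand-ins for the md stripe-handling code. */
static bool stripes_pending(void);
static void handle_one_stripe(void);

static void raid_worker_sketch(void)
{
	struct blk_plug plug;

	/*
	 * Start batching: bios submitted by this task are held on the
	 * plug instead of being issued to the device immediately.
	 */
	blk_start_plug(&plug);

	while (stripes_pending())	/* hypothetical condition */
		handle_one_stripe();	/* may queue bios on the plug */

	/*
	 * Plugged bios are issued here, or implicitly if the task
	 * blocks in schedule(). A thread spinning forever in the loop
	 * above, never sleeping, never flushes the plug, so its queued
	 * I/O never reaches the device.
	 */
	blk_finish_plug(&plug);
}

Plugging is normally a win, since it lets the block layer merge and
batch requests before they are issued; it only becomes a problem when
the plugging task stops making forward progress, as appears to be
happening here.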
-- Dan
[1] https://lore.kernel.org/r/7f3b87b6-b52a-f737-51d7-a4eec5c44112@deltatee.com