Open Source and information security mailing list archives
 
Message-ID: <20240301231222.20120-1-dan@danm.net>
Date: Fri,  1 Mar 2024 16:12:22 -0700
From: Dan Moulding <dan@...m.net>
To: junxiao.bi@...cle.com
Cc: dan@...m.net,
	gregkh@...uxfoundation.org,
	linux-kernel@...r.kernel.org,
	linux-raid@...r.kernel.org,
	regressions@...ts.linux.dev,
	song@...nel.org,
	stable@...r.kernel.org
Subject: Re: [REGRESSION] 6.7.1: md: raid5 hang and unresponsive system; successfully bisected

> 5. Looks like the block layer or the underlying (scsi/virtio-scsi)
> layer may have some issue which is leaving the io request from the md
> layer in a partially complete state. I can't see how this can be
> related to the commit bed9e27baf52 ("Revert "md/raid5: Wait for
> MD_SB_CHANGE_PENDING in raid5d"")

There is no question that the above-mentioned commit makes this
problem appear. While the root cause may ultimately lie outside the
md/raid5 code (I'm not able to make that assessment), I can tell you
that this change is what turned it into a runtime regression. Prior
to that change, I could not reproduce the problem. One of my RAID-5
arrays had been running on every kernel version since 4.8 without
issue. Then, with kernel 6.7.1, the problem appeared within hours of
running the new code, and it affected not just one but two different
machines with RAID-5 arrays. With that change reverted, the problem
is not reproducible. When I recently upgraded to 6.8-rc5 I
immediately hit the problem again (because the commit hadn't yet been
reverted in mainline). I'm now running 6.8.0-rc5 on one of my
affected machines without issue, after reverting that commit on top
of it.

It would seem a very unlikely coincidence that a careful bisection of
all changes between 6.7.0 and 6.7.1 pointed at that commit as the
culprit, that the change is to raid5.c, and that the hang happens in
the raid5 kernel task, if there were no connection. :)
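For anyone wanting to repeat this kind of bisection, the mechanics
look roughly like the sketch below. It uses a throwaway repository so
it is self-contained; the repo, the `check.sh` script, and the "bad"
condition are hypothetical stand-ins (for the real kernel bisection
between v6.7.0 and v6.7.1, the check step would be: build, boot, and
try to reproduce the raid5 hang).

```shell
#!/bin/sh
# Illustrative sketch of a `git bisect run` workflow on a throwaway
# repo; NOT the actual kernel bisection, just the same mechanics.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email "demo@example.com"
git config user.name "Bisect Demo"

# Five commits; pretend commit 3 introduced the regression.
for i in 1 2 3 4 5; do
    echo "$i" > state
    git add state
    git commit -qm "commit $i"
done

# A check that exits 0 for "good" commits and non-zero for "bad" ones.
cat > check.sh <<'EOF'
#!/bin/sh
[ "$(cat state)" -lt 3 ]
EOF
chmod +x check.sh

# Bisect between the first (known-good) and last (known-bad) commits,
# letting git drive the search with the check script.
git bisect start HEAD "$(git rev-list --max-parents=0 HEAD)"
git bisect run ./check.sh
first_bad=$(git rev-parse refs/bisect/bad)
git bisect reset
echo "first bad commit: $(git log -1 --format=%s "$first_bad")"
```

The final confirmation step on the real tree is then to revert the
identified commit on top of the release (e.g. `git revert
bed9e27baf52`) and retest.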

> Are you able to reproduce using some regular scsi disk? I would like
> to rule out whether this is related to virtio-scsi.

The first time I hit this problem was on two bare-metal machines, one
server and one desktop with different hardware. I then set up this
virtual machine just to reproduce the problem in a different
environment (and to see if I could reproduce it with a distribution
kernel since the other machines are running custom kernel
configurations). So I'm able to reproduce it on:

- A virtual machine
- Bare metal machines
- Custom kernel configurations with code straight from stable and mainline
- Fedora 39 distribution kernel

> And I see the kernel version is 6.8.0-rc5 from vmcore, is this the
> official mainline v6.8-rc5 without any other patches?

Yes, this particular vmcore was from the Fedora 39 VM I used to
reproduce the problem, but running the straight 6.8.0-rc5 mainline
code (so that you wouldn't have to worry about any possible
interference from distribution patches).

Cheers,

-- Dan
