Message-ID: <20240123005700.9302-1-dan@danm.net>
Date: Mon, 22 Jan 2024 17:56:58 -0700
From: Dan Moulding <dan@...m.net>
To: Song Liu <song@...nel.org>
Cc: regressions@...ts.linux.dev,
	linux-raid@...r.kernel.org,
	linux-kernel@...r.kernel.org,
	stable@...r.kernel.org,
	Junxiao Bi <junxiao.bi@...cle.com>,
	Greg Kroah-Hartman <gregkh@...uxfoundation.org>,
	Dan Moulding <dan@...m.net>
Subject: [REGRESSION] 6.7.1: md: raid5 hang and unresponsive system; successfully bisected

After upgrading from 6.7.0 to 6.7.1, a couple of my systems with md
RAID-5 arrays started experiencing hangs. It starts with some
processes that write to the array getting stuck. The whole system
eventually becomes unresponsive and an unclean shutdown must be
performed (poweroff and reboot don't work).

While trying to diagnose the issue, I noticed that the md0_raid5
kernel thread consumes 100% CPU after the issue occurs. No relevant
warnings or errors were found in dmesg.
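
For anyone wanting to check for the same symptom, here is a minimal
sketch (not part of the original diagnosis) that samples the CPU usage
of a kernel thread from /proc. The helper names cpu_ticks, find_pid and
cpu_percent are made up for illustration; only the thread name
md0_raid5 comes from this report, and it will differ for other arrays.

```python
import os
import time


def cpu_ticks(stat_line):
    """Return utime + stime (in clock ticks) from a /proc/<pid>/stat line."""
    # The comm field is wrapped in parentheses and may contain spaces,
    # so split only the text that follows the last ')'.
    fields = stat_line[stat_line.rfind(")") + 2:].split()
    # fields[0] is the state (field 3 in proc(5)); utime and stime are
    # fields 14 and 15, i.e. indexes 11 and 12 here.
    return int(fields[11]) + int(fields[12])


def find_pid(name):
    """Return the PID of the first task whose comm equals name, or None."""
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open(f"/proc/{entry}/comm") as f:
                if f.read().strip() == name:
                    return int(entry)
        except OSError:
            continue  # the task exited while we were scanning
    return None


def cpu_percent(pid, interval=1.0):
    """Sample CPU usage of pid over interval seconds, as a percentage of one CPU."""
    def read():
        with open(f"/proc/{pid}/stat") as f:
            return cpu_ticks(f.read())

    before = read()
    time.sleep(interval)
    after = read()
    return 100.0 * (after - before) / (os.sysconf("SC_CLK_TCK") * interval)


if __name__ == "__main__":
    pid = find_pid("md0_raid5")  # thread name assumed from this report
    if pid is None:
        print("md0_raid5 thread not found")
    else:
        print(f"md0_raid5 (pid {pid}): {cpu_percent(pid):.0f}% CPU")
```

A reading that stays near 100% while writes to the array are stuck
would match the behavior described above.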

On 6.7.1, I can reproduce the issue somewhat reliably by copying a
large amount of data to the array. I am unable to reproduce the issue
at all on 6.7.0. The bisection was a bit difficult since I don't have
a 100% reliable method to reproduce the problem, but with some
perseverance I eventually managed to whittle it down to commit
0de40f76d567 ("Revert "md/raid5: Wait for MD_SB_CHANGE_PENDING in
raid5d""). After reverting that commit (i.e. reapplying the reverted
commit) on top of 6.7.1 I can no longer reproduce the problem at all.
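
As a rough illustration of why bisecting with an unreliable reproducer
takes patience: if a single reproduction attempt hangs a bad kernel
with probability p, then n clean attempts in a row still leave a
(1 - p)^n chance that the kernel is actually bad. The sketch below
computes how many clean attempts are needed before marking a commit
"good". The function name and the example probabilities are
hypothetical; no reproduction rate was measured for this report.

```python
import math


def clean_runs_needed(p_hang, max_false_good):
    """Number of consecutive clean runs needed so that the chance of a
    bad kernel passing all of them is at most max_false_good.

    p_hang: assumed probability that one attempt hangs a bad kernel.
    """
    if not 0 < p_hang < 1 or not 0 < max_false_good < 1:
        raise ValueError("both arguments must be strictly between 0 and 1")
    # Solve (1 - p_hang) ** n <= max_false_good for the smallest integer n.
    return math.ceil(math.log(max_false_good) / math.log(1 - p_hang))


if __name__ == "__main__":
    # Hypothetical numbers: a reproducer that hangs half the time,
    # and a 1% tolerance for wrongly marking a bad commit as good.
    print(clean_runs_needed(0.5, 0.01))  # prints 7
```

A single wrong "good" verdict sends the rest of the bisection down the
wrong half of the history, which is why a small tolerance is worth the
extra runs.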

Some details that might be relevant:
- Both systems are running MD RAID-5 with a journal device.
- mdadm in monitor mode is always running on both systems.
- Both systems were previously running 6.7.0 and earlier just fine.
- The older of the two systems has been running a raid5 array without
  incident for many years (kernel going back to at least 5.1) -- this
  is the first raid5 issue it has encountered.

Please let me know if there is any other helpful information that I
might be able to provide.

-- Dan

#regzbot introduced: 0de40f76d567133b871cd6ad46bb87afbce46983
