Message-ID: <CAO9zADxCYgQVOD9A1WYoS4JcLgvsNtGGr4xEZm9CMFHXsTV8ww@mail.gmail.com>
Date: Tue, 25 Nov 2025 09:42:11 -0500
From: Justin Piszcz <jpiszcz@...idpixels.com>
To: LKML <linux-kernel@...r.kernel.org>, linux-nvme@...ts.infradead.org, 
	linux-raid@...r.kernel.org, Btrfs BTRFS <linux-btrfs@...r.kernel.org>
Subject: WD Red SN700 4000GB, F/W: 11C120WD (Device not ready; aborting reset, CSTS=0x1)

Hello,

Issue/Summary:
1. Roughly once a month, a random WD Red SN700 4TB NVMe drive drops
out of the NAS array; after power cycling the device, the array
rebuilds successfully.

Details:
0. I use an NVMe NAS (Asustor FS6712X) with WD Red SN700 4TB drives (WDS400T1R0C).
1. Ever since I installed the drives, a random drive has dropped
offline every month or so, almost always when the system is idle.
2. I have troubleshot this with Asustor and WD/SanDisk.
3. Asustor noted that other users with the same configuration have
run into this problem.
4. When troubleshooting with WD/SanDisk, I was told that my main
option is to replace the drive, even though the issue occurs across
nearly all of the drives.
5. The drives' firmware is currently up to date according to the WD
Dashboard (verified by removing the drives and checking them on
another system).
6. As for the device/filesystem layout, the FS6712X is configured as
an MD-RAID6 device with Btrfs on top of it.
7. The "workaround" is to power cycle the FS6712X and when it boots up
the MD-RAID6 re-syncs back to a healthy state.
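
For reference, below is a minimal sketch of how I could inspect the
array and try to recover the dropped controller in place before
resorting to a full power cycle. This assumes nvme-cli and mdadm are
available under ADM; /dev/nvme2 and /dev/md1 are taken from the
August 27 log purely as placeholders, and it is unclear whether the
controller responds to a reset at all once it reports CSTS=0x1:

# Check which array members are failed or missing
cat /proc/mdstat
mdadm --detail /dev/md1

# Attempt a controller-level reset of the dropped device (this may
# fail with the same "Device not ready" state seen in dmesg)
nvme reset /dev/nvme2

# If the device comes back, re-add its partition to the array
mdadm --manage /dev/md1 --re-add /dev/nvme2n1p4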

I am using the latest Asustor ADM OS, which uses the 6.6.x kernel:
1. Linux FS6712X-EB92 6.6.x #1 SMP PREEMPT_DYNAMIC Tue Nov  4 00:53:39
CST 2025 x86_64 GNU/Linux

Questions:
1. Have others experienced this failure scenario?
2. Are there identified workarounds for this issue outside of power
cycling the device when this happens?
3. Are there any debug options that can be enabled that could help
pinpoint the root cause? (See the sketch after this list for the
checks I have come up with so far.)
4. Within the BIOS settings (the video below shows them starting at
2:18), there are some advanced options; could a power-saving feature
or other setting be modified to address this issue?
4a. https://www.youtube.com/watch?v=YytWFtgqVy0
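
Regarding questions 3 and 4, here is a rough sketch of the checks I
would start with, assuming nvme-cli is installed and that APST/ASPM
power management is a plausible factor (an assumption on my part, not
a confirmed cause); the device name is a placeholder:

# Controller power states (listed near the end of the -H output)
nvme id-ctrl -H /dev/nvme2

# APST (Autonomous Power State Transition) table, feature 0x0c
nvme get-feature /dev/nvme2 -f 0x0c -H

# Kernel-wide APST latency cap; 0 disables APST entirely
cat /sys/module/nvme_core/parameters/default_ps_max_latency_us

# Candidate boot parameters to rule out power management:
#   nvme_core.default_ps_max_latency_us=0   (disable APST)
#   pcie_aspm=off                            (disable PCIe ASPM)

# Extra NVMe driver logging, if the kernel has dynamic debug enabled
echo 'module nvme_core +p' > /sys/kernel/debug/dynamic_debug/control
echo 'module nvme +p' > /sys/kernel/debug/dynamic_debug/control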

The last failures have occurred at random times on the following days:
1. August 27, 2025
2. September 19th, 2025
3. September 29th, 2025
4. October 28th, 2025
5. November 24, 2025

Chipset being used:
1. ASMedia Technology Inc. ASM2806 4-Port PCIe x2 Gen3 Packet Switch
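
Since all of the drives sit behind this packet switch, the switch's
link and ASPM state may also be worth checking; a sketch assuming
lspci is available (0x1b21 should be ASMedia's PCI vendor ID; run as
root for the full capability output):

# PCI topology, to see which drives hang off the ASM2806 switch
lspci -tv

# Link capabilities/status and ASPM state on the switch's ports
lspci -d 1b21: -vvv | grep -E 'LnkCap|LnkSta|ASPM'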

Details:

1. August 27, 2025
[1156824.598513] nvme nvme2: I/O 5 QID 0 timeout, reset controller
[1156896.035355] nvme nvme2: Device not ready; aborting reset, CSTS=0x1
[1156906.057936] nvme nvme2: Device not ready; aborting reset, CSTS=0x1
[1158185.737571] md/raid:md1: Disk failure on nvme2n1p4, disabling device.
[1158185.744188] md/raid:md1: Operation continuing on 11 devices.

2. September 19th, 2025
[2001664.727044] nvme nvme9: I/O 26 QID 0 timeout, reset controller
[2001736.159123] nvme nvme9: Device not ready; aborting reset, CSTS=0x1
[2001746.180813] nvme nvme9: Device not ready; aborting reset, CSTS=0x1
[2002368.631788] md/raid:md1: Disk failure on nvme9n1p4, disabling device.
[2002368.638414] md/raid:md1: Operation continuing on 11 devices.
[2003213.517965] md/raid1:md0: Disk failure on nvme9n1p2, disabling device.
[2003213.517965] md/raid1:md0: Operation continuing on 11 devices.

3. September 29th, 2025
[858305.408049] nvme nvme3: I/O 8 QID 0 timeout, reset controller
[858376.843140] nvme nvme3: Device not ready; aborting reset, CSTS=0x1
[858386.865240] nvme nvme3: Device not ready; aborting reset, CSTS=0x1
[858386.883053] md/raid:md1: Disk failure on nvme3n1p4, disabling device.
[858386.889586] md/raid:md1: Operation continuing on 11 devices.

4. October 28th, 2025
[502963.821407] nvme nvme4: I/O 0 QID 0 timeout, reset controller
[503035.257391] nvme nvme4: Device not ready; aborting reset, CSTS=0x1
[503045.282923] nvme nvme4: Device not ready; aborting reset, CSTS=0x1
[503142.226962] md/raid:md1: Disk failure on nvme4n1p4, disabling device.
[503142.233496] md/raid:md1: Operation continuing on 11 devices.

5. November 24th, 2025
[1658454.034633] nvme nvme2: I/O 24 QID 0 timeout, reset controller
[1658525.470287] nvme nvme2: Device not ready; aborting reset, CSTS=0x1
[1658535.491803] nvme nvme2: Device not ready; aborting reset, CSTS=0x1
[1658535.517638] md/raid1:md0: Disk failure on nvme2n1p2, disabling device.
[1658535.517638] md/raid1:md0: Operation continuing on 11 devices.
[1659258.368386] md/raid:md1: Disk failure on nvme2n1p4, disabling device.
[1659258.375012] md/raid:md1: Operation continuing on 11 devices.


Justin
