linux-kernel - "controller is down; will reset" on SK Hynix NVMe drive in Lenovo IdeaPad Pro 5


Message-ID: <CACbz5yFi8hiPD0Avg=KVTpdEeJC=3MMSkGtYOBQ=OmB_O0khyQ@mail.gmail.com>
Date: Mon, 17 Nov 2025 14:39:17 +0100
From: Thomas ten Cate <ttencate@...il.com>
To: Keith Busch <kbusch@...nel.org>, Jens Axboe <axboe@...com>, Christoph Hellwig <hch@....de>, 
	Sagi Grimberg <sagi@...mberg.me>
Cc: linux-nvme@...ts.infradead.org, linux-kernel@...r.kernel.org
Subject: "controller is down; will reset" on SK Hynix NVMe drive in Lenovo
 IdeaPad Pro 5

Dear kernel heroes,

I'm encountering errors with the NVMe drive in my laptop, which appear
to be related to power saving modes (search keywords: APST, ASPM). It
got more serious in some recent kernel version, but seems to have been
present before.

After just booting, starting `dmesg -w` and waiting a bit, the log says:

[   43.710561] could not locate request for tag 0x0
[   43.710585] nvme nvme0: invalid id 0 completed on queue 1
[   43.710593] could not locate request for tag 0x0
[   43.710598] nvme nvme0: invalid id 0 completed on queue 1
[   43.710603] could not locate request for tag 0x0
[   43.710607] nvme nvme0: invalid id 0 completed on queue 1
[   43.710611] could not locate request for tag 0x0
[   43.710615] nvme nvme0: invalid id 0 completed on queue 1
[   73.744791] nvme nvme0: I/O tag 129 (4081) opcode 0x1 (Write) QID 1
timeout, aborting req_op:WRITE(1) size:32768
[   73.744862] nvme nvme0: I/O tag 130 (a082) opcode 0x1 (Write) QID 1
timeout, aborting req_op:WRITE(1) size:36864
[   73.744875] nvme nvme0: I/O tag 131 (8083) opcode 0x1 (Write) QID 1
timeout, aborting req_op:WRITE(1) size:4096
[   73.744886] nvme nvme0: I/O tag 133 (5085) opcode 0x1 (Write) QID 1
timeout, aborting req_op:WRITE(1) size:12288
[   73.756694] nvme nvme0: Abort status: 0x0
[   73.757641] nvme nvme0: Abort status: 0x0
[   73.758533] nvme nvme0: Abort status: 0x0
[   73.759422] nvme nvme0: Abort status: 0x0
[  103.824976] nvme nvme0: I/O tag 129 (4081) opcode 0x1 (Write) QID 1
timeout, reset controller
[  103.966268] nvme nvme0: 16/0/0 default/read/poll queues

Notice the 30-second delays. This problem has been present since
6.12.40 stable or maybe earlier, but went unnoticed until now
because things apparently recovered. Full log of a similar occurrence:
https://gist.github.com/ttencate/9f2c4739d9e8a4c0142fd8246b56a7d6
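
For anyone who wants to look for the same pattern in their own logs, here
is a minimal Python sketch that extracts the dmesg timestamps and reports
gaps longer than a threshold. The function name `stall_gaps` and the
25-second threshold are illustrative assumptions, not part of any kernel
tooling; the sample lines are taken from the log above.

```python
import re

# Matches the "[   43.710561]" timestamp prefix that dmesg prints.
TIMESTAMP = re.compile(r"^\[\s*(\d+\.\d+)\]")

def stall_gaps(lines, threshold=25.0):
    """Return (start, end, gap) tuples for consecutive timestamped lines
    that are more than `threshold` seconds apart (hypothetical helper)."""
    stamps = [float(m.group(1)) for line in lines if (m := TIMESTAMP.match(line))]
    return [(a, b, round(b - a, 3)) for a, b in zip(stamps, stamps[1:]) if b - a > threshold]

# A few lines copied from the log above; wrapped continuation lines
# without a timestamp are simply skipped.
sample_log = """
[   43.710615] nvme nvme0: invalid id 0 completed on queue 1
[   73.744791] nvme nvme0: I/O tag 129 (4081) opcode 0x1 (Write) QID 1
timeout, aborting req_op:WRITE(1) size:32768
[   73.759422] nvme nvme0: Abort status: 0x0
[  103.824976] nvme nvme0: I/O tag 129 (4081) opcode 0x1 (Write) QID 1
timeout, reset controller
"""

for start, end, gap in stall_gaps(sample_log.splitlines()):
    print(f"{gap:.3f} s of silence between {start} and {end}")
```

On the sample this reports two gaps of roughly 30 seconds each, one
before the abort and one before the controller reset.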

More recently, since 6.12.56 or maybe earlier, I'm also sometimes getting these:

[  336.613637] nvme nvme0: request 0x0 genctr mismatch (got 0x0 expected 0x9)
[  336.613659] nvme nvme0: invalid id 0 completed on queue 8
[  366.657750] nvme nvme0: controller is down; will reset:
CSTS=0xffffffff, PCI_STATUS=0x10
[  366.657768] nvme nvme0: Does your device have a faulty power saving
mode enabled?
[  366.657773] nvme nvme0: Try "nvme_core.default_ps_max_latency_us=0
pcie_aspm=off pcie_port_pm=off" and report a bug
[  366.761391] nvme 0000:03:00.0: enabling device (0000 -> 0002)
[  366.761842] nvme nvme0: Disabling device after reset failure: -19

In this case, the messages are followed by a slew of btrfs errors,
btrfs switches to read-only mode, and the drive becomes entirely
inaccessible until a reboot.

The log suggests adding the kernel arguments
"nvme_core.default_ps_max_latency_us=0 pcie_aspm=off
pcie_port_pm=off", which indeed makes all issues go away.
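
To confirm that the workaround is actually active on a running system,
the kernel command line can be compared against the three suggested
arguments. The following is a minimal Python sketch; `missing_params` and
`WORKAROUND_PARAMS` are hypothetical names, and the example command line
is made up for illustration. On a live system the string would come from
/proc/cmdline.

```python
# The three arguments suggested by the nvme driver's log message.
WORKAROUND_PARAMS = (
    "nvme_core.default_ps_max_latency_us=0",
    "pcie_aspm=off",
    "pcie_port_pm=off",
)

def missing_params(cmdline, wanted=WORKAROUND_PARAMS):
    """Return the wanted arguments that do not appear in `cmdline`
    (hypothetical helper; compares whole whitespace-separated tokens)."""
    present = set(cmdline.split())
    return [param for param in wanted if param not in present]

# Illustrative command line, not taken from the affected machine.
example = "root=/dev/nvme0n1p2 rw pcie_aspm=off"
print("missing:", missing_params(example))
```

An empty result means all three arguments were passed at boot.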

I haven't found a reliable way to trigger the latter error
specifically, though doing something I/O heavy like compiling a kernel
seems to make it more likely. This makes bisecting difficult, but
it's clear that something was going on in previous versions as well,
so I wouldn't necessarily call this a regression. Either way, the
issue is still present in mainline 6.17.8.

Since it happens only after some idle time, and disabling PM fixes it,
this seems related to power states. But of course, I cannot completely
rule out faulty hardware either.

Machine: Lenovo IdeaPad Pro 5 16APH8
Architecture: x86_64
NVMe drive: SK Hynix HFS001TEJ4X112N
Full lshw output:
https://gist.github.com/ttencate/5540c81454bbe1fa679955effba65eba

Distribution: Arch Linux
Kernel version: 6.17.8 (vanilla from commit 8ac42a6)
Kernel configuration:
https://gitlab.archlinux.org/archlinux/packaging/packages/linux-lts/-/blob/b0cac6a69041703bbe1aba4a2a269585d77b108b/config
(plus `make olddefconfig`)
GCC version: 15.2.1

This is my first kernel bug report, so I hope I didn't miss anything;
if I did, please let me know. I'd be happy to experiment or try out
patches.

Kind regards,

Thomas ten Cate