Message-ID: <CABXGCsMiKe31UaoMV02gW4iJSKnBiO5jGQKej=Zem24mD0ObQw@mail.gmail.com>
Date:   Thu, 5 May 2022 06:58:11 +0500
From:   Mikhail Gavrilov <mikhail.v.gavrilov@...il.com>
To:     linux@...mhuis.info,
        Linux List Kernel Mailing <linux-kernel@...r.kernel.org>,
        linux-nvme@...ts.infradead.org, luto@...nel.org
Subject: [BUG][5.18rc5] nvme nvme0: controller is down; will reset:
 CSTS=0xffffffff, PCI_STATUS=0x10

Hi,
Today, for the first time, my new NVMe disk went down.
In the kernel logs, I found the following sequence of messages:

[ 3005.869069] [drm] free PSP TMR buffer
[ 4626.562712] nvme nvme0: controller is down; will reset:
CSTS=0xffffffff, PCI_STATUS=0x10
[ 4626.584716] nvme 0000:06:00.0: enabling device (0000 -> 0002)
[ 4626.585006] nvme nvme0: Removing after probe failure status: -19
[ 4626.590776] BTRFS error (device nvme0n1p3): bdev /dev/nvme0n1p3
errs: wr 1, rd 0, flush 0, corrupt 0, gen 0
[ 4626.590784] BTRFS error (device nvme0n1p3): bdev /dev/nvme0n1p3
errs: wr 3, rd 0, flush 0, corrupt 0, gen 0
[ 4626.590797] nvme0n1: detected capacity change from 7814037168 to 0
[ 4626.590814] BTRFS error (device nvme0n1p3): bdev /dev/nvme0n1p3
errs: wr 4, rd 0, flush 0, corrupt 0, gen 0
[ 4626.590816] BTRFS error (device nvme0n1p3): bdev /dev/nvme0n1p3
errs: wr 5, rd 0, flush 0, corrupt 0, gen 0
[ 4626.590816] BTRFS error (device nvme0n1p3): bdev /dev/nvme0n1p3
errs: wr 6, rd 0, flush 0, corrupt 0, gen 0
[ 4626.590832] BTRFS error (device nvme0n1p3): bdev /dev/nvme0n1p3
errs: wr 7, rd 0, flush 0, corrupt 0, gen 0
[ 4626.590835] BTRFS error (device nvme0n1p3): bdev /dev/nvme0n1p3
errs: wr 8, rd 0, flush 0, corrupt 0, gen 0
[ 4626.590838] BTRFS error (device nvme0n1p3): bdev /dev/nvme0n1p3
errs: wr 9, rd 0, flush 0, corrupt 0, gen 0
[ 4626.590847] BTRFS error (device nvme0n1p3): bdev /dev/nvme0n1p3
errs: wr 10, rd 0, flush 0, corrupt 0, gen 0
[ 4626.590847] BTRFS error (device nvme0n1p3): bdev /dev/nvme0n1p3
errs: wr 11, rd 0, flush 0, corrupt 0, gen 0
[ 4626.593059] BTRFS: error (device nvme0n1p3) in
btrfs_commit_transaction:2418: errno=-5 IO failure (Error while
writing out transaction)
[ 4626.593075] BTRFS info (device nvme0n1p3: state E): forced readonly
[ 4626.593099] BTRFS warning (device nvme0n1p3: state E): Skipping
commit of aborted transaction.
[ 4626.593107] BTRFS: error (device nvme0n1p3: state EA) in
cleanup_transaction:1982: errno=-5 IO failure
[ 4626.593137] BTRFS: error (device nvme0n1p3: state EA) in
btrfs_sync_log:3331: errno=-5 IO failure

Googling turned up many links to old reports (4.x kernels) and to APST
issue reports.
In a bug report on kernel.org [6], unfortunate users talk to each other
with no hope of a solution being found.
The most clarifying article turned out to be [1].

I then analyzed the output of the commands "nvme id-ctrl /dev/nvme0"
and "cat /sys/module/nvme_core/parameters/default_ps_max_latency_us":

# nvme id-ctrl /dev/nvme0
NVME Identify Controller:
vid       : 0x1bb1
ssvid     : 0x1bb1
sn        : 7VS00CLE
mn        : Seagate FireCuda 530 ZP4000GM30013
fr        : SU6SM001
[...]
ps    0 : mp:8.80W operational enlat:0 exlat:0 rrt:0 rrl:0
          rwt:0 rwl:0 idle_power:- active_power:-
ps    1 : mp:7.10W operational enlat:0 exlat:0 rrt:1 rrl:1
          rwt:1 rwl:1 idle_power:- active_power:-
ps    2 : mp:5.20W operational enlat:0 exlat:0 rrt:2 rrl:2
          rwt:2 rwl:2 idle_power:- active_power:-
ps    3 : mp:0.0620W non-operational enlat:2500 exlat:7500 rrt:3 rrl:3
          rwt:3 rwl:3 idle_power:- active_power:-
ps    4 : mp:0.0440W non-operational enlat:10500 exlat:65000 rrt:4 rrl:4
          rwt:4 rwl:4 idle_power:- active_power:-

# cat /sys/module/nvme_core/parameters/default_ps_max_latency_us
100000

I concluded that my problem is not related to APST, because 100000 is
greater than the total latency (enlat + exlat) of each non-operational
state: 2500 + 7500 = 10000 for ps3, and 10500 + 65000 = 75500 for ps4.

Or am I misinterpreting the results?
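For reference, here is a sketch of how I understand the check (not the
actual kernel code; the per-state enlat/exlat numbers are hard-coded from
the "nvme id-ctrl" output above, and the sysfs read falls back to 100000
if the file is absent):

```shell
# Sketch of the per-state APST eligibility check, as I understand it:
# a non-operational power state is only entered autonomously if its
# enlat + exlat fits under default_ps_max_latency_us.
limit=$(cat /sys/module/nvme_core/parameters/default_ps_max_latency_us \
        2>/dev/null || echo 100000)
for state in "ps3 2500 7500" "ps4 10500 65000"; do
    set -- $state                 # $1=name $2=enlat $3=exlat
    total=$(($2 + $3))
    if [ "$total" -le "$limit" ]; then
        echo "$1: enlat+exlat = $total us <= $limit us -> APST may use it"
    else
        echo "$1: enlat+exlat = $total us >  $limit us -> APST excludes it"
    fi
done
```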

Therefore, I would like to ask whether there are any other ideas why an
NVMe drive can stop working with the message "controller is down; will
reset: CSTS=0xffffffff, PCI_STATUS=0x10", which by itself says nothing
about why it happened.

My kernel is 5.18rc5.

Thanks in advance for any answer that clears things up, and for any
pointers on where to dig for a solution.
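If APST does turn out to be relevant after all, my plan (following the
workaround described in [1]) would be to rule it out by capping the
allowed transition latency at 0. This is an untested sketch of the usual
module option, not something I have verified against this drive:

```shell
# Untested sketch, per [1]: disable APST entirely by setting the maximum
# allowed transition latency to 0 us. Either on the kernel command line:
#   nvme_core.default_ps_max_latency_us=0
# or as a modprobe option (then regenerate the initramfs and reboot):
echo 'options nvme_core default_ps_max_latency_us=0' \
    | sudo tee /etc/modprobe.d/nvme_apst_off.conf
```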

[1] https://wiki.archlinux.org/title/Solid_state_drive/NVMe
[2] [# smartctl -a /dev/nvme0] - https://pastebin.com/JwSXwu6c
[3] [# nvme get-feature /dev/nvme0 -f 0x0c -H] - https://pastebin.com/KZ6FjhGt
[4] [# nvme id-ctrl /dev/nvme0] - https://pastebin.com/seEkPfF7
[5] [full dmesg] - https://pastebin.com/aNEaqtCV
[6] [bug report about Samsung PM951 NVMe] -
https://bugzilla.kernel.org/show_bug.cgi?id=195039

-- 
Best Regards,
Mike Gavrilov.
