Message-ID: <CABXGCsMiKe31UaoMV02gW4iJSKnBiO5jGQKej=Zem24mD0ObQw@mail.gmail.com>
Date:   Thu, 5 May 2022 06:58:11 +0500
From:   Mikhail Gavrilov <mikhail.v.gavrilov@...il.com>
To:     linux@...mhuis.info,
        Linux List Kernel Mailing <linux-kernel@...r.kernel.org>,
        linux-nvme@...ts.infradead.org, luto@...nel.org
Subject: [BUG][5.18rc5] nvme nvme0: controller is down; will reset:
 CSTS=0xffffffff, PCI_STATUS=0x10

Hi,
Today, for the first time, my new NVMe disk went down.
In the kernel logs, I found the following sequence of messages:

[ 3005.869069] [drm] free PSP TMR buffer
[ 4626.562712] nvme nvme0: controller is down; will reset:
CSTS=0xffffffff, PCI_STATUS=0x10
[ 4626.584716] nvme 0000:06:00.0: enabling device (0000 -> 0002)
[ 4626.585006] nvme nvme0: Removing after probe failure status: -19
[ 4626.590776] BTRFS error (device nvme0n1p3): bdev /dev/nvme0n1p3
errs: wr 1, rd 0, flush 0, corrupt 0, gen 0
[ 4626.590784] BTRFS error (device nvme0n1p3): bdev /dev/nvme0n1p3
errs: wr 3, rd 0, flush 0, corrupt 0, gen 0
[ 4626.590797] nvme0n1: detected capacity change from 7814037168 to 0
[ 4626.590814] BTRFS error (device nvme0n1p3): bdev /dev/nvme0n1p3
errs: wr 4, rd 0, flush 0, corrupt 0, gen 0
[ 4626.590816] BTRFS error (device nvme0n1p3): bdev /dev/nvme0n1p3
errs: wr 5, rd 0, flush 0, corrupt 0, gen 0
[ 4626.590816] BTRFS error (device nvme0n1p3): bdev /dev/nvme0n1p3
errs: wr 6, rd 0, flush 0, corrupt 0, gen 0
[ 4626.590832] BTRFS error (device nvme0n1p3): bdev /dev/nvme0n1p3
errs: wr 7, rd 0, flush 0, corrupt 0, gen 0
[ 4626.590835] BTRFS error (device nvme0n1p3): bdev /dev/nvme0n1p3
errs: wr 8, rd 0, flush 0, corrupt 0, gen 0
[ 4626.590838] BTRFS error (device nvme0n1p3): bdev /dev/nvme0n1p3
errs: wr 9, rd 0, flush 0, corrupt 0, gen 0
[ 4626.590847] BTRFS error (device nvme0n1p3): bdev /dev/nvme0n1p3
errs: wr 10, rd 0, flush 0, corrupt 0, gen 0
[ 4626.590847] BTRFS error (device nvme0n1p3): bdev /dev/nvme0n1p3
errs: wr 11, rd 0, flush 0, corrupt 0, gen 0
[ 4626.593059] BTRFS: error (device nvme0n1p3) in
btrfs_commit_transaction:2418: errno=-5 IO failure (Error while
writing out transaction)
[ 4626.593075] BTRFS info (device nvme0n1p3: state E): forced readonly
[ 4626.593099] BTRFS warning (device nvme0n1p3: state E): Skipping
commit of aborted transaction.
[ 4626.593107] BTRFS: error (device nvme0n1p3: state EA) in
cleanup_transaction:1982: errno=-5 IO failure
[ 4626.593137] BTRFS: error (device nvme0n1p3: state EA) in
btrfs_sync_log:3331: errno=-5 IO failure

Googling turned up many links to old reports (4.x kernels) and to APST
issue reports.
In a bug report on kernel.org [6], unfortunate users talk to each other
with no hope of a solution being found.
The most clarifying article turned out to be [1].

I then analyzed the output of the commands "nvme id-ctrl /dev/nvme0"
and "cat /sys/module/nvme_core/parameters/default_ps_max_latency_us":

# nvme id-ctrl /dev/nvme0
NVME Identify Controller:
vid       : 0x1bb1
ssvid     : 0x1bb1
sn        : 7VS00CLE
mn        : Seagate FireCuda 530 ZP4000GM30013
fr        : SU6SM001
[...]
ps    0 : mp:8.80W operational enlat:0 exlat:0 rrt:0 rrl:0
          rwt:0 rwl:0 idle_power:- active_power:-
ps    1 : mp:7.10W operational enlat:0 exlat:0 rrt:1 rrl:1
          rwt:1 rwl:1 idle_power:- active_power:-
ps    2 : mp:5.20W operational enlat:0 exlat:0 rrt:2 rrl:2
          rwt:2 rwl:2 idle_power:- active_power:-
ps    3 : mp:0.0620W non-operational enlat:2500 exlat:7500 rrt:3 rrl:3
          rwt:3 rwl:3 idle_power:- active_power:-
ps    4 : mp:0.0440W non-operational enlat:10500 exlat:65000 rrt:4 rrl:4
          rwt:4 rwl:4 idle_power:- active_power:-

# cat /sys/module/nvme_core/parameters/default_ps_max_latency_us
100000

I concluded that my problem is not related to APST, because 100000 is
greater than the total latency (enlat + exlat) of each non-operational
state: 2500 + 7500 = 10000 for ps3, and 10500 + 65000 = 75500 for ps4.

Or am I misinterpreting the results?
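For reference, here is a sketch of how I understand the check (not the
actual kernel code; the per-state enlat/exlat numbers are hard-coded from
the "nvme id-ctrl" output above, and the sysfs read falls back to 100000
if the file is absent):

```shell
# Sketch of the per-state APST eligibility check, as I understand it:
# a non-operational power state is only entered autonomously if its
# enlat + exlat fits under default_ps_max_latency_us.
limit=$(cat /sys/module/nvme_core/parameters/default_ps_max_latency_us \
        2>/dev/null || echo 100000)
for state in "ps3 2500 7500" "ps4 10500 65000"; do
    set -- $state                 # $1=name $2=enlat $3=exlat
    total=$(($2 + $3))
    if [ "$total" -le "$limit" ]; then
        echo "$1: enlat+exlat = $total us <= $limit us -> APST may use it"
    else
        echo "$1: enlat+exlat = $total us >  $limit us -> APST excludes it"
    fi
done
```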

Therefore, I would like to ask whether there are any other ideas why an
NVMe drive can stop working with the message "controller is down; will
reset: CSTS=0xffffffff, PCI_STATUS=0x10", which by itself says nothing
about why it happened.

My kernel is 5.18rc5.

Thanks in advance for any answer that clears things up, and for any
pointers on where to dig for a solution.
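If APST does turn out to be relevant after all, my plan (following the
workaround described in [1]) would be to rule it out by capping the
allowed transition latency at 0. This is an untested sketch of the usual
module option, not something I have verified against this drive:

```shell
# Untested sketch, per [1]: disable APST entirely by setting the maximum
# allowed transition latency to 0 us. Either on the kernel command line:
#   nvme_core.default_ps_max_latency_us=0
# or as a modprobe option (then regenerate the initramfs and reboot):
echo 'options nvme_core default_ps_max_latency_us=0' \
    | sudo tee /etc/modprobe.d/nvme_apst_off.conf
```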

[1] https://wiki.archlinux.org/title/Solid_state_drive/NVMe
[2] [# smartctl -a /dev/nvme0] - https://pastebin.com/JwSXwu6c
[3] [# nvme get-feature /dev/nvme0 -f 0x0c -H] - https://pastebin.com/KZ6FjhGt
[4] [# nvme id-ctrl /dev/nvme0] - https://pastebin.com/seEkPfF7
[5] [full dmesg] - https://pastebin.com/aNEaqtCV
[6] [bug report about Samsung PM951 NVMe] -
https://bugzilla.kernel.org/show_bug.cgi?id=195039

-- 
Best Regards,
Mike Gavrilov.
