Date: Mon, 1 Jul 2024 08:23:53 -0400
From: Justin Piszcz <jpiszcz@...idpixels.com>
To: LKML <linux-kernel@...r.kernel.org>
Subject: 6.1.0: NVME drive goes offline randomly even with:
 nvme_core.default_ps_max_latency_us=0 pcie_aspm=off

Hello,

Kernel: 6.1.0-17-amd64
Distribution: Debian stable
Arch: x86_64

I have two NVMe drives in a BTRFS RAID-1. The first time this
happened, I added the following to the kernel command line at boot:
nvme_core.default_ps_max_latency_us=0 pcie_aspm=off

This greatly reduced the frequency of the issue (the last uptime was
~70 days).  However, it has occurred twice since then, and this time I
had netconsole up to capture the crash.
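
To rule out the parameters silently not being applied, the running kernel's command line can be compared against the two settings above. The following is a minimal Python sketch under stated assumptions: the helper names (`parse_cmdline`, `missing_params`, `report`) are hypothetical, and the whitespace split does not handle quoted values that contain spaces. It reads `/proc/cmdline` and, if present, the `nvme_core` module parameter exposed under `/sys/module`.

```python
from pathlib import Path

# The two parameters from the kernel command line quoted above.
EXPECTED = {
    "nvme_core.default_ps_max_latency_us": "0",
    "pcie_aspm": "off",
}


def parse_cmdline(cmdline):
    """Split a kernel command line into a {name: value} mapping.

    Naive whitespace split: quoted values containing spaces are not handled.
    """
    params = {}
    for token in cmdline.split():
        name, _, value = token.partition("=")
        params[name] = value
    return params


def missing_params(cmdline, expected=EXPECTED):
    """Return the expected parameters that are absent or set differently."""
    params = parse_cmdline(cmdline)
    return [name for name, value in expected.items() if params.get(name) != value]


def report():
    """Print what the running kernel was booted with (Linux only)."""
    print("missing:", missing_params(Path("/proc/cmdline").read_text()))
    sysfs = Path("/sys/module/nvme_core/parameters/default_ps_max_latency_us")
    if sysfs.exists():
        print("default_ps_max_latency_us =", sysfs.read_text().strip())
```

Calling `report()` on the affected machine should print an empty `missing` list if both settings were picked up at boot.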

The full netconsole log from before, during, and after the crash:
https://installkernel.tripod.com/20240701-6.1.0-crash.txt

Both drives have the same model and firmware version:
Model Number:                       Samsung SSD 990 PRO with Heatsink 4TB
Firmware Version:                   4B2QJXD7

Motherboard being used:
Manufacturer: ASUSTeK COMPUTER INC.
Product Name: Pro WS W680-ACE IPMI

Is there a workaround or potential fix for this issue?

The issue starts when this occurs:
[6078737.345641] nvme nvme2: I/O 154 (I/O Cmd) QID 6 timeout, aborting
[6078737.348143] nvme nvme2: I/O 155 (I/O Cmd) QID 6 timeout, aborting
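
For scanning the full netconsole capture for further timeouts of this kind, a small parser is enough. This is a sketch only: the regular expression is derived from the two lines above and may need adjusting for other message variants, and the function name `find_timeouts` is hypothetical. It also converts the first timestamp to days, on the assumption that the printk timestamp is roughly seconds since boot.

```python
import re

# Matches lines such as:
# [6078737.345641] nvme nvme2: I/O 154 (I/O Cmd) QID 6 timeout, aborting
TIMEOUT_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+nvme (?P<ctrl>nvme\d+): "
    r"I/O (?P<tag>\d+) \(.*?\) QID (?P<qid>\d+) timeout"
)


def find_timeouts(log_text):
    """Yield (seconds_since_boot, controller, tag, qid) for each timeout line."""
    for m in TIMEOUT_RE.finditer(log_text):
        yield float(m["ts"]), m["ctrl"], int(m["tag"]), int(m["qid"])


sample = """\
[6078737.345641] nvme nvme2: I/O 154 (I/O Cmd) QID 6 timeout, aborting
[6078737.348143] nvme nvme2: I/O 155 (I/O Cmd) QID 6 timeout, aborting
"""

events = list(find_timeouts(sample))
first_ts = events[0][0]
# Assumes the printk timestamp counts seconds since boot.
uptime_days = first_ts / 86400
print(len(events), "timeouts; first after about", round(uptime_days, 1), "days")
```

On the two sample lines this gives an uptime of a little over 70 days at the first timeout, which is consistent with the ~70 days mentioned earlier.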

Then, about 157 seconds later, a kernel panic:
[6078894.702941] BTRFS error (device nvme0n1p2): error writing primary super block to device 2
[6078894.707920] BTRFS warning (device nvme0n1p2): csum hole found for disk bytenr range [3659038877598419968, 3659038877598424064)
[6078894.708310] BTRFS critical (device nvme0n1p2): unable to find chunk map for logical 3659038877598419968 length 4096
[6078894.708652] BUG: kernel NULL pointer dereference, address: 000000000000005a
[6078894.708879] #PF: supervisor read access in kernel mode
[6078894.709107] #PF: error_code(0x0000) - not-present page
[6078894.709292] PGD 0 P4D 0
[6078894.709509] Oops: 0000 [#1] PREEMPT SMP NOPTI
[6078894.709692] CPU: 12 PID: 3349611 Comm: kworker/u64:18 Not tainted 6.1.0-17-amd64 #1  Debian 6.1.69-1
[6078894.709856] Hardware name: ASUSTeK COMPUTER INC. System Product Name/Pro WS W680-ACE IPMI, BIOS 3401 03/19/2024
[6078894.710022] Workqueue: btrfs-endio btrfs_end_bio_work [btrfs]
[6078894.710267] RIP: 0010:btrfs_get_io_geometry+0x13/0xf0 [btrfs]
[6078894.710483] Code: f4 ff ff ff e9 67 ff ff ff 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 0f 1f 44 00 00 41 56 49 89 c9 48 89 cf 41 55 41 54 55 53 <4c> 8b 76 70 89 d3 31 d2 4c 8b 5e 18 41 8b 4e 10 45 8b 6e 14 4d 29
[6078894.710692] RSP: 0018:ffffa9cfc6657c08 EFLAGS: 00010286
[6078894.710711] BTRFS error (device nvme0n1p2): error writing primary super block to device 2
[6078894.710876] RAX: ffffffffffffffea RBX: ffffffffffffffea RCX: 32c7847906990c00
[6078894.710882] RDX: 0000000000000000 RSI: ffffffffffffffea RDI: 32c7847906990c00
[6078894.710882] RBP: ffffa9cfc6657d28 R08: ffffa9cfc6657cc8 R09: 32c7847906990c00
[6078894.710882] R10: 0000000000000003 R11: ffff9a3efff6dc28 R12: ffff9a2018195000
[6078894.710883] R13: 0000000000000001 R14: 0000000000001000 R15: ffffa9cfc6657d50
[6078894.710884] FS:  0000000000000000(0000) GS:ffff9a3e7fb00000(0000) knlGS:0000000000000000
[6078894.710884] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[6078894.710885] CR2: 000000000000005a CR3: 0000000bfc210000 CR4: 0000000000750ee0
[6078894.710885] PKRU: 55555554
[6078894.710885] Call Trace:
[6078894.710887]  <TASK>
[6078894.710891]  ? page_fault_oops+0xd2/0x2b0
[6078894.710889]  ? __die_body.cold+0x1a/0x1f
[6078894.710893]  ? exc_page_fault+0x70/0x170
[6078894.715724]  ? asm_exc_page_fault+0x22/0x30
[6078894.716084]  ? btrfs_get_io_geometry+0x13/0xf0 [btrfs]
[6078894.716470] BTRFS error (device nvme0n1p2): error writing primary super block to device 2
[6078894.716462]  ? btrfs_get_chunk_map.cold+0x15/0x42 [btrfs]
[6078894.717384]  __btrfs_map_block+0xc4/0xe40 [btrfs]
[6078894.717771]  ? kmem_cache_free+0x15/0x310
[6078894.718147]  btrfs_submit_bio+0xa2/0x240 [btrfs]
[6078894.718571]  btrfs_repair_one_sector+0x29f/0x3a0 [btrfs]
[6078894.718972]  ? btrfs_submit_data_write_bio+0x110/0x110 [btrfs]
[6078894.719364]  end_compressed_bio_read+0x118/0x2f0 [btrfs]
[6078894.719753]  process_one_work+0x1c4/0x380
[6078894.720135]  worker_thread+0x4d/0x380
[6078894.720510]  ? rescuer_thread+0x3a0/0x3a0
[6078894.720864]  kthread+0xd7/0x100
[6078894.721224]  ? kthread_complete_and_exit+0x20/0x20
[6078894.721579]  ret_from_fork+0x1f/0x30
[6078894.721984]  </TASK>
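
One possible reading of the register dump, offered as an assumption rather than a confirmed diagnosis: RSI holds 0xffffffffffffffea, which is -22 as a signed 64-bit value (the magnitude of EINVAL), and the bytes at the faulting instruction (`4c 8b 76 70`, hand-decoded here as a load from RSI + 0x70) would then touch address 0x5a, which is what CR2 reports. That would point to an error-encoded pointer being dereferenced after the "unable to find chunk map" message, rather than a true NULL pointer. The short Python sketch below only checks this arithmetic; the instruction decoding and the 0x70 displacement are assumptions, not output from a disassembler.

```python
import errno

# Values copied from the register dump above.
RSI = 0xFFFFFFFFFFFFFFEA
CR2 = 0x5A

# Assumption: the faulting bytes "4c 8b 76 70" decode to a load from
# RSI + 0x70. This was decoded by hand, not by a disassembler.
DISPLACEMENT = 0x70


def as_signed64(value):
    """Interpret a 64-bit register value as a two's-complement integer."""
    return value - (1 << 64) if value & (1 << 63) else value


rsi_signed = as_signed64(RSI)
# Address arithmetic wraps at 64 bits.
fault_addr = (RSI + DISPLACEMENT) & 0xFFFFFFFFFFFFFFFF

print("RSI as signed:", rsi_signed, errno.errorcode.get(-rsi_signed))
print("RSI + 0x70 =", hex(fault_addr), "matches CR2:", fault_addr == CR2)
```

If the reading holds, the oops is a secondary effect of the failed I/O to the NVMe device, not an independent fault.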

Justin
