Message-ID: <CAO9zADwJ_AXMJTjBLXkO_A0JhsXzsfqtx92DbSq_gcG9LPLZ_w@mail.gmail.com>
Date: Mon, 1 Jul 2024 08:23:53 -0400
From: Justin Piszcz <jpiszcz@...idpixels.com>
To: LKML <linux-kernel@...r.kernel.org>
Subject: 6.1.0: NVME drive goes offline randomly even with:
nvme_core.default_ps_max_latency_us=0 pcie_aspm=off
Hello,
Kernel: 6.1.0-17-amd64
Distribution: Debian stable
Arch: x86_64
I have two NVMe drives in a BTRFS RAID-1. The first time one of them
went offline, I added the following to the kernel command line at
boot:
nvme_core.default_ps_max_latency_us=0 pcie_aspm=off
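To rule out the parameters being dropped by the bootloader, the running command line can be checked after a reboot. The following is a minimal sketch, not a tested tool: the function names `parse_cmdline` and `missing_params` are made up for illustration, and the parser assumes simple whitespace-separated `name=value` tokens (quoted values containing spaces are not handled).

```python
# Hypothetical helper names; only the two parameter names come from this report.
EXPECTED = {
    "nvme_core.default_ps_max_latency_us": "0",
    "pcie_aspm": "off",
}


def parse_cmdline(cmdline):
    """Split a kernel command line into {name: value}; bare flags map to ''."""
    params = {}
    for token in cmdline.split():
        name, _, value = token.partition("=")
        params[name] = value
    return params


def missing_params(cmdline, expected=EXPECTED):
    """Return the expected parameter names that are absent or have another value."""
    params = parse_cmdline(cmdline)
    return [name for name, value in expected.items() if params.get(name) != value]
```

On the affected machine, passing the contents of /proc/cmdline to `missing_params` should return an empty list if both settings are active. The value the NVMe driver actually uses should also be readable from /sys/module/nvme_core/parameters/default_ps_max_latency_us.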
This greatly reduced the frequency of the issue (the last uptime was ~70
days). However, it has occurred twice since then. This time I had
netconsole up to capture the crash.
The full netconsole log from before, during, and after the crash:
https://installkernel.tripod.com/20240701-6.1.0-crash.txt
Both drives have the same model and firmware version:
Model Number: Samsung SSD 990 PRO with Heatsink 4TB
Firmware Version: 4B2QJXD7
Motherboard being used:
Manufacturer: ASUSTeK COMPUTER INC.
Product Name: Pro WS W680-ACE IPMI
Is there a workaround or potential fix for this issue?
The issue starts with these I/O timeouts:
[6078737.345641] nvme nvme2: I/O 154 (I/O Cmd) QID 6 timeout, aborting
[6078737.348143] nvme nvme2: I/O 155 (I/O Cmd) QID 6 timeout, aborting
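To see whether the timeouts cluster on one controller or queue across the whole netconsole capture, the log can be scanned for lines of this shape. This is a rough sketch under assumptions: the regular expression is modeled only on the two lines quoted above, and the names `find_timeouts` and `timeouts_per_queue` are illustrative.

```python
import re
from collections import Counter

# Pattern modeled on the two timeout lines quoted in this report (an assumption,
# not a complete grammar for the nvme driver's messages).
TIMEOUT_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+nvme (?P<ctrl>nvme\d+): "
    r"I/O (?P<tag>\d+) \(.*?\) QID (?P<qid>\d+) timeout"
)


def find_timeouts(log_text):
    """Return (timestamp, controller, queue id, command tag) for each timeout line."""
    return [
        (float(m["ts"]), m["ctrl"], int(m["qid"]), int(m["tag"]))
        for m in TIMEOUT_RE.finditer(log_text)
    ]


def timeouts_per_queue(events):
    """Count timeout events per (controller, queue id) pair."""
    return Counter((ctrl, qid) for _, ctrl, qid, _ in events)
```

Running `find_timeouts` over the full log file and comparing the first timestamp with the first BTRFS error would also show how long the drive was unresponsive before the filesystem gave up.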
Then, roughly 157 seconds later, BTRFS errors followed by a kernel oops
(NULL pointer dereference):
[6078894.702941] BTRFS error (device nvme0n1p2): error writing primary super block to device 2
[6078894.707920] BTRFS warning (device nvme0n1p2): csum hole found for disk bytenr range [3659038877598419968, 3659038877598424064)
[6078894.708310] BTRFS critical (device nvme0n1p2): unable to find chunk map for logical 3659038877598419968 length 4096
[6078894.708652] BUG: kernel NULL pointer dereference, address: 000000000000005a
[6078894.708879] #PF: supervisor read access in kernel mode
[6078894.709107] #PF: error_code(0x0000) - not-present page
[6078894.709292] PGD 0 P4D 0
[6078894.709509] Oops: 0000 [#1] PREEMPT SMP NOPTI
[6078894.709692] CPU: 12 PID: 3349611 Comm: kworker/u64:18 Not tainted 6.1.0-17-amd64 #1 Debian 6.1.69-1
[6078894.709856] Hardware name: ASUSTeK COMPUTER INC. System Product Name/Pro WS W680-ACE IPMI, BIOS 3401 03/19/2024
[6078894.710022] Workqueue: btrfs-endio btrfs_end_bio_work [btrfs]
[6078894.710267] RIP: 0010:btrfs_get_io_geometry+0x13/0xf0 [btrfs]
[6078894.710483] Code: f4 ff ff ff e9 67 ff ff ff 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 0f 1f 44 00 00 41 56 49 89 c9 48 89 cf 41 55 41 54 55 53 <4c> 8b 76 70 89 d3 31 d2 4c 8b 5e 18 41 8b 4e 10 45 8b 6e 14 4d 29
[6078894.710692] RSP: 0018:ffffa9cfc6657c08 EFLAGS: 00010286
[6078894.710711] BTRFS error (device nvme0n1p2): error writing primary super block to device 2
[6078894.710876] RAX: ffffffffffffffea RBX: ffffffffffffffea RCX: 32c7847906990c00
[6078894.710882] RDX: 0000000000000000 RSI: ffffffffffffffea RDI: 32c7847906990c00
[6078894.710882] RBP: ffffa9cfc6657d28 R08: ffffa9cfc6657cc8 R09: 32c7847906990c00
[6078894.710882] R10: 0000000000000003 R11: ffff9a3efff6dc28 R12: ffff9a2018195000
[6078894.710883] R13: 0000000000000001 R14: 0000000000001000 R15: ffffa9cfc6657d50
[6078894.710884] FS: 0000000000000000(0000) GS:ffff9a3e7fb00000(0000) knlGS:0000000000000000
[6078894.710884] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[6078894.710885] CR2: 000000000000005a CR3: 0000000bfc210000 CR4: 0000000000750ee0
[6078894.710885] PKRU: 55555554
[6078894.710885] Call Trace:
[6078894.710887] <TASK>
[6078894.710891] ? page_fault_oops+0xd2/0x2b0
[6078894.710889] ? __die_body.cold+0x1a/0x1f
[6078894.710893] ? exc_page_fault+0x70/0x170
[6078894.715724] ? asm_exc_page_fault+0x22/0x30
[6078894.716084] ? btrfs_get_io_geometry+0x13/0xf0 [btrfs]
[6078894.716470] BTRFS error (device nvme0n1p2): error writing primary super block to device 2
[6078894.716462] ? btrfs_get_chunk_map.cold+0x15/0x42 [btrfs]
[6078894.717384] __btrfs_map_block+0xc4/0xe40 [btrfs]
[6078894.717771] ? kmem_cache_free+0x15/0x310
[6078894.718147] btrfs_submit_bio+0xa2/0x240 [btrfs]
[6078894.718571] btrfs_repair_one_sector+0x29f/0x3a0 [btrfs]
[6078894.718972] ? btrfs_submit_data_write_bio+0x110/0x110 [btrfs]
[6078894.719364] end_compressed_bio_read+0x118/0x2f0 [btrfs]
[6078894.719753] process_one_work+0x1c4/0x380
[6078894.720135] worker_thread+0x4d/0x380
[6078894.720510] ? rescuer_thread+0x3a0/0x3a0
[6078894.720864] kthread+0xd7/0x100
[6078894.721224] ? kthread_complete_and_exit+0x20/0x20
[6078894.721579] ret_from_fork+0x1f/0x30
[6078894.721984] </TASK>
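One possible reading of the register dump, offered as a sketch and not a confirmed diagnosis: the bytes at the faulting instruction, `<4c> 8b 76 70`, decode as a load from 0x70(%rsi), and RSI holds ffffffffffffffea. The arithmetic below checks whether that matches the faulting address in CR2 and whether RSI looks like a negative errno value. The variable names are illustrative; the constants are copied from the oops above.

```python
import errno

MASK64 = (1 << 64) - 1


def to_signed64(value):
    """Interpret a 64-bit register value as a signed integer."""
    return value - (1 << 64) if value & (1 << 63) else value


rsi = 0xFFFFFFFFFFFFFFEA  # RSI from the register dump
disp = 0x70               # displacement in the faulting load, 0x70(%rsi)
cr2 = 0x5A                # faulting address reported in CR2

fault_addr = (rsi + disp) & MASK64        # address the load would touch
signed_rsi = to_signed64(rsi)             # RSI viewed as a signed value
errname = errno.errorcode.get(-signed_rsi)  # errno name, if it maps to one

print(hex(fault_addr), signed_rsi, errname)
```

The computed address equals CR2, and RSI reads as -22, which is -EINVAL. That would be consistent with btrfs_get_io_geometry being handed an error value instead of a valid chunk map pointer right after the "unable to find chunk map" message, so the oops may be an unhandled error path in the BTRFS repair code triggered by the drive dropping out. I have not verified this against the 6.1 source.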
Justin