Message-ID: <3c6be8cb-753a-6efb-e2a6-5686740e6361@gmail.com>
Date: Mon, 14 Apr 2025 15:08:36 +0800
From: Linjun Bao <meljbao@...il.com>
To: Keith Busch <kbusch@...nel.org>, Jens Axboe <axboe@...com>,
Christoph Hellwig <hch@....de>, sagi@...mberg.me,
linux-nvme@...ts.infradead.org, linux-kernel@...r.kernel.org
Subject: nvme: potential stack collision when running nvme_setup_io_queues and
nvme_user_cmd in parallel
Hi there,
This is Linjun Bao from Solidigm, an SSD vendor. We recently received drive-drop reports from a customer, and we have confirmed that the issue reproduces reliably on the latest kernel.
## Issue Description
The issue occurs across different platforms, from an Intel BHS server to several desktop systems, and on SSDs from multiple vendors. It reproduces on newer kernels including
5.19, 6.0, 6.3 and 6.6, but the older 5.14 kernel does not show it.
The issue is easily reproduced with the test scenario below: two scripts run concurrently, one repeatedly issuing nvme reset to the SSD and the other repeatedly querying the SMART log from the same SSD. After several reset cycles, when the issue hits, dmesg prints "nvme nvme0: I/O 11 QID 0 timeout, disable controller" followed by "nvme nvme0: Disabling device after reset failure: -4".
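Roughly, the test scenario reduces to the sketch below. This is only an illustration: our actual test is two shell scripts driving nvme-cli, and the device names, the 512-byte SMART log read and the 1-second pacing here are placeholders to adjust for the setup under test.
```
/*
 * repro.c - minimal sketch of the reproduction scenario, not the exact
 * scripts we used (those are shell scripts around nvme-cli).  Device
 * names and pacing are placeholders; run as root.
 */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/nvme_ioctl.h>

#define CTRL_DEV "/dev/nvme0"    /* controller char device (placeholder) */
#define NS_DEV   "/dev/nvme0n1"  /* namespace block device (placeholder) */

/* Script #1: keep resetting the controller. */
static void reset_loop(void)
{
	for (;;) {
		int fd = open(CTRL_DEV, O_RDWR);

		if (fd >= 0) {
			if (ioctl(fd, NVME_IOCTL_RESET) < 0)
				perror("NVME_IOCTL_RESET");
			close(fd);
		}
		sleep(1);	/* let the reset make some progress */
	}
}

/*
 * Script #2: keep reading the SMART log via an admin passthrough ioctl on
 * the namespace block device, i.e. the blkdev_ioctl()/nvme_ctrl_ioctl()/
 * nvme_user_cmd() path seen in the traces below.
 */
static void smart_log_loop(void)
{
	uint8_t log[512];

	for (;;) {
		struct nvme_admin_cmd cmd = {
			.opcode   = 0x02,		/* Get Log Page */
			.nsid     = 0xffffffff,
			.addr     = (uintptr_t)log,
			.data_len = sizeof(log),
			/* NUMD (0-based dword count) in bits 31:16, LID 0x02 (SMART) in bits 7:0 */
			.cdw10    = ((sizeof(log) / 4 - 1) << 16) | 0x02,
		};
		int fd = open(NS_DEV, O_RDWR);

		if (fd >= 0) {
			if (ioctl(fd, NVME_IOCTL_ADMIN_CMD, &cmd) < 0)
				perror("NVME_IOCTL_ADMIN_CMD");
			close(fd);
		}
	}
}

int main(void)
{
	if (fork() == 0)
		reset_loop();		/* child: reset in a loop */
	smart_log_loop();		/* parent: query SMART log in a loop */
	return 0;
}
```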
Debugging on kernel 6.6.85 (close to the latest stable), both a LeCroy PCIe trace and kernel ftrace show that the host stops sending nvme_admin_create_cq or nvme_admin_create_sq commands during nvme_reset_work. The issue happens only when the nvme_queue_rq() for the user command executes right after the nvme_queue_rq() from nvme_setup_io_queues(), on the same CPU core and in the same thread.
In the example below, lines marked "Q" belong to commands whose parent is nvme_setup_io_queues(), and lines marked "L" belong to the user command that gets the SMART log.
```
13) <idle>-0 => kworker-151717
------------------------------------------
13)Q ! 561.302 us | } /* nvme_execute_rq [nvme_core] */
19)L 0.125 us | nvme_init_request [nvme_core]();
13)Q ! 563.722 us | } /* nvme_submit_sync_cmd [nvme_core] */
19)L 1.244 us | }
19)L 1.549 us | nvme_map_user_request [nvme_core]();
13)Q | nvme_submit_sync_cmd [nvme_core]() {
13)Q 0.368 us | nvme_init_request [nvme_core]();
19)L | nvme_passthru_start [nvme_core]() {
13)Q | nvme_execute_rq [nvme_core]() {
19)L 0.140 us | nvme_command_effects [nvme_core]();
19)L 0.348 us | }
19)L + 13.250 us | nvme_execute_rq [nvme_core]();
13)Q | nvme_queue_rq [nvme]() {
13)Q | nvme_setup_cmd [nvme_core]() {
13)Q | /* nvme_setup_cmd: nvme0: qid=0, cmdid=8214, nsid=0, flags=0x0, meta=0x0, cmd=(nvme_admin_create_sq sqid=19, qsize=1023, sq_flags=0x1, cqid=19) */
13)Q 1.716 us | }
13)Q 0.494 us | nvme_prep_rq.part.0 [nvme]();
13)Q 3.718 us | }
13)L | nvme_queue_rq [nvme]() {
13)L | nvme_fail_nonready_command [nvme_core]() {
13)L | nvme_complete_rq [nvme_core]() {
13)L | /* nvme_complete_rq: nvme0: qid=0, cmdid=0, opcode=2, res=0x0, retries=0, flags=0x2, status=0x370 */
13)L | nvme_end_req [nvme_core]() {
13)L 0.411 us | nvme_error_status [nvme_core]();
13)L 3.194 us | }
13)L 4.299 us | }
13)L 4.952 us | }
13)L 5.729 us | }
19)L 0.158 us | nvme_passthru_end [nvme_core]();
19)L + 18.486 us | }
19)L + 19.510 us | }
19)L + 19.731 us | }
19)L + 19.966 us | }
19)L | nvme_release [nvme_core]() {
19)L 0.159 us | nvme_ns_release [nvme_core]();
19)L 0.433 us | }
// the completion irq for this nvme_admin_create_sq arrives much later, ~60s afterwards, after the host has already hit its admin command timeout
## line 3165365
12) nvme-303232 => <idle>-0
------------------------------------------
12) | nvme_irq [nvme]() {
12) | /* nvme_sq: nvme0: qid=0, head=27, tail=27 */
12) 1.341 us | }
12) | nvme_pci_complete_rq [nvme]() {
12) | nvme_complete_rq [nvme_core]() {
12) | /* nvme_complete_rq: nvme0: qid=0, cmdid=8214, opcode=1, res=0x0, retries=0, flags=0x1, status=0x0 */
12) | nvme_end_req [nvme_core]() {
12) 0.076 us | nvme_error_status [nvme_core]();
12) 1.486 us | }
12) 1.794 us | }
12) 2.087 us | }
```
A call-stack ftrace shows more. In the lines I commented "weird behavior here", the call stack nvme_fail_nonready_command <- nvme_queue_rq looks wrong: for a user NVMe smart command issued during nvme_create_io_queues(), while the controller is in the CONNECTING state, the expected path is nvme_fail_nonready_command <- nvme_queue_rq <- blk_mq_dispatch_rq_list <- ... <- nvme_execute_rq <- nvme_submit_user_cmd. Instead, the trace shows nvme_fail_nonready_command <- nvme_queue_rq <- ... <- nvme_submit_sync_cmd <- nvme_setup_io_queues. Since this user command is now running on the same thread kworker/u40:5-302 and CPU core #19, I believe some stack collision happens and the user command wrongly reuses the stack of the sync command from nvme_setup_io_queues; that would explain why no further create sq/cq commands are sent afterwards. But I do not know how to debug further, so please help guide me.
```
=> cpu_startup_entry
=> start_secondary
=> secondary_startup_64_no_verify
<...>-20136 [018]L ..... 744.363379: nvme_open <-blkdev_get_whole
kworker/u40:5-302 [019]Q ..... 744.363379: nvme_setup_io_queues_trylock <-nvme_setup_io_queues
kworker/u40:5-302 [019]Q ..... 744.363380: <stack trace>
=> nvme_setup_io_queues_trylock
=> nvme_setup_io_queues
=> nvme_reset_work
=> process_scheduled_works
=> worker_thread
=> kthread
=> ret_from_fork
=> ret_from_fork_asm
kworker/u40:5-302 [019]Q ..... 744.363381: nvme_init_queue <-nvme_setup_io_queues
kworker/u40:5-302 [019]Q ..... 744.363382: <stack trace>
=> nvme_init_queue
=> nvme_setup_io_queues
=> nvme_reset_work
=> process_scheduled_works
=> worker_thread
=> kthread
=> ret_from_fork
=> ret_from_fork_asm
<...>-20136 [018]L ..... 744.363383: <stack trace>
=> nvme_open
=> blkdev_get_whole
=> blkdev_get_by_dev
=> blkdev_open
=> do_dentry_open
=> vfs_open
=> path_openat
=> do_filp_open
=> do_sys_openat2
=> do_sys_open
=> __x64_sys_openat
=> x64_sys_call
=> do_syscall_64
=> entry_SYSCALL_64_after_hwframe
<...>-20136 [018]L ..... 744.363383: nvme_ns_open <-nvme_open
<...>-20136 [018]L ..... 744.363385: <stack trace>
=> nvme_ns_open
=> nvme_open
=> blkdev_get_whole
=> blkdev_get_by_dev
=> blkdev_open
=> do_dentry_open
=> vfs_open
=> path_openat
=> do_filp_open
=> do_sys_openat2
=> do_sys_open
=> __x64_sys_openat
=> x64_sys_call
=> do_syscall_64
=> entry_SYSCALL_64_after_hwframe
<...>-20136 [018]L ..... 744.363390: nvme_ioctl <-blkdev_ioctl
<...>-20136 [018]L ..... 744.363391: <stack trace>
=> nvme_ioctl
=> blkdev_ioctl
=> __x64_sys_ioctl
=> x64_sys_call
=> do_syscall_64
=> entry_SYSCALL_64_after_hwframe
<...>-20136 [018]L ..... 744.363391: nvme_ctrl_ioctl <-nvme_ioctl
<...>-20136 [018]L ..... 744.363392: <stack trace>
=> nvme_ctrl_ioctl
=> nvme_ioctl
=> blkdev_ioctl
=> __x64_sys_ioctl
=> x64_sys_call
=> do_syscall_64
=> entry_SYSCALL_64_after_hwframe
<...>-20136 [018]L ..... 744.363392: nvme_user_cmd.constprop.0 <-nvme_ctrl_ioctl
<...>-20136 [018]L ..... 744.363393: <stack trace>
=> nvme_user_cmd.constprop.0
=> nvme_ctrl_ioctl
=> nvme_ioctl
=> blkdev_ioctl
=> __x64_sys_ioctl
=> x64_sys_call
=> do_syscall_64
=> entry_SYSCALL_64_after_hwframe
<...>-20136 [018]L ..... 744.363393: nvme_validate_passthru_nsid.isra.0 <-nvme_user_cmd.constprop.0
<...>-20136 [018]L ..... 744.363394: <stack trace>
=> nvme_validate_passthru_nsid.isra.0
=> nvme_user_cmd.constprop.0
=> nvme_ctrl_ioctl
=> nvme_ioctl
=> blkdev_ioctl
=> __x64_sys_ioctl
=> x64_sys_call
=> do_syscall_64
=> entry_SYSCALL_64_after_hwframe
<...>-20136 [018]L ..... 744.363394: nvme_cmd_allowed <-nvme_user_cmd.constprop.0
<...>-20136 [018]L ..... 744.363395: <stack trace>
=> nvme_cmd_allowed
=> nvme_user_cmd.constprop.0
=> nvme_ctrl_ioctl
=> nvme_ioctl
=> blkdev_ioctl
=> __x64_sys_ioctl
=> x64_sys_call
=> do_syscall_64
=> entry_SYSCALL_64_after_hwframe
<...>-20136 [018]L ..... 744.363395: nvme_submit_user_cmd <-nvme_user_cmd.constprop.0
<...>-20136 [018]L ..... 744.363396: <stack trace>
=> nvme_submit_user_cmd
=> nvme_user_cmd.constprop.0
=> nvme_ctrl_ioctl
=> nvme_ioctl
=> blkdev_ioctl
=> __x64_sys_ioctl
=> x64_sys_call
=> do_syscall_64
=> entry_SYSCALL_64_after_hwframe
<...>-20136 [018]L ..... 744.363396: nvme_alloc_user_request <-nvme_submit_user_cmd
<...>-20136 [018]L ..... 744.363397: <stack trace>
=> nvme_alloc_user_request
=> nvme_submit_user_cmd
=> nvme_user_cmd.constprop.0
=> nvme_ctrl_ioctl
=> nvme_ioctl
=> blkdev_ioctl
=> __x64_sys_ioctl
=> x64_sys_call
=> do_syscall_64
=> entry_SYSCALL_64_after_hwframe
<...>-20136 [018]L ..... 744.363399: nvme_init_request <-nvme_alloc_user_request
<...>-20136 [018]L ..... 744.363400: <stack trace>
=> nvme_init_request
=> nvme_alloc_user_request
=> nvme_submit_user_cmd
=> nvme_user_cmd.constprop.0
=> nvme_ctrl_ioctl
=> nvme_ioctl
=> blkdev_ioctl
=> __x64_sys_ioctl
=> x64_sys_call
=> do_syscall_64
=> entry_SYSCALL_64_after_hwframe
<...>-20136 [018]L ..... 744.363400: nvme_map_user_request <-nvme_submit_user_cmd
<...>-20136 [018]L ..... 744.363401: <stack trace>
=> nvme_map_user_request
=> nvme_submit_user_cmd
=> nvme_user_cmd.constprop.0
=> nvme_ctrl_ioctl
=> nvme_ioctl
=> blkdev_ioctl
=> __x64_sys_ioctl
=> x64_sys_call
=> do_syscall_64
=> entry_SYSCALL_64_after_hwframe
kworker/u40:5-302 [019]Q ..... 744.363402: nvme_submit_sync_cmd <-nvme_setup_io_queues
kworker/u40:5-302 [019]Q ..... 744.363403: <stack trace>
=> nvme_submit_sync_cmd
=> nvme_setup_io_queues
=> nvme_reset_work
=> process_scheduled_works
=> worker_thread
=> kthread
=> ret_from_fork
=> ret_from_fork_asm
<...>-20136 [018]L ..... 744.363404: nvme_passthru_start <-nvme_submit_user_cmd
kworker/u40:5-302 [019]Q ..... 744.363405: nvme_init_request <-__nvme_submit_sync_cmd
<...>-20136 [018]L ..... 744.363405: <stack trace>
=> nvme_passthru_start
=> nvme_submit_user_cmd
=> nvme_user_cmd.constprop.0
=> nvme_ctrl_ioctl
=> nvme_ioctl
=> blkdev_ioctl
=> __x64_sys_ioctl
=> x64_sys_call
=> do_syscall_64
=> entry_SYSCALL_64_after_hwframe
<...>-20136 [018]L ..... 744.363405: nvme_command_effects <-nvme_passthru_start
kworker/u40:5-302 [019]Q ..... 744.363405: <stack trace>
=> nvme_init_request
=> __nvme_submit_sync_cmd
=> nvme_submit_sync_cmd
=> nvme_setup_io_queues
=> nvme_reset_work
=> process_scheduled_works
=> worker_thread
=> kthread
=> ret_from_fork
=> ret_from_fork_asm
kworker/u40:5-302 [019]Q ..... 744.363406: nvme_execute_rq <-__nvme_submit_sync_cmd
<...>-20136 [018]L ..... 744.363406: <stack trace>
=> nvme_command_effects
=> nvme_passthru_start
=> nvme_submit_user_cmd
=> nvme_user_cmd.constprop.0
=> nvme_ctrl_ioctl
=> nvme_ioctl
=> blkdev_ioctl
=> __x64_sys_ioctl
=> x64_sys_call
=> do_syscall_64
=> entry_SYSCALL_64_after_hwframe
<...>-20136 [018]L ..... 744.363406: nvme_execute_rq <-nvme_submit_user_cmd
kworker/u40:5-302 [019]Q ..... 744.363407: <stack trace>
=> nvme_execute_rq
=> __nvme_submit_sync_cmd
=> nvme_submit_sync_cmd
=> nvme_setup_io_queues
=> nvme_reset_work
=> process_scheduled_works
=> worker_thread
=> kthread
=> ret_from_fork
=> ret_from_fork_asm
<...>-20136 [018]L ..... 744.363407: <stack trace>
=> nvme_execute_rq
=> nvme_submit_user_cmd
=> nvme_user_cmd.constprop.0
=> nvme_ctrl_ioctl
=> nvme_ioctl
=> blkdev_ioctl
=> __x64_sys_ioctl
=> x64_sys_call
=> do_syscall_64
=> entry_SYSCALL_64_after_hwframe
kworker/u40:5-302 [019]Q ..... 744.363408: nvme_queue_rq <-blk_mq_dispatch_rq_list
kworker/u40:5-302 [019]Q ..... 744.363409: <stack trace>
=> nvme_queue_rq
=> blk_mq_dispatch_rq_list
=> __blk_mq_sched_dispatch_requests
=> blk_mq_sched_dispatch_requests
=> blk_mq_run_hw_queue
=> blk_execute_rq
=> nvme_execute_rq
=> __nvme_submit_sync_cmd
=> nvme_submit_sync_cmd
=> nvme_setup_io_queues
=> nvme_reset_work
=> process_scheduled_works
=> worker_thread
=> kthread
=> ret_from_fork
=> ret_from_fork_asm
kworker/u40:5-302 [019]Q ..... 744.363409: nvme_setup_cmd <-nvme_queue_rq
kworker/u40:5-302 [019]Q ..... 744.363410: <stack trace>
=> nvme_setup_cmd
=> nvme_queue_rq
=> blk_mq_dispatch_rq_list
=> __blk_mq_sched_dispatch_requests
=> blk_mq_sched_dispatch_requests
=> blk_mq_run_hw_queue
=> blk_execute_rq
=> nvme_execute_rq
=> __nvme_submit_sync_cmd
=> nvme_submit_sync_cmd
=> nvme_setup_io_queues
=> nvme_reset_work
=> process_scheduled_works
=> worker_thread
=> kthread
=> ret_from_fork
=> ret_from_fork_asm
kworker/u40:5-302 [019]Q ..... 744.363411: nvme_setup_cmd: nvme0: qid=0, cmdid=0, nsid=0, flags=0x0, meta=0x0, cmd=(nvme_admin_create_cq cqid=13, qsize=1023, cq_flags=0x3, irq_vector=13)
kworker/u40:5-302 [019]Q ..... 744.363411: nvme_setup_cmd: nvme0: qid=0, cmdid=0, nsid=0, flags=0x0, meta=0x0, cmd=(nvme_admin_create_cq cqid=13, qsize=1023, cq_flags=0x3, irq_vector=13)
kworker/u40:5-302 [019]Q ..... 744.363411: nvme_setup_cmd: nvme0: qid=0, cmdid=57352, nsid=0, flags=0x0, meta=0x0, cmd=(nvme_admin_create_cq cqid=13, qsize=1023, cq_flags=0x3, irq_vector=13)
kworker/u40:5-302 [019]Q ..... 744.363412: nvme_prep_rq.part.0 <-nvme_queue_rq
kworker/u40:5-302 [019]Q ..... 744.363413: <stack trace>
=> nvme_prep_rq.part.0
=> nvme_queue_rq
=> blk_mq_dispatch_rq_list
=> __blk_mq_sched_dispatch_requests
=> blk_mq_sched_dispatch_requests
=> blk_mq_run_hw_queue
=> blk_execute_rq
=> nvme_execute_rq
=> __nvme_submit_sync_cmd
=> nvme_submit_sync_cmd
=> nvme_setup_io_queues
=> nvme_reset_work
=> process_scheduled_works
=> worker_thread
=> kthread
=> ret_from_fork
=> ret_from_fork_asm
## weird behavior here
kworker/u40:5-302 [019]Q ...1. 744.363413: nvme_queue_rq: nvme_sq_copy_cmd
kworker/u40:5-302 [019]L ..... 744.363415: nvme_fail_nonready_command <-nvme_queue_rq
kworker/u40:5-302 [019]L ..... 744.363416: <stack trace>
=> nvme_fail_nonready_command
=> nvme_queue_rq
=> blk_mq_dispatch_rq_list
=> __blk_mq_sched_dispatch_requests
=> blk_mq_sched_dispatch_requests
=> blk_mq_run_hw_queue
=> blk_execute_rq
=> nvme_execute_rq
=> __nvme_submit_sync_cmd
=> nvme_submit_sync_cmd
=> nvme_setup_io_queues
=> nvme_reset_work
=> process_scheduled_works
=> worker_thread
=> kthread
=> ret_from_fork
=> ret_from_fork_asm
kworker/u40:5-302 [019]L ..... 744.363417: nvme_complete_rq <-nvme_fail_nonready_command
kworker/u40:5-302 [019]L ..... 744.363418: <stack trace>
=> nvme_complete_rq
=> nvme_fail_nonready_command
=> nvme_queue_rq
=> blk_mq_dispatch_rq_list
=> __blk_mq_sched_dispatch_requests
=> blk_mq_sched_dispatch_requests
=> blk_mq_run_hw_queue
=> blk_execute_rq
=> nvme_execute_rq
=> __nvme_submit_sync_cmd
=> nvme_submit_sync_cmd
=> nvme_setup_io_queues
=> nvme_reset_work
=> process_scheduled_works
=> worker_thread
=> kthread
=> ret_from_fork
=> ret_from_fork_asm
kworker/u40:5-302 [019]L ..... 744.363418: nvme_complete_rq: nvme0: qid=0, cmdid=0, opcode=2, res=0x0, retries=0, flags=0x2, status=0x370
kworker/u40:5-302 [019]L ..... 744.363418: nvme_end_req <-nvme_complete_rq
kworker/u40:5-302 [019]L ..... 744.363419: <stack trace>
=> nvme_end_req
=> nvme_complete_rq
=> nvme_fail_nonready_command
=> nvme_queue_rq
=> blk_mq_dispatch_rq_list
=> __blk_mq_sched_dispatch_requests
=> blk_mq_sched_dispatch_requests
=> blk_mq_run_hw_queue
=> blk_execute_rq
=> nvme_execute_rq
=> __nvme_submit_sync_cmd
=> nvme_submit_sync_cmd
=> nvme_setup_io_queues
=> nvme_reset_work
=> process_scheduled_works
=> worker_thread
=> kthread
=> ret_from_fork
=> ret_from_fork_asm
kworker/u40:5-302 [019]L ..... 744.363420: nvme_error_status <-nvme_end_req
kworker/u40:5-302 [019]L ..... 744.363421: <stack trace>
=> nvme_error_status
=> nvme_end_req
=> nvme_complete_rq
=> nvme_fail_nonready_command
=> nvme_queue_rq
=> blk_mq_dispatch_rq_list
=> __blk_mq_sched_dispatch_requests
=> blk_mq_sched_dispatch_requests
=> blk_mq_run_hw_queue
=> blk_execute_rq
=> nvme_execute_rq
=> __nvme_submit_sync_cmd
=> nvme_submit_sync_cmd
=> nvme_setup_io_queues
=> nvme_reset_work
=> process_scheduled_works
=> worker_thread
=> kthread
=> ret_from_fork
=> ret_from_fork_asm
<...>-20136 [018]L ..... 744.363429: nvme_passthru_end <-nvme_submit_user_cmd
<...>-20136 [018]L ..... 744.363430: <stack trace>
=> nvme_passthru_end
=> nvme_submit_user_cmd
=> nvme_user_cmd.constprop.0
=> nvme_ctrl_ioctl
=> nvme_ioctl
=> blkdev_ioctl
=> __x64_sys_ioctl
=> x64_sys_call
=> do_syscall_64
=> entry_SYSCALL_64_after_hwframe
<...>-20136 [018]L ..... 744.363449: nvme_release <-blkdev_put_whole
<...>-20136 [018]L ..... 744.363450: <stack trace>
=> nvme_release
=> blkdev_put_whole
=> blkdev_put
=> blkdev_release
=> __fput
=> __fput_sync
=> __x64_sys_close
=> x64_sys_call
=> do_syscall_64
=> entry_SYSCALL_64_after_hwframe
<...>-20136 [018]L ..... 744.363451: nvme_ns_release <-nvme_release
<...>-20136 [018]L ..... 744.363452: <stack trace>
=> nvme_ns_release
=> nvme_release
=> blkdev_put_whole
=> blkdev_put
=> blkdev_release
=> __fput
=> __fput_sync
=> __x64_sys_close
```