Message-ID: <Z_QjxCZW7dh_v22Z@slm.duckdns.org>
Date: Mon, 7 Apr 2025 09:13:08 -1000
From: Tejun Heo <tj@...nel.org>
To: Nilay Shroff <nilay@...ux.ibm.com>
Cc: Jakub Kicinski <kuba@...nel.org>, Jens Axboe <axboe@...nel.dk>,
cgroups@...r.kernel.org, linux-block@...r.kernel.org,
linux-kernel@...r.kernel.org
Subject: [RESEND] Circular locking dependency involving elevator_lock,
rq_qos_mutex and pcpu_alloc_mutex
[sorry, lost cc list somehow, resending]
Hello,
Jakub reports the following lockdep splat. It looks like q_usage_counter
somehow depends on elevator_lock. After your recent changes, the iocost init
path performs a memory allocation while holding elevator_lock, completing the
circular dependency.
I don't understand the q_usage_counter -> elevator_lock dependency. Where is
that coming from? Ah, that's q->io_lockdep_map, not the percpu_ref itself. I
think it's the elevator switch acquiring elevator_lock while the queue is
frozen, which makes elevator_lock depended upon from the IO path, and thus
you can't perform reclaim-capable memory allocations while holding it.
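
To spell out the ordering I mean, here's a rough sketch of the two paths
(simplified pseudo-C, made-up function names and a made-up percpu type
standing in for the real iocost stats, not actual tree code):

/*
 * Elevator-switch side: elevator_lock nests inside a frozen queue, which is
 * what records the q_usage_counter(io) -> elevator_lock dependency via
 * q->io_lockdep_map.
 */
static void sketch_elv_switch(struct request_queue *q)
{
	blk_mq_freeze_queue_nomemsave(q);	/* q_usage_counter(io) */
	mutex_lock(&q->elevator_lock);		/* io -> elevator_lock */
	/* ... switch the elevator ... */
	mutex_unlock(&q->elevator_lock);
	blk_mq_unfreeze_queue_nomemrestore(q);
}

/*
 * iocost-init side: a reclaim-capable allocation while holding elevator_lock
 * (and rq_qos_mutex).  The percpu allocator takes pcpu_alloc_mutex and may
 * enter fs_reclaim, and reclaim may in turn wait on IO against this queue,
 * i.e. on q_usage_counter(io), closing the cycle:
 *
 *   elevator_lock -> pcpu_alloc_mutex -> fs_reclaim ->
 *     q_usage_counter(io) -> elevator_lock
 */
static int sketch_iocost_init(struct request_queue *q)
{
	u64 __percpu *stats;		/* stand-in for the real percpu stats */

	mutex_lock(&q->elevator_lock);
	mutex_lock(&q->rq_qos_mutex);
	stats = alloc_percpu(u64);	/* GFP_KERNEL under both locks */
	mutex_unlock(&q->rq_qos_mutex);
	mutex_unlock(&q->elevator_lock);

	if (!stats)
		return -ENOMEM;
	free_percpu(stats);		/* real code would hand this off */
	return 0;
}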
The involved commits are:
245618f8e45f ("block: protect wbt_lat_usec using q->elevator_lock")
9730763f4756 ("block: correct locking order for protecting blk-wbt parameters")
Can you please take a look? It looks like the second one is expanding the
locking scope too far.
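
Not a patch, but one possible shape would be to keep reclaim-capable
allocations out of the lock entirely, e.g. allocate up front and only publish
the result under elevator_lock (again simplified pseudo-C with the same
made-up percpu type):

static int sketch_init_narrow(struct request_queue *q)
{
	u64 __percpu *stats;

	stats = alloc_percpu(u64);	/* GFP_KERNEL, no queue locks held */
	if (!stats)
		return -ENOMEM;

	mutex_lock(&q->elevator_lock);
	/*
	 * Short critical section: wire up the preallocated data (ownership
	 * handed off here in real code).  No allocations, no queue freezing.
	 */
	mutex_unlock(&q->elevator_lock);

	return 0;
}

Either way, the key point is that nothing which can recurse into reclaim
should run under elevator_lock once that lock is taken with the queue frozen.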
Thanks.
[ 139.119772] fb-cgroups-setu/1238 is trying to acquire lock:
[ 139.119776] ffffffff867ca448 (pcpu_alloc_mutex){+.+.}-{4:4}, at: pcpu_alloc_noprof+0x96f/0x1000
[ 139.169460]
but task is already holding lock:
[ 139.169462] ffff88813ba10298 (&q->rq_qos_mutex){+.+.}-{4:4}, at: blkg_conf_open_bdev_frozen+0x218/0x2b0
[ 139.217563]
which lock already depends on the new lock.
[ 139.217566]
the existing dependency chain (in reverse order) is:
[ 139.217568]
-> #4 (&q->rq_qos_mutex){+.+.}-{4:4}:
[ 139.217577] __mutex_lock+0x17b/0x17c0
[ 139.217587] blkg_conf_open_bdev_frozen+0x218/0x2b0
[ 139.280864] ioc_qos_write+0xc9/0xbc0
[ 139.280870] cgroup_file_write+0x1a3/0x6f0
[ 139.280878] kernfs_fop_write_iter+0x350/0x520
[ 139.280885] vfs_write+0x9b2/0xf50
[ 139.280891] ksys_write+0xf3/0x1d0
[ 139.280896] do_syscall_64+0x6e/0x190
[ 139.280901] entry_SYSCALL_64_after_hwframe+0x4b/0x53
[ 139.280906]
-> #3 (&q->elevator_lock){+.+.}-{4:4}:
[ 139.280915] __mutex_lock+0x17b/0x17c0
[ 139.280919] blkg_conf_open_bdev_frozen+0x1c8/0x2b0
[ 139.280925] ioc_qos_write+0xc9/0xbc0
[ 139.280929] cgroup_file_write+0x1a3/0x6f0
[ 139.280934] kernfs_fop_write_iter+0x350/0x520
[ 139.280938] vfs_write+0x9b2/0xf50
[ 139.280943] ksys_write+0xf3/0x1d0
[ 139.280947] do_syscall_64+0x6e/0x190
[ 139.280952] entry_SYSCALL_64_after_hwframe+0x4b/0x53
[ 139.280956]
-> #2 (&q->q_usage_counter(io)#2){++++}-{0:0}:
[ 139.280965] blk_alloc_queue+0x5c1/0x700
[ 139.280971] blk_mq_alloc_queue+0x14c/0x230
[ 139.280978] __blk_mq_alloc_disk+0x15/0xc0
[ 139.280983] nvme_alloc_ns+0x21d/0x30f0
[ 139.280988] nvme_scan_ns+0x4f1/0x850
[ 139.280991] async_run_entry_fn+0x93/0x4f0
[ 139.280997] process_one_work+0x89e/0x1910
[ 139.281001] worker_thread+0x58d/0xcf0
[ 139.281005] kthread+0x3d5/0x7a0
[ 139.281010] ret_from_fork+0x2d/0x70
[ 139.281016] ret_from_fork_asm+0x11/0x20
[ 139.281023]
-> #1 (fs_reclaim){+.+.}-{0:0}:
[ 139.281031] fs_reclaim_acquire+0xff/0x150
[ 139.281037] __kmalloc_noprof+0xa9/0x5f0
[ 139.281042] pcpu_create_chunk+0x23/0x6e0
[ 139.281049] pcpu_alloc_noprof+0xd34/0x1000
[ 139.281054] bts_init+0xaa/0x180
[ 139.281060] do_one_initcall+0xfa/0x500
[ 139.281065] kernel_init_freeable+0x4af/0x6d0
[ 139.281070] kernel_init+0x1b/0x1d0
[ 139.281074] ret_from_fork+0x2d/0x70
[ 139.281078] ret_from_fork_asm+0x11/0x20
[ 139.281083]
-> #0 (pcpu_alloc_mutex){+.+.}-{4:4}:
[ 139.281091] __lock_acquire+0x1569/0x2640
[ 139.281097] lock_acquire+0x179/0x330
[ 139.281102] __mutex_lock+0x17b/0x17c0
[ 139.281106] pcpu_alloc_noprof+0x96f/0x1000
[ 139.281111] blk_iocost_init+0x6f/0x820
[ 139.281116] ioc_qos_write+0x468/0xbc0
[ 139.281120] cgroup_file_write+0x1a3/0x6f0
[ 139.281125] kernfs_fop_write_iter+0x350/0x520
[ 139.281130] vfs_write+0x9b2/0xf50
[ 139.281134] ksys_write+0xf3/0x1d0
[ 139.281138] do_syscall_64+0x6e/0x190
[ 139.281143] entry_SYSCALL_64_after_hwframe+0x4b/0x53
[ 139.281147]
other info that might help us debug this:
[ 139.281149] Chain exists of:
pcpu_alloc_mutex --> &q->elevator_lock --> &q->rq_qos_mutex
[ 139.281158] Possible unsafe locking scenario:
[ 139.281159]        CPU0                    CPU1
[ 139.281161]        ----                    ----
[ 139.281162]   lock(&q->rq_qos_mutex);
[ 139.281166]                               lock(&q->elevator_lock);
[ 139.281170]                               lock(&q->rq_qos_mutex);
[ 139.281174]   lock(pcpu_alloc_mutex);
[ 139.281178]
*** DEADLOCK ***
[ 139.281179] 8 locks held by fb-cgroups-setu/1238:
[ 139.281183] #0: ffff888114b3aaf8 (&f->f_pos_lock){+.+.}-{4:4}, at: fdget_pos+0x22c/0x2e0
[ 139.281197] #1: ffff888148a7c400 (sb_writers#8){.+.+}-{0:0}, at: ksys_write+0xf3/0x1d0
[ 139.281210] #2: ffff88819c5a1088 (&of->mutex#2){+.+.}-{4:4}, at: kernfs_fop_write_iter+0x212/0x520
[ 139.281223] #3: ffff888124ec8b48 (kn->active#101){.+.+}-{0:0}, at: kernfs_fop_write_iter+0x235/0x520
[ 139.281236] #4: ffff88813ba100a8 (&q->q_usage_counter(io)#2){++++}-{0:0}, at: blk_mq_freeze_queue_nomemsave+0xe/0x20
[ 139.281249] #5: ffff88813ba100e0 (&q->q_usage_counter(queue)#2){+.+.}-{0:0}, at: blk_mq_freeze_queue_nomemsave+0xe/0x20
[ 139.281262] #6: ffff88813ba105c0 (&q->elevator_lock){+.+.}-{4:4}, at: blkg_conf_open_bdev_frozen+0x1c8/0x2b0
[ 139.281275] #7: ffff88813ba10298 (&q->rq_qos_mutex){+.+.}-{4:4}, at: blkg_conf_open_bdev_frozen+0x218/0x2b0
[ 139.281286]
stack backtrace:
[ 139.281291] CPU: 34 UID: 0 PID: 1238 Comm: fb-cgroups-setu Tainted: G N 6.14.0-13254-g2ecc111972cc #114 PREEMPT(undef)
[ 139.281299] Tainted: [N]=TEST
[ 139.281301] Hardware name: Quanta Twin Lakes MP/Twin Lakes Passive MP, BIOS F09_3A23 12/08/2020
[ 139.281304] Call Trace:
[ 139.281307] <TASK>
[ 139.281309] dump_stack_lvl+0x7e/0xc0
[ 139.281318] print_circular_bug+0x2d8/0x410
[ 139.281326] check_noncircular+0x12b/0x140
[ 139.281336] __lock_acquire+0x1569/0x2640
[ 139.281348] lock_acquire+0x179/0x330
[ 139.281353] ? pcpu_alloc_noprof+0x96f/0x1000
[ 139.281365] __mutex_lock+0x17b/0x17c0
[ 139.281370] ? pcpu_alloc_noprof+0x96f/0x1000
[ 139.281375] ? __kasan_kmalloc+0x77/0x90
[ 139.281380] ? ioc_qos_write+0x468/0xbc0
[ 139.281384] ? cgroup_file_write+0x1a3/0x6f0
[ 139.281390] ? pcpu_alloc_noprof+0x96f/0x1000
[ 139.281395] ? ksys_write+0xf3/0x1d0
[ 139.281399] ? do_syscall_64+0x6e/0x190
[ 139.281404] ? entry_SYSCALL_64_after_hwframe+0x4b/0x53
[ 139.281411] ? mutex_lock_io_nested+0x1570/0x1570
[ 139.281418] ? do_raw_spin_lock+0x12c/0x270
[ 139.281425] ? find_held_lock+0x2b/0x80
[ 139.281432] ? mark_held_locks+0x49/0x70
[ 139.281437] ? _raw_spin_unlock_irqrestore+0x55/0x70
[ 139.281442] ? lockdep_hardirqs_on+0x78/0x100
[ 139.281449] ? pcpu_alloc_noprof+0x96f/0x1000
[ 139.281454] pcpu_alloc_noprof+0x96f/0x1000
[ 139.281465] ? kasan_save_track+0x10/0x30
[ 139.281471] blk_iocost_init+0x6f/0x820
[ 139.281480] ioc_qos_write+0x468/0xbc0
[ 139.281485] ? __lock_acquire+0x42c/0x2640
[ 139.281494] ? ioc_cost_model_write+0x7a0/0x7a0
[ 139.281501] ? __lock_acquire+0x42c/0x2640
[ 139.281509] ? rcu_is_watching+0x11/0xb0
[ 139.281519] ? find_held_lock+0x2b/0x80
[ 139.281525] ? kernfs_root+0xb2/0x1c0
[ 139.281532] ? kernfs_root+0xbc/0x1c0
[ 139.281539] cgroup_file_write+0x1a3/0x6f0
[ 139.281546] ? cgroup_addrm_files+0xa90/0xa90
[ 139.281552] ? __virt_addr_valid+0x1e1/0x3c0
[ 139.281563] ? cgroup_addrm_files+0xa90/0xa90
[ 139.281568] kernfs_fop_write_iter+0x350/0x520
[ 139.281576] vfs_write+0x9b2/0xf50
[ 139.281583] ? kernel_write+0x550/0x550
[ 139.281600] ksys_write+0xf3/0x1d0
[ 139.281606] ? __ia32_sys_read+0xa0/0xa0
[ 139.281611] ? rcu_is_watching+0x11/0xb0
[ 139.281620] do_syscall_64+0x6e/0x190
[ 139.281626] entry_SYSCALL_64_after_hwframe+0x4b/0x53
[ 139.281631] RIP: 0033:0x7f9d58116f8d
[ 139.281643] Code: e5 48 83 ec 20 48 89 55 e8 48 89 75 f0 89 7d f8 e8 a8 ca f7 ff 41 89 c0 48 8b 55 e8 48 8b 75 f0 8b 7d f8 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 3b 44 89 c7 48 89 45 f8 e8 df ca f7 ff 48 8b
[ 139.281648] RSP: 002b:00007ffcdd0a6890 EFLAGS: 00000293 ORIG_RAX: 0000000000000001
[ 139.281653] RAX: ffffffffffffffda RBX: 0000000000000043 RCX: 00007f9d58116f8d
[ 139.281656] RDX: 0000000000000043 RSI: 00007f9d56ecf200 RDI: 0000000000000007
[ 139.281659] RBP: 00007ffcdd0a68b0 R08: 0000000000000000 R09: 00007f9d57a19010
[ 139.281662] R10: 00007f9d5800afd0 R11: 0000000000000293 R12: 0000000000000043
[ 139.281665] R13: 0000000000000007 R14: 00007ffcdd0a70f0 R15: 0000000000000000
[ 139.281676] </TASK>
Thanks.
--
tejun