Message-ID: <7de6c29f-9058-41ca-af95-f3aaf67a64d3@linux.ibm.com>
Date: Tue, 14 Oct 2025 23:27:29 +0530
From: Nilay Shroff <nilay@...ux.ibm.com>
To: Yu Kuai <hailan@...uai.org.cn>, Yu Kuai <yukuai3@...wei.com>,
ming.lei@...hat.com, tj@...nel.org, josef@...icpanda.com,
axboe@...nel.dk
Cc: cgroups@...r.kernel.org, linux-block@...r.kernel.org,
linux-kernel@...r.kernel.org, yukuai1@...weicloud.com,
yi.zhang@...wei.com, yangerkun@...wei.com, johnny.chenyi@...wei.com
Subject: Re: [PATCH 0/4] blk-rq-qos: fix possible deadlock
On 10/14/25 4:44 PM, Yu Kuai wrote:
> Hi,
>
> On 2025/10/14 18:58, Nilay Shroff wrote:
>>
>> On 10/14/25 7:51 AM, Yu Kuai wrote:
>>> Currently rq-qos debugfs entries are created from rq_qos_add(), while
>>> rq_qos_add() requires the queue to be frozen. This can deadlock because
>>> creating new entries can trigger fs reclaim.
>>>
>>> Fix this problem by delaying creation of the rq-qos debugfs entries
>>> until the policy's initialization is complete.
>>>
>>> - For wbt, it can be initialized by default or by blk-sysfs; fix it by
>>> delaying until after wbt_init();
>>> - For other policies, they can only be initialized by blkg configuration;
>>> fix it by delaying until blkg_conf_end();
>>>
>>> Note this set is built on top of my other thread:
>>> https://lore.kernel.org/all/20251010091446.3048529-1-yukuai@kernel.org/
>>>
>>> And the deadlock can be reproduced with the above thread by running blktests
>>> throtl/001 with wbt enabled by default, though the deadlock is really a
>>> long-standing problem.
>>>
>> While freezing the queue we also mark the GFP_NOIO scope, so doesn't that
>> help avoid fs-reclaim? Or maybe you can share the lockdep splat
>> encountered while running throtl/001?
>
> Yes, we can avoid fs-reclaim in the task that freezes the queue; however,
> debugfs is a generic file system, and we can't avoid fs reclaim from
> all contexts.
>
> Following is the log with the above set applied and wbt enabled by default;
> the set acquires locks in this order:
>
> freeze queue -> elevator lock -> rq_qos_mutex -> blkcg_mutex
>
> However, fs-reclaim from another context causes the deadlock report.
>
>
> [ 45.632372][ T531] ======================================================
> [ 45.633734][ T531] WARNING: possible circular locking dependency detected
> [ 45.635062][ T531] 6.17.0-gfd4a560a0864-dirty #30 Not tainted
> [ 45.636220][ T531] ------------------------------------------------------
> [ 45.637587][ T531] check/531 is trying to acquire lock:
> [ 45.638626][ T531] ffff9473884382b0 (&q->rq_qos_mutex){+.+.}-{4:4}, at: blkg_conf_start+0x116/0x190
> [ 45.640416][ T531]
> [ 45.640416][ T531] but task is already holding lock:
> [ 45.641828][ T531] ffff9473884385d8 (&q->elevator_lock){+.+.}-{4:4}, at: blkg_conf_start+0x108/0x190
> [ 45.643322][ T531]
> [ 45.643322][ T531] which lock already depends on the new lock.
> [ 45.643322][ T531]
> [ 45.644862][ T531]
> [ 45.644862][ T531] the existing dependency chain (in reverse order) is:
> [ 45.646046][ T531]
> [ 45.646046][ T531] -> #5 (&q->elevator_lock){+.+.}-{4:4}:
> [ 45.647052][ T531] __mutex_lock+0xd3/0x8d0
> [ 45.647716][ T531] blkg_conf_start+0x108/0x190
> [ 45.648395][ T531] tg_set_limit+0x74/0x300
> [ 45.649046][ T531] kernfs_fop_write_iter+0x14a/0x210
> [ 45.649813][ T531] vfs_write+0x29e/0x550
> [ 45.650413][ T531] ksys_write+0x74/0xf0
> [ 45.651032][ T531] do_syscall_64+0xbb/0x380
> [ 45.651730][ T531] entry_SYSCALL_64_after_hwframe+0x77/0x7f
> [ 45.652533][ T531]
> [ 45.652533][ T531] -> #4 (&q->q_usage_counter(io)#3){++++}-{0:0}:
> [ 45.653297][ T531] blk_alloc_queue+0x30b/0x350
> [ 45.653807][ T531] blk_mq_alloc_queue+0x5d/0xd0
> [ 45.654300][ T531] __blk_mq_alloc_disk+0x13/0x60
> [ 45.654810][ T531] null_add_dev+0x2f8/0x660 [null_blk]
> [ 45.655374][ T531] nullb_device_power_store+0x88/0x130 [null_blk]
> [ 45.656009][ T531] configfs_write_iter+0xcb/0x130 [configfs]
> [ 45.656614][ T531] vfs_write+0x29e/0x550
> [ 45.657045][ T531] ksys_write+0x74/0xf0
> [ 45.657497][ T531] do_syscall_64+0xbb/0x380
> [ 45.657958][ T531] entry_SYSCALL_64_after_hwframe+0x77/0x7f
> [ 45.658595][ T531]
> [ 45.658595][ T531] -> #3 (fs_reclaim){+.+.}-{0:0}:
> [ 45.659266][ T531] fs_reclaim_acquire+0xa4/0xe0
> [ 45.659783][ T531] kmem_cache_alloc_lru_noprof+0x3b/0x2a0
> [ 45.660369][ T531] alloc_inode+0x1a/0x60
> [ 45.660873][ T531] new_inode+0xd/0x90
> [ 45.661291][ T531] debugfs_create_dir+0x38/0x160
> [ 45.661806][ T531] component_debug_init+0x12/0x20
> [ 45.662316][ T531] do_one_initcall+0x68/0x2b0
> [ 45.662807][ T531] kernel_init_freeable+0x238/0x290
> [ 45.663241][ T531] kernel_init+0x15/0x130
> [ 45.663659][ T531] ret_from_fork+0x182/0x1d0
> [ 45.664054][ T531] ret_from_fork_asm+0x1a/0x30
> [ 45.664601][ T531]
> [ 45.664601][ T531] -> #2 (&sb->s_type->i_mutex_key#2){+.+.}-{4:4}:
> [ 45.665274][ T531] down_write+0x3a/0xb0
> [ 45.665669][ T531] simple_start_creating+0x51/0x110
> [ 45.666097][ T531] debugfs_start_creating+0x68/0xd0
> [ 45.666561][ T531] debugfs_create_dir+0x10/0x160
> [ 45.666970][ T531] blk_register_queue+0x8b/0x1e0
> [ 45.667386][ T531] __add_disk+0x253/0x3f0
> [ 45.667804][ T531] add_disk_fwnode+0x71/0x180
> [ 45.668205][ T531] virtblk_probe+0x9c2/0xb50
> [ 45.668603][ T531] virtio_dev_probe+0x223/0x380
> [ 45.669004][ T531] really_probe+0xc2/0x260
> [ 45.669369][ T531] __driver_probe_device+0x6e/0x100
> [ 45.669856][ T531] driver_probe_device+0x1a/0xd0
> [ 45.670263][ T531] __driver_attach+0x93/0x1a0
> [ 45.670672][ T531] bus_for_each_dev+0x87/0xe0
> [ 45.671063][ T531] bus_add_driver+0xe0/0x1d0
> [ 45.671469][ T531] driver_register+0x70/0xe0
> [ 45.671856][ T531] virtio_blk_init+0x53/0x80
> [ 45.672258][ T531] do_one_initcall+0x68/0x2b0
> [ 45.672661][ T531] kernel_init_freeable+0x238/0x290
> [ 45.673097][ T531] kernel_init+0x15/0x130
> [ 45.673510][ T531] ret_from_fork+0x182/0x1d0
> [ 45.673907][ T531] ret_from_fork_asm+0x1a/0x30
> [ 45.674319][ T531]
> [ 45.674319][ T531] -> #1 (&q->debugfs_mutex){+.+.}-{4:4}:
> [ 45.674929][ T531] __mutex_lock+0xd3/0x8d0
> [ 45.675315][ T531] rq_qos_add+0xe0/0x130
> [ 45.675717][ T531] wbt_init+0x183/0x210
> [ 45.676062][ T531] blk_register_queue+0xdf/0x1e0
> [ 45.676490][ T531] __add_disk+0x253/0x3f0
> [ 45.676844][ T531] add_disk_fwnode+0x71/0x180
> [ 45.677247][ T531] virtblk_probe+0x9c2/0xb50
> [ 45.677648][ T531] virtio_dev_probe+0x223/0x380
> [ 45.678044][ T531] really_probe+0xc2/0x260
> [ 45.678411][ T531] __driver_probe_device+0x6e/0x100
> [ 45.678875][ T531] driver_probe_device+0x1a/0xd0
> [ 45.679282][ T531] __driver_attach+0x93/0x1a0
> [ 45.679678][ T531] bus_for_each_dev+0x87/0xe0
> [ 45.680053][ T531] bus_add_driver+0xe0/0x1d0
> [ 45.680458][ T531] driver_register+0x70/0xe0
> [ 45.680823][ T531] virtio_blk_init+0x53/0x80
> [ 45.681208][ T531] do_one_initcall+0x68/0x2b0
> [ 45.681611][ T531] kernel_init_freeable+0x238/0x290
> [ 45.682027][ T531] kernel_init+0x15/0x130
> [ 45.682392][ T531] ret_from_fork+0x182/0x1d0
> [ 45.682829][ T531] ret_from_fork_asm+0x1a/0x30
> [ 45.683240][ T531]
> [ 45.683240][ T531] -> #0 (&q->rq_qos_mutex){+.+.}-{4:4}:
> [ 45.683867][ T531] __lock_acquire+0x103d/0x1650
> [ 45.684411][ T531] lock_acquire+0xbc/0x2c0
> [ 45.684798][ T531] __mutex_lock+0xd3/0x8d0
> [ 45.685172][ T531] blkg_conf_start+0x116/0x190
> [ 45.685623][ T531] tg_set_limit+0x74/0x300
> [ 45.685986][ T531] kernfs_fop_write_iter+0x14a/0x210
> [ 45.686405][ T531] vfs_write+0x29e/0x550
> [ 45.686771][ T531] ksys_write+0x74/0xf0
> [ 45.687112][ T531] do_syscall_64+0xbb/0x380
> [ 45.687514][ T531] entry_SYSCALL_64_after_hwframe+0x77/0x7f
> [ 45.687983][ T531]
> [ 45.687983][ T531] other info that might help us debug this:
> [ 45.687983][ T531]
> [ 45.688760][ T531] Chain exists of:
> [ 45.688760][ T531] &q->rq_qos_mutex --> &q->q_usage_counter(io)#3 --> &q->elevator_lock
> [ 45.688760][ T531]
> [ 45.689817][ T531] Possible unsafe locking scenario:
> [ 45.689817][ T531]
> [ 45.690359][ T531] CPU0 CPU1
> [ 45.690764][ T531] ---- ----
> [ 45.691172][ T531] lock(&q->elevator_lock);
> [ 45.691544][ T531]                                lock(&q->q_usage_counter(io)#3);
> [ 45.692108][ T531] lock(&q->elevator_lock);
> [ 45.692659][ T531] lock(&q->rq_qos_mutex);
> [ 45.692994][ T531]
> [ 45.692994][ T531] *** DEADLOCK ***
>
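
Right, the GFP_NOIO scope taken while freezing the queue is per-task: it
only covers allocations made by the freezing task itself, so a debugfs
inode allocation performed by some other task can still enter fs-reclaim.
A minimal sketch of how such a scope works, using the
memalloc_noio_save()/memalloc_noio_restore() helpers (illustrative only,
not the actual freeze path):

	unsigned int noio_flags;

	noio_flags = memalloc_noio_save();
	/*
	 * From here on, every allocation made by *this* task implicitly
	 * behaves as GFP_NOIO and so cannot recurse into fs/io reclaim.
	 * Allocations made by any other task are unaffected.
	 */

	/* ... freeze the queue and do the frozen-section work ... */

	memalloc_noio_restore(noio_flags);
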
From the above trace, I see that thread #1 (a virtblk device) is
unable to acquire &q->debugfs_mutex and is blocked on thread #2
(another virtblk device) because thread #2 has already acquired
the same &q->debugfs_mutex. This sequence of events appears to be
a false positive. The reasoning is as follows:
Each virtio block device (e.g., virtblk0, virtblk1, etc.) has its own
struct request_queue, and therefore its own instance of q->debugfs_mutex.
These mutexes are distinct objects at different memory addresses. Hence,
thread #1 cannot physically block on thread #2’s q->debugfs_mutex, since
the mutex wait queue is per-lock address.
However, lockdep does not track individual mutex addresses. Instead, it
operates at the level of lock classes, identified by a lock class key.
mutex_init() defines one static lock_class_key per call site, so every
q->debugfs_mutex initialized at the same spot ends up in the same class.
Because all q->debugfs_mutex instances share the same lock class key,
lockdep treats them as equivalent and reports a potential circular
dependency across devices, even though such a dependency cannot actually
occur at runtime. So IMO, to eliminate this false-positive warning, we
should assign a separate lockdep class/key for each queue’s q->debugfs_mutex.
This will ensure lockdep distinguishes mutex instances belonging to
different queues.
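
A minimal sketch of what I have in mind (untested; it assumes a new
per-queue lock_class_key field, here called debugfs_mutex_key):

	/* hypothetical new field in struct request_queue: */
	struct lock_class_key	debugfs_mutex_key;

	/* where the queue is allocated and the mutex is initialized: */
	mutex_init(&q->debugfs_mutex);
	lockdep_register_key(&q->debugfs_mutex_key);
	lockdep_set_class(&q->debugfs_mutex, &q->debugfs_mutex_key);

	/* and on queue release, before the queue memory is freed: */
	lockdep_unregister_key(&q->debugfs_mutex_key);

With that, each queue's debugfs_mutex gets its own lock class, so the
cross-device dependencies above would no longer be folded together.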
Thanks,
--Nilay