Message-ID: <20251114053441.GA2037010-mkhalfella@purestorage.com>
Date: Thu, 13 Nov 2025 21:34:41 -0800
From: Mohamed Khalfella <mkhalfella@...estorage.com>
To: Chaitanya Kulkarni <chaitanyak@...dia.com>
Cc: Casey Chen <cachen@...estorage.com>,
Vikas Manocha <vmanocha@...estorage.com>,
Yuanyuan Zhong <yzhong@...estorage.com>,
Hannes Reinecke <hare@...e.de>, Ming Lei <ming.lei@...hat.com>,
"linux-nvme@...ts.infradead.org" <linux-nvme@...ts.infradead.org>,
Sagi Grimberg <sagi@...mberg.me>, Jens Axboe <axboe@...nel.dk>,
"linux-block@...r.kernel.org" <linux-block@...r.kernel.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
Keith Busch <kbusch@...nel.org>
Subject: Re: [PATCH] nvme: Convert tag_list mutex to rwsemaphore to avoid
deadlock
On Fri 2025-11-14 04:56:52 +0000, Chaitanya Kulkarni wrote:
> On 11/13/25 12:23, Mohamed Khalfella wrote:
> > The blk_mq_{add,del}_queue_tag_set() functions add and remove queues
> > from a tagset. They make sure that the tagset and its queues are marked
> > as shared when two or more queues are attached to the same tagset.
> > Initially a tagset starts as unshared, and when the number of added
> > queues reaches two, blk_mq_add_queue_tag_set() marks it as shared along
> > with all the queues attached to it. When the number of attached queues
> > drops to one, blk_mq_del_queue_tag_set() needs to mark both the tagset
> > and the remaining queues as unshared.
> >
> > Both functions need to freeze the current queues in the tagset before
> > setting or unsetting the BLK_MQ_F_TAG_QUEUE_SHARED flag. While doing so,
> > both functions hold the set->tag_list_lock mutex, which makes sense as
> > we do not want queues to be added or deleted in the process. This used
> > to work fine until commit 98d81f0df70c ("nvme: use
> > blk_mq_[un]quiesce_tagset") made the nvme driver quiesce the tagset
> > instead of quiescing individual queues. blk_mq_quiesce_tagset() does the
> > job and quiesces the queues in set->tag_list while also holding
> > set->tag_list_lock.
> >
> > This results in a deadlock between two threads with these stacktraces:
> >
> [...]
>
> >
> > The top stacktrace shows nvme_timeout() being called to handle an nvme
> > command timeout. The timeout handler is trying to disable the controller
> > and, as a first step, it needs to call blk_mq_quiesce_tagset() to tell
> > blk-mq not to call queue callback handlers. The thread is stuck waiting
> > for set->tag_list_lock as it tries to walk the queues in set->tag_list.
> >
> > The lock is held by the second thread in the bottom stack, which is
> > waiting for one of the queues to be frozen. The queue usage counter will
> > only drop to zero after nvme_timeout() finishes, which will never happen
> > because that thread is waiting for this mutex forever.
> >
> > Convert the set->tag_list_lock mutex to an rw_semaphore,
> > set->tag_list_rwsem, to avoid the deadlock. Update
> > blk_mq_[un]quiesce_tagset() to take the semaphore for read since this is
> > enough to guarantee no queues will be added or removed. Update
> > blk_mq_{add,del}_queue_tag_set() to take the
> > semaphore for write while updating set->tag_list and downgrade it to
> > read while freezing the queues. It should be safe to update set->flags
> > and hctx->flags while holding the semaphore for read since the queues
> > are already frozen.
> >
> > Fixes: 98d81f0df70c ("nvme: use blk_mq_[un]quiesce_tagset")
> > Signed-off-by: Mohamed Khalfella <mkhalfella@...estorage.com>
>
> I think there is no better way to solve this in the nvme code?
I cannot think of a way to fix this issue within the nvme code; the lock
that both threads end up needing is blk-mq's set->tag_list_lock.
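
If it helps to see the shape of the problem outside the kernel, below is a
minimal userspace sketch of why letting both sides take the lock for read
breaks the cycle. It is purely illustrative and not kernel code; the names
tag_list_rwsem, request_done, freeze_path() and timeout_path() are made up
to mirror the roles in the stacktraces. The thread that holds the lock
while waiting only excludes writers, so the thread it is waiting on can
still take the lock for read and make progress. Replace the rwlock with a
mutex and the two threads deadlock exactly as described in the commit
message.

/* Build with: cc -pthread deadlock-analogy.c */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static pthread_rwlock_t tag_list_rwsem = PTHREAD_RWLOCK_INITIALIZER;
static atomic_int request_done;

/*
 * Analog of blk_mq_{add,del}_queue_tag_set(): holds the lock while
 * waiting for the "queue usage counter" to drop.
 */
static void *freeze_path(void *arg)
{
	(void)arg;
	pthread_rwlock_rdlock(&tag_list_rwsem);
	while (!atomic_load(&request_done))
		;	/* wait for the timed-out request to be completed */
	pthread_rwlock_unlock(&tag_list_rwsem);
	puts("freeze path finished");
	return NULL;
}

/*
 * Analog of nvme_timeout(): needs the same lock before it can complete
 * the request the other thread is waiting on.
 */
static void *timeout_path(void *arg)
{
	(void)arg;
	pthread_rwlock_rdlock(&tag_list_rwsem);	/* blocks forever on a mutex */
	atomic_store(&request_done, 1);
	pthread_rwlock_unlock(&tag_list_rwsem);
	puts("timeout path finished");
	return NULL;
}

int main(void)
{
	pthread_t freezer, timeouter;

	pthread_create(&freezer, NULL, freeze_path, NULL);
	pthread_create(&timeouter, NULL, timeout_path, NULL);
	pthread_join(freezer, NULL);
	pthread_join(timeouter, NULL);
	return 0;
}
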
>
> will it have any impact on existing users, if any, that rely on the
> current mutex-based implementation?
>
I audited the code paths that use the mutex to the best of my knowledge,
and I think this change should not have an impact on existing code that
uses the mutex. Paths converted to down_read(), like blk_mq_sysfs_register()
below, still exclude the writers that modify set->tag_list; the only
behavioral change is that they no longer exclude one another.
> BTW, thanks for reporting this and providing a patch.
>
No problem.
> > ---
> > block/blk-mq-sysfs.c | 10 +++----
> > block/blk-mq.c | 63 ++++++++++++++++++++++--------------------
> > include/linux/blk-mq.h | 4 +--
> > 3 files changed, 40 insertions(+), 37 deletions(-)
> >
> > diff --git a/block/blk-mq-sysfs.c b/block/blk-mq-sysfs.c
> > index 58ec293373c6..f474781654fb 100644
> > --- a/block/blk-mq-sysfs.c
> > +++ b/block/blk-mq-sysfs.c
> > @@ -230,13 +230,13 @@ int blk_mq_sysfs_register(struct gendisk *disk)
> >
> > kobject_uevent(q->mq_kobj, KOBJ_ADD);
> >
> > - mutex_lock(&q->tag_set->tag_list_lock);
> > + down_read(&q->tag_set->tag_list_rwsem);
> > queue_for_each_hw_ctx(q, hctx, i) {
> > ret = blk_mq_register_hctx(hctx);
> > if (ret)
> > goto out_unreg;
> > }
> > - mutex_unlock(&q->tag_set->tag_list_lock);
> > + up_read(&q->tag_set->tag_list_rwsem);
> > return 0;
> >
>
> [...]
>
> > static void blk_mq_add_queue_tag_set(struct blk_mq_tag_set *set,
> > struct request_queue *q)
> > {
> > - mutex_lock(&set->tag_list_lock);
> > + down_write(&set->tag_list_rwsem);
> > + if (!list_is_singular(&set->tag_list)) {
> > + if (set->flags & BLK_MQ_F_TAG_QUEUE_SHARED)
> > + queue_set_hctx_shared(q, true);
> > + list_add_tail(&q->tag_set_list, &set->tag_list);
> > + up_write(&set->tag_list_rwsem);
> > + return;
> > + }
> >
> > - /*
> > - * Check to see if we're transitioning to shared (from 1 to 2 queues).
> > - */
> > - if (!list_empty(&set->tag_list) &&
> > - !(set->flags & BLK_MQ_F_TAG_QUEUE_SHARED)) {
> > - set->flags |= BLK_MQ_F_TAG_QUEUE_SHARED;
> > - /* update existing queue */
> > - blk_mq_update_tag_set_shared(set, true);
> > - }
> > - if (set->flags & BLK_MQ_F_TAG_QUEUE_SHARED)
> > - queue_set_hctx_shared(q, true);
> > + /* Transitioning to shared. */
> > + set->flags |= BLK_MQ_F_TAG_QUEUE_SHARED;
> > list_add_tail(&q->tag_set_list, &set->tag_list);
> > -
> > - mutex_unlock(&set->tag_list_lock);
> > + downgrade_write(&set->tag_list_rwsem);
>
> do we need a comment here about what to expect, since downgrade_write()
> is not as common as mutex_unlock()/down_write(), before merging the
> patch?
>
/*
* Downgrade the semaphore before freezing the queues to avoid
* deadlock with a thread trying to quiesce the tagset before
* completing requests.
*/
Yes, this could use some explanation. How about the three lines above?
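
To show the placement, this is just the hunk quoted above stitched
together with the proposed comment; the tail of blk_mq_add_queue_tag_set()
would then read:

	/* Transitioning to shared. */
	set->flags |= BLK_MQ_F_TAG_QUEUE_SHARED;
	list_add_tail(&q->tag_set_list, &set->tag_list);
	/*
	 * Downgrade the semaphore before freezing the queues to avoid
	 * deadlock with a thread trying to quiesce the tagset before
	 * completing requests.
	 */
	downgrade_write(&set->tag_list_rwsem);
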
> -ck
>
>