Message-ID: <f60dc68f-9206-2bfb-950e-cb312f1c4c8b@huawei.com>
Date: Tue, 9 Mar 2021 17:47:15 +0000
From: John Garry <john.garry@...wei.com>
To: Bart Van Assche <bvanassche@....org>, <hare@...e.de>,
<ming.lei@...hat.com>, <axboe@...nel.dk>, <hch@....de>
CC: <linux-block@...r.kernel.org>, <linux-kernel@...r.kernel.org>,
<pragalla@...eaurora.org>, <kashyap.desai@...adcom.com>,
<yuyufen@...wei.com>
Subject: Re: [RFC PATCH v3 3/3] blk-mq: Lockout tagset iterator when exiting
elevator
On 08/03/2021 19:59, Bart Van Assche wrote:
>>> This changes the behavior of blk_mq_tagset_busy_iter(). What will e.g.
>>> happen if the mtip driver calls blk_mq_tagset_busy_iter(&dd->tags,
>>> mtip_abort_cmd, dd) concurrently with another blk_mq_tagset_busy_iter()
>>> call and if that causes all mtip_abort_cmd() calls to be skipped?
>>
>> I'm not sure that I understand this problem you describe. So if
>> blk_mq_tagset_busy_iter(&dd->tags, mtip_abort_cmd, dd) is called,
>> one of two things can happen:
>> a. normal operation, iter_usage_counter initially holds >= 1, and then
>> iter_usage_counter is incremented in blk_mq_tagset_busy_iter() and we
>> iter the busy tags. Any parallel call to blk_mq_tagset_busy_iter()
>> will also increase iter_usage_counter.
>> b. we're switching IO scheduler. In this scenario, first we quiesce
>> all queues. After that, there should be no active requests. At that
>> point, we ensure any calls to blk_mq_tagset_busy_iter() are finished
>> and block (or discard may be a better term) any more calls. Blocking
>> any more calls should be safe as there are no requests to iter.
>> atomic_cmpxchg() is used to set iter_usage_counter to 0, blocking any
>> more calls.
>
Hi Bart,
> My concern is about the insertion of the early return statement in
> blk_mq_tagset_busy_iter().
I take this approach because I don't see any way to use a mutual-exclusion
waiting mechanism to block calls to blk_mq_tagset_busy_iter() while the
IO scheduler is being switched.
The reason is that blk_mq_tagset_busy_iter() can be called from any
context, including hardirq.
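Concretely, the guard on the iterator side is along these lines - only a
sketch of the idea, not the literal patch, and whether the increment is
done with atomic_inc_not_zero() or an explicit cmpxchg loop is a detail;
the point is the early return when the counter is 0:

void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset,
			     busy_tag_iter_fn *fn, void *priv)
{
	/*
	 * iter_usage_counter == 0 means the elevator switch owns the
	 * tagset and every associated queue is already quiesced, so there
	 * is nothing to iterate. Return early rather than sleep, since we
	 * may be in hardirq context.
	 */
	if (!atomic_inc_not_zero(&tagset->iter_usage_counter))
		return;

	/* ... iterate the busy tags on each hctx, exactly as today ... */

	atomic_dec(&tagset->iter_usage_counter);
}
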
> Although most blk_mq_tagset_busy_iter()
> callers can handle skipping certain blk_mq_tagset_busy_iter() calls
> (e.g. when gathering statistics), I'm not sure this is safe for all
> blk_mq_tagset_busy_iter() callers. The example I cited is an example of
> a blk_mq_tagset_busy_iter() call with side effects.
I wouldn't describe it as skipping, which would imply that there are
active requests to iterate and we're just ignoring them.
It's more that we know there are no active requests, so there is no point
trying to iterate.
>
> The mtip driver allocates one tag set per request queue so quiescing
> queues should be sufficient to address my concern for the mtip driver.
>
> The NVMe core and SCSI core however share a single tag set across
> multiple namespaces / LUNs. In the error path of nvme_rdma_setup_ctrl()
> I found a call to nvme_cancel_tagset(). nvme_cancel_tagset() calls
> blk_mq_tagset_busy_iter(ctrl->tagset, nvme_cancel_request, ctrl). I'm
> not sure it is safe to skip the nvme_cancel_request() calls if the I/O
> scheduler for another NVMe namespace is being modified.
Again, I am relying on all request_queues associated with that tagset
being quiesced when switching IO scheduler at the point
blk_mq_tagset_busy_iter() is called and returns early.
Now, if there were active requests, I am relying on the request queue
quiescing to flush them. So blk_mq_tagset_busy_iter() could be called
during that quiescing period, and would continue to iterate the requests.
This does fall over if some tags are allocated without an associated
request queue, but I am not aware of such a case existing.
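
For reference, the switching side I am describing is roughly the below -
again just a sketch of the scheme, assuming the counter idles at 1 as in
scenario a. above:

	/*
	 * All request_queues sharing this tag_set have already been
	 * quiesced, so no new requests can be started.
	 */

	/*
	 * Wait for in-flight iterators to finish (counter drops back to 1),
	 * then move the counter to 0 to lock out any further
	 * blk_mq_tagset_busy_iter() calls.
	 */
	while (atomic_cmpxchg(&tagset->iter_usage_counter, 1, 0) != 1)
		cpu_relax();

	/* ... switch the IO scheduler / exit the old elevator ... */

	/* Allow blk_mq_tagset_busy_iter() calls again. */
	atomic_set(&tagset->iter_usage_counter, 1);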
Thanks,
John