Message-ID: <f60dc68f-9206-2bfb-950e-cb312f1c4c8b@huawei.com>
Date: Tue, 9 Mar 2021 17:47:15 +0000
From: John Garry <john.garry@...wei.com>
To: Bart Van Assche <bvanassche@....org>, <hare@...e.de>,
<ming.lei@...hat.com>, <axboe@...nel.dk>, <hch@....de>
CC: <linux-block@...r.kernel.org>, <linux-kernel@...r.kernel.org>,
<pragalla@...eaurora.org>, <kashyap.desai@...adcom.com>,
<yuyufen@...wei.com>
Subject: Re: [RFC PATCH v3 3/3] blk-mq: Lockout tagset iterator when exiting
elevator
On 08/03/2021 19:59, Bart Van Assche wrote:
>>> This changes the behavior of blk_mq_tagset_busy_iter(). What will e.g.
>>> happen if the mtip driver calls blk_mq_tagset_busy_iter(&dd->tags,
>>> mtip_abort_cmd, dd) concurrently with another blk_mq_tagset_busy_iter()
>>> call and if that causes all mtip_abort_cmd() calls to be skipped?
>>
>> I'm not sure that I understand this problem you describe. So if
>> blk_mq_tagset_busy_iter(&dd->tags, mtip_abort_cmd, dd) is called,
>> one of two things can happen:
>> a. normal operation, iter_usage_counter initially holds >= 1, and then
>> iter_usage_counter is incremented in blk_mq_tagset_busy_iter() and we
>> iter the busy tags. Any parallel call to blk_mq_tagset_busy_iter()
>> will also increase iter_usage_counter.
>> b. we're switching IO scheduler. In this scenario, first we quiesce
>> all queues. After that, there should be no active requests. At that
>> point, we ensure any calls to blk_mq_tagset_busy_iter() are finished
>> and block (or discard may be a better term) any more calls. Blocking
>> any more calls should be safe as there are no requests to iter.
>> atomic_cmpxchg() is used to set iter_usage_counter to 0, blocking any
>> more calls.
>
Hi Bart,
> My concern is about the insertion of the early return statement in
> blk_mq_tagset_busy_iter().
I take this approach because I don't see any way to use a mutual-exclusion
waiting mechanism to block calls to blk_mq_tagset_busy_iter() while the
IO scheduler is being switched.
The reason is that blk_mq_tagset_busy_iter() can be called from any
context, including hardirq.
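Concretely, the guard on the iterator side is along these lines - only a
sketch of the idea, not the literal patch, and whether the increment is
done with atomic_inc_not_zero() or an explicit cmpxchg loop is a detail;
the point is the early return when the counter is 0:

void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset,
			     busy_tag_iter_fn *fn, void *priv)
{
	/*
	 * iter_usage_counter == 0 means the elevator switch owns the
	 * tagset and every associated queue is already quiesced, so there
	 * is nothing to iterate. Return early rather than sleep, since we
	 * may be in hardirq context.
	 */
	if (!atomic_inc_not_zero(&tagset->iter_usage_counter))
		return;

	/* ... iterate the busy tags on each hctx, exactly as today ... */

	atomic_dec(&tagset->iter_usage_counter);
}
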
> Although most blk_mq_tagset_busy_iter()
> callers can handle skipping certain blk_mq_tagset_busy_iter() calls
> (e.g. when gathering statistics), I'm not sure this is safe for all
> blk_mq_tagset_busy_iter() callers. The example I cited is an example of
> a blk_mq_tagset_busy_iter() call with side effects.
I wouldn't describe it as skipping, which would imply that there are
active requests to iterate and we're just ignoring them.
It's more that we know there are no active requests, so there is no point
trying to iterate.
>
> The mtip driver allocates one tag set per request queue so quiescing
> queues should be sufficient to address my concern for the mtip driver.
>
> The NVMe core and SCSI core however share a single tag set across
> multiple namespaces / LUNs. In the error path of nvme_rdma_setup_ctrl()
> I found a call to nvme_cancel_tagset(). nvme_cancel_tagset() calls
> blk_mq_tagset_busy_iter(ctrl->tagset, nvme_cancel_request, ctrl). I'm
> not sure it is safe to skip the nvme_cancel_request() calls if the I/O
> scheduler for another NVMe namespace is being modified.
Again, I am relying on all request_queues associated with that tagset
being quiesced when switching IO scheduler at the point
blk_mq_tagset_busy_iter() is called and returns early.
Now, if there were active requests, I am relying on the request queue
quiescing to flush them. So blk_mq_tagset_busy_iter() could be called
during that quiescing period, and would continue to iterate the requests.
This does fall over if some tags are allocated without an associated
request queue, but I am not aware of such a case existing.
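
For reference, the switching side I am describing is roughly the below -
again just a sketch of the scheme, assuming the counter idles at 1 as in
scenario a. above:

	/*
	 * All request_queues sharing this tag_set have already been
	 * quiesced, so no new requests can be started.
	 */

	/*
	 * Wait for in-flight iterators to finish (counter drops back to 1),
	 * then move the counter to 0 to lock out any further
	 * blk_mq_tagset_busy_iter() calls.
	 */
	while (atomic_cmpxchg(&tagset->iter_usage_counter, 1, 0) != 1)
		cpu_relax();

	/* ... switch the IO scheduler / exit the old elevator ... */

	/* Allow blk_mq_tagset_busy_iter() calls again. */
	atomic_set(&tagset->iter_usage_counter, 1);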
Thanks,
John