[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <759cb032-ccb4-0d74-ad7f-e84c791fc0af@huawei.com>
Date: Sat, 21 May 2022 11:33:11 +0800
From: "yukuai (C)" <yukuai3@...wei.com>
To: Ming Lei <ming.lei@...hat.com>
CC: <axboe@...nel.dk>, <linux-block@...r.kernel.org>,
<linux-kernel@...r.kernel.org>, <yi.zhang@...wei.com>
Subject: Re: [PATCH -next v2] blk-mq: fix panic during blk_mq_run_work_fn()
在 2022/05/20 21:56, Ming Lei 写道:
> On Fri, May 20, 2022 at 08:01:31PM +0800, yukuai (C) wrote:
>> 在 2022/05/20 19:39, Ming Lei 写道:
>>
>>>
>>> In short:
>>>
>>> 1) run queue can be in-progress during cleanup queue, or returns from
>>> cleanup queue; we drain it in both blk_cleanup_queue() and
>>> disk_release_mq(), see commit 2a19b28f7929 ("blk-mq: cancel blk-mq dispatch
>>> work in both blk_cleanup_queue and disk_release()")
>> I understand that, however, there is no garantee new 'hctx->run_work'
>> won't be queued after 'drain it', for this crash, I think this is how
>
> No, run queue activity will be shutdown after both disk_release_mq()
> and blk_cleanup_queue() are done.
>
> disk_release_mq() is called after all FS IOs are done, so there isn't
> any run queue from FS IO code path, either sync or async.
>
> In blk_cleanup_queue(), we only focus on passthrough request, and
> passthrough request is always explicitly allocated & freed by
> its caller, so once queue is frozen, all sync dispatch activity
> for passthrough request has been done, then it is enough to just cancel
> dispatch work for avoiding any dispatch activity.
>
Hi, Ming
Thanks for you explanation, it really help me understand the code
better.
In our test kernel, elevator_exit() is not called from
disk_release_mq(), that is the reason I thought differently about
root cause...
> That is why both request queue and hctx can be released safely
> after the two are done.
>
>> it triggered:
>>
>> assum that there is no io, while some bfq_queue is still busy:
>>
>> blk_cleanup_queue
>> blk_freeze_queue
>> blk_mq_cancel_work_sync
>> cancel_delayed_work_sync(hctx1)
>> blk_mq_run_work_fn -> hctx2
>> __blk_mq_run_hw_queue
>> blk_mq_sched_dispatch_requests
>> __blk_mq_do_dispatch_sched
>> blk_mq_delay_run_hw_queues
>> blk_mq_delay_run_hw_queue
>> -> add hctx1->run_work again
>> cancel_delayed_work_sync(hctx2)
>
> Yes, even blk_mq_delay_run_hw_queues() can be called after all
> hctx->run_work are canceled since __blk_mq_run_hw_queue() could be
> running in sync io code path, not via ->run_work.
>
> And my patch will fix the issue, won't it?
Yes, like I said before, your patch do make sense. It seems like
commit 28ce942fa2d5 ("block: move blk_exit_queue into disk_release")
is the real fix for the crash in out test.
Thanks,
Kuai
>
>
> Thanks,
> Ming
>
> .
>
Powered by blists - more mailing lists