Message-ID: <74ea389f-499f-5162-b9c0-14d02e273455@oracle.com>
Date: Sun, 22 Apr 2018 23:00:53 +0800
From: "jianchao.wang" <jianchao.w.wang@...cle.com>
To: Max Gurtovoy <maxg@...lanox.com>, keith.busch@...el.com,
axboe@...com, hch@....de, sagi@...mberg.me,
linux-nvme@...ts.infradead.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH] nvme: unquiesce the queue before cleanup it
Hi Max
That's really appreciated!
Here are my test scripts.
loop_reset_controller.sh
#!/bin/bash
while true
do
    echo 1 > /sys/block/nvme0n1/device/reset_controller
    sleep 1
done
loop_unbind_driver.sh
#!/bin/bash
while true
do
    echo "0000:02:00.0" > /sys/bus/pci/drivers/nvme/unbind
    sleep 2
    echo "0000:02:00.0" > /sys/bus/pci/drivers/nvme/bind
    sleep 2
done
loop_io.sh
#!/bin/bash
file="/dev/nvme0n1"
echo "$file"
while true
do
    if [ -e "$file" ]; then
        fio fio_job_rand_read.ini
    else
        echo "Not found"
        sleep 1
    fi
done
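For convenience, the three loops can be started together with a small wrapper. This is just a sketch; it assumes the three scripts above sit in the current directory and are executable:

```shell
#!/bin/bash
# Start the reproduction loops in the order used below:
# IO first, then controller reset, then driver bind/unbind.
start_loops() {
    pids=""
    for s in loop_io.sh loop_reset_controller.sh loop_unbind_driver.sh; do
        ./"$s" &
        pids="$pids $!"
        sleep 1              # small stagger between the loops
    done
    echo "started loops (pids:$pids)"
    # Stop everything on Ctrl-C, then wait for the loops to exit.
    trap 'kill $pids 2>/dev/null' INT TERM
    wait
}
```

start_loops blocks until interrupted; Ctrl-C stops all three loops at once.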
The fio job file is as below:
[randread]
size=512m
rw=randread
bs=4k
ioengine=libaio
iodepth=64
direct=1
numjobs=16
filename=/dev/nvme0n1
group_reporting
I started them in sequence: loop_io.sh, loop_reset_controller.sh, loop_unbind_driver.sh.
And if lucky, I will get an io hang within 3 minutes. ;)
Such as:
[ 142.858074] nvme nvme0: pci function 0000:02:00.0
[ 144.972256] nvme nvme0: failed to mark controller state 1
[ 144.972289] nvme nvme0: Removing after probe failure status: 0
[ 185.312344] INFO: task bash:1673 blocked for more than 30 seconds.
[ 185.312889] Not tainted 4.17.0-rc1+ #6
[ 185.312950] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 185.313049] bash D 0 1673 1629 0x00000080
[ 185.313061] Call Trace:
[ 185.313083] ? __schedule+0x3de/0xac0
[ 185.313103] schedule+0x3c/0x90
[ 185.313111] blk_mq_freeze_queue_wait+0x44/0x90
[ 185.313123] ? wait_woken+0x90/0x90
[ 185.313133] blk_cleanup_queue+0xe1/0x280
[ 185.313145] nvme_ns_remove+0x1c8/0x260
[ 185.313159] nvme_remove_namespaces+0x7f/0xa0
[ 185.313170] nvme_remove+0x6c/0x130
[ 185.313181] pci_device_remove+0x36/0xb0
[ 185.313193] device_release_driver_internal+0x160/0x230
[ 185.313205] unbind_store+0xfe/0x150
[ 185.313219] kernfs_fop_write+0x114/0x190
[ 185.313234] __vfs_write+0x23/0x150
[ 185.313246] ? rcu_read_lock_sched_held+0x3f/0x70
[ 185.313252] ? preempt_count_sub+0x92/0xd0
[ 185.313259] ? __sb_start_write+0xf8/0x200
[ 185.313271] vfs_write+0xc5/0x1c0
[ 185.313284] ksys_write+0x45/0xa0
[ 185.313298] do_syscall_64+0x5a/0x1a0
[ 185.313308] entry_SYSCALL_64_after_hwframe+0x49/0xbe
And I get the following information from block debugfs:
root@...l-ThinkCentre-M910s:/sys/kernel/debug/block/nvme0n1# cat hctx6/cpu6/rq_list
000000001192d19b {.op=READ, .cmd_flags=, .rq_flags=IO_STAT, .state=idle, .tag=69, .internal_tag=-1}
00000000c33c8a5b {.op=READ, .cmd_flags=, .rq_flags=IO_STAT, .state=idle, .tag=78, .internal_tag=-1}
root@...l-ThinkCentre-M910s:/sys/kernel/debug/block/nvme0n1# cat state
DYING|BYPASS|NOMERGES|SAME_COMP|NONROT|IO_STAT|DISCARD|NOXMERGES|INIT_DONE|NO_SG_MERGE|POLL|WC|FUA|STATS|QUIESCED
We can see there were still requests on the ctx rq_list while the request_queue was QUIESCED.
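The QUIESCED bit in that state line can also be checked from a script. A minimal sketch (the debugfs path is the one from my box; the state value is hard-coded here from the capture above, normally it would be read from the file):

```shell
#!/bin/bash
# Return success if the given flag is present in a block debugfs "state"
# line such as "DYING|BYPASS|...|QUIESCED".
has_state_flag() {
    local state="$1" flag="$2"
    case "|$state|" in
        *"|$flag|"*) return 0 ;;
        *)           return 1 ;;
    esac
}

# Typically: state=$(cat /sys/kernel/debug/block/nvme0n1/state)
state="DYING|BYPASS|NOMERGES|SAME_COMP|NONROT|IO_STAT|DISCARD|NOXMERGES|INIT_DONE|NO_SG_MERGE|POLL|WC|FUA|STATS|QUIESCED"

if has_state_flag "$state" QUIESCED; then
    echo "nvme0n1: request_queue is quiesced"
fi
```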
Thanks again !!
Jianchao
On 04/22/2018 10:48 PM, Max Gurtovoy wrote:
>
>
> On 4/22/2018 5:25 PM, jianchao.wang wrote:
>> Hi Max
>>
>> No, I only tested it on PCIe one.
>> And sorry for that I didn't state that.
>
> Please send your exact test steps and we'll run it using RDMA transport.
> I also want to run a mini regression on this one since it may affect other flows.
>
>>
>> Thanks
>> Jianchao
>>
>> On 04/22/2018 10:18 PM, Max Gurtovoy wrote:
>>> Hi Jianchao,
>>> Since this patch is in the core, have you tested it using some fabrics drivers too? RDMA/FC?
>>>
>>> thanks,
>>> Max.
>>>
>>> On 4/22/2018 4:32 PM, jianchao.wang wrote:
>>>> Hi keith
>>>>
>>>> Would you please take a look at this patch.
>>>>
>>>> This issue can be reproduced easily with a driver bind/unbind loop,
>>>> a reset loop and an IO loop running at the same time.
>>>>
>>>> Thanks
>>>> Jianchao
>>>>
>>>> On 04/19/2018 04:29 PM, Jianchao Wang wrote:
>>>>> There is a race between nvme_remove and nvme_reset_work that can
>>>>> lead to an io hang.
>>>>>
>>>>> nvme_remove                        nvme_reset_work
>>>>> -> change state to DELETING
>>>>>                                    -> fail to change state to LIVE
>>>>>                                    -> nvme_remove_dead_ctrl
>>>>>                                      -> nvme_dev_disable
>>>>>                                        -> quiesce request_queue
>>>>>                                      -> queue remove_work
>>>>> -> cancel_work_sync reset_work
>>>>> -> nvme_remove_namespaces
>>>>>   -> splice ctrl->namespaces
>>>>>                                    nvme_remove_dead_ctrl_work
>>>>>                                    -> nvme_kill_queues
>>>>> -> nvme_ns_remove                     do nothing (list already spliced)
>>>>>   -> blk_cleanup_queue
>>>>>     -> blk_freeze_queue
>>>>>
>>>>> Finally, the request_queue is still in quiesced state when we wait
>>>>> for the freeze, so we get an io hang here.
>>>>>
>>>>> To fix it, unquiesce the request_queue directly before nvme_ns_remove.
>>>>> We have already spliced ctrl->namespaces, so nobody can access the
>>>>> namespaces or quiesce the queue any more.
>>>>>
>>>>> Signed-off-by: Jianchao Wang <jianchao.w.wang@...cle.com>
>>>>> ---
>>>>>  drivers/nvme/host/core.c | 9 ++++++++-
>>>>>  1 file changed, 8 insertions(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
>>>>> index 9df4f71..0e95082 100644
>>>>> --- a/drivers/nvme/host/core.c
>>>>> +++ b/drivers/nvme/host/core.c
>>>>> @@ -3249,8 +3249,15 @@ void nvme_remove_namespaces(struct nvme_ctrl *ctrl)
>>>>>  	list_splice_init(&ctrl->namespaces, &ns_list);
>>>>>  	up_write(&ctrl->namespaces_rwsem);
>>>>> -	list_for_each_entry_safe(ns, next, &ns_list, list)
>>>>> +	/*
>>>>> +	 * After splicing the namespaces list from ctrl->namespaces,
>>>>> +	 * nobody can get at them anymore; unquiesce the request_queue
>>>>> +	 * forcibly to avoid an io hang.
>>>>> +	 */
>>>>> +	list_for_each_entry_safe(ns, next, &ns_list, list) {
>>>>> +		blk_mq_unquiesce_queue(ns->queue);
>>>>>  		nvme_ns_remove(ns);
>>>>> +	}
>>>>>  }
>>>>>  EXPORT_SYMBOL_GPL(nvme_remove_namespaces);
>>>>>
>>>>
>>>> _______________________________________________
>>>> Linux-nvme mailing list
>>>> Linux-nvme@...ts.infradead.org
>>>> http://lists.infradead.org/mailman/listinfo/linux-nvme
>>>>
>>>
>>>