Message-Id: <20240428102527.37462-1-wangbing.kuang@shopee.com>
Date: Sun, 28 Apr 2024 18:25:27 +0800
From: kwb <wangbing.kuang@...pee.com>
To: sagi@...mberg.me
Cc: axboe@...com,
chunguang.xu@...pee.com,
hch@....de,
james.smart@...adcom.com,
kbusch@...nel.org,
linux-kernel@...r.kernel.org,
linux-nvme@...ts.infradead.org,
wangbing.kuang@...pee.com
Subject: Re: [Bug Report] nvme connect deadlock in allocating tag
>On 28/04/2024 12:16, Wangbing Kuang wrote:
>> "The error_recovery work should unquiesce the admin_q, which should fail
>> fast all pending admin commands,
>> so it is unclear to me how the connect process gets stuck."
>> I think the reason is: the command can be unquiesced, but the tag cannot be
>> returned until the command succeeds.
>
>The error recovery also cancels all pending requests. See
>nvme_cancel_admin_tagset
nvme_cancel_admin_tagset can cancel the requests that are pending when the
admin queue is stopped, but it cannot cancel requests that are issued after
that, before the next reconnect attempt.
The timeline is:
recovery fails (we can reproduce this by hanging I/O for a longer time)
-> reconnect delay
-> multiple "nvme list" commands are issued (they use up the tagset)
-> reconnect starts (it waits for a tag when calling nvme_enable_ctrl and nvme_wait_ready)
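
To make the timeline concrete, here is a toy model in Python. It is only a
sketch, not kernel code: ToyTagSet, simulate, the pool depth of 4 and the
optional reserved tag are all made-up names and numbers for illustration, and
a failed try_alloc() stands in for the place where the real caller would wait
for a tag.

```python
class ToyTagSet:
    """Toy model of a fixed-size tag pool (hypothetical, not kernel code)."""

    def __init__(self, depth, reserved=0):
        if not 0 <= reserved <= depth:
            raise ValueError("reserved must be between 0 and depth")
        self.free_normal = depth - reserved
        self.free_reserved = reserved

    def try_alloc(self, use_reserved=False):
        """Take one tag; return False where a real caller would wait."""
        if use_reserved:
            if self.free_reserved > 0:
                self.free_reserved -= 1
                return True
            return False
        if self.free_normal > 0:
            self.free_normal -= 1
            return True
        return False


def simulate(depth, reserved, stuck_commands):
    """Replay the timeline: stuck admin commands first, then the reconnect."""
    tags = ToyTagSet(depth, reserved)
    # Reconnect delay: commands (for example from repeated "nvme list")
    # take tags and never complete, so the tags are never returned.
    held = sum(1 for _ in range(stuck_commands) if tags.try_alloc())
    # Reconnect start: the reconnect path needs one tag of its own.
    # Assumption: it draws from the reserved pool when one exists.
    reconnect_ok = tags.try_alloc(use_reserved=reserved > 0)
    return held, reconnect_ok


if __name__ == "__main__":
    print(simulate(4, 0, 10))
    print(simulate(4, 1, 10))
```

In this model, simulate(4, 0, 10) returns (4, False): the stuck commands hold
all four tags and the reconnect gets none. simulate(4, 1, 10) returns
(3, True), because one tag was held back for the reconnect path.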
>>
>> "What is step (2) - make nvme io timeout to recover the connection?"
>> I use spdk-nvmf-target as the backend. It makes it easy to hang and unhang
>> read/write I/O on the nvmf target. So I just hang the I/O for over 30
>> seconds, which triggers an I/O timeout event on the Linux nvmf host. The
>> I/O timeout then triggers connection recovery.
>> By the way, I use multipath=0.
>
>Interesting, does this happen with multipath=Y ?
>I didn't expect people to be using multipath=0 for fabrics in the past few
>years.
Not sure, I did not test with multipath=Y. We chose multipath=0 because it
involves less code and we need only one path.
>>
>> "Is this reproducing with upstream nvme? or is this some distro kernel
>> where this happens?"
>> It is reproduced on a kernel based on v5.15, but I think this is a common
>> error.
>
>It would be beneficial to verify this.
OK, testing will need more time; for now we can only verify it on v5.15.
>Do you have the below patch applied?
>de105068fead ("nvme: fix reconnection fail due to reserved tag allocation")
Yes, my modification is inspired by that commit. Chunguang.xu is my colleague.