Message-ID: <800a2a54-9156-482f-9bc7-e99eea7dad5a@nvidia.com>
Date: Fri, 8 Mar 2024 00:19:47 +0000
From: Chaitanya Kulkarni <chaitanyak@...dia.com>
To: brookxu.cn <brookxu.cn@...il.com>, "kbusch@...nel.org"
<kbusch@...nel.org>, "axboe@...nel.dk" <axboe@...nel.dk>, "hch@....de"
<hch@....de>, "sagi@...mberg.me" <sagi@...mberg.me>
CC: "linux-nvme@...ts.infradead.org" <linux-nvme@...ts.infradead.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH v2] nvme: fix reconnection fail due to reserved tag
allocation
On 3/7/24 03:06, brookxu.cn wrote:
> From: Chunguang Xu <chunguang.xu@...pee.com>
>
> We found a issue on production environment while using NVMe
> over RDMA, admin_q reconnect failed forever while remote
> target and network is ok. After dig into it, we found it
> may caused by a ABBA deadlock due to tag allocation. In my
> case, the tag was hold by a keep alive request waiting
> inside admin_q, as we quiesced admin_q while reset ctrl,
> so the request maked as idle and will not process before
> reset success. As fabric_q shares tagset with admin_q,
> while reconnect remote target, we need a tag for connect
> command, but the only one reserved tag was held by keep
> alive command which waiting inside admin_q. As a result,
> we failed to reconnect admin_q forever. In order to fix
> this issue, I think we should keep two reserved tags for
> admin queue.
Please consider rewrapping the commit message to use the full line
length, with no change in wording:
We found a issue on production environment while using NVMe over RDMA,
admin_q reconnect failed forever while remote target and network is ok.
After dig into it, we found it may caused by a ABBA deadlock due to tag
allocation. In my case, the tag was hold by a keep alive request
waiting inside admin_q, as we quiesced admin_q while reset ctrl, so the
request maked as idle and will not process before reset success. As
fabric_q shares tagset with admin_q, while reconnect remote target, we
need a tag for connect command, but the only one reserved tag was held
by keep alive command which waiting inside admin_q. As a result, we
failed to reconnect admin_q forever. In order to fix this issue, I think
we should keep two reserved tags for admin queue.
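For readers following the thread, the tag starvation described above can be sketched as a toy model. This is only an illustration under stated assumptions, not kernel code: `ReservedTagPool` and `reconnect` are hypothetical names, the pool stands in for the reserved tags of the tag set shared by admin_q and fabric_q, and a failed allocation stands in for the connect command not getting a tag.

```python
class ReservedTagPool:
    """Toy model of the reserved tags in a shared tag set (hypothetical, not the blk-mq API)."""

    def __init__(self, reserved_tags):
        self.free = list(range(reserved_tags))
        self.held = {}  # tag -> name of the request holding it

    def try_alloc(self, owner):
        """Return a tag, or None if every reserved tag is already held."""
        if not self.free:
            return None
        tag = self.free.pop()
        self.held[tag] = owner
        return tag

    def release(self, tag):
        del self.held[tag]
        self.free.append(tag)


def reconnect(reserved_tags):
    """Return True if the connect command can get a tag during a reset."""
    pool = ReservedTagPool(reserved_tags)

    # The keep-alive request takes a reserved tag and then waits inside the
    # quiesced admin_q; it is not processed until the reset succeeds.
    keep_alive_tag = pool.try_alloc("keep-alive")

    # The connect command on fabric_q needs a reserved tag from the same pool.
    connect_tag = pool.try_alloc("connect")
    if connect_tag is None:
        # No tag for connect, so the reset cannot succeed, so the keep-alive
        # never completes and never gives its tag back.
        return False

    # Connect went through: the reset succeeds, admin_q is unquiesced, and
    # the keep-alive request can finally complete and release its tag.
    pool.release(connect_tag)
    pool.release(keep_alive_tag)
    return True


if __name__ == "__main__":
    print("1 reserved tag :", reconnect(1))
    print("2 reserved tags:", reconnect(2))
```

In this model a single reserved tag leaves nothing for the connect command once the keep-alive holds it, while a second reserved tag breaks the cycle, which is the change the commit message proposes.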
The rest of the patch looks good and follows the discussion on v1.
Reviewed-by: Chaitanya Kulkarni <kch@...dia.com>
-ck