linux-kernel - Re: [PATCH v2] nvme: fix reconnection fail due to reserved tag allocation

Message-ID: <800a2a54-9156-482f-9bc7-e99eea7dad5a@nvidia.com>
Date: Fri, 8 Mar 2024 00:19:47 +0000
From: Chaitanya Kulkarni <chaitanyak@...dia.com>
To: brookxu.cn <brookxu.cn@...il.com>, "kbusch@...nel.org"
	<kbusch@...nel.org>, "axboe@...nel.dk" <axboe@...nel.dk>, "hch@....de"
	<hch@....de>, "sagi@...mberg.me" <sagi@...mberg.me>
CC: "linux-nvme@...ts.infradead.org" <linux-nvme@...ts.infradead.org>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH v2] nvme: fix reconnection fail due to reserved tag
 allocation

On 3/7/24 03:06, brookxu.cn wrote:
> From: Chunguang Xu <chunguang.xu@...pee.com>
>
> We found a issue on production environment while using NVMe
> over RDMA, admin_q reconnect failed forever while remote
> target and network is ok. After dig into it, we found it
> may caused by a ABBA deadlock due to tag allocation. In my
> case, the tag was hold by a keep alive request waiting
> inside admin_q, as we quiesced admin_q while reset ctrl,
> so the request maked as idle and will not process before
> reset success. As fabric_q shares tagset with admin_q,
> while reconnect remote target, we need a tag for connect
> command, but the only one reserved tag was held by keep
> alive command which waiting inside admin_q. As a result,
> we failed to reconnect admin_q forever. In order to fix
> this issue, I think we should keep two reserved tags for
> admin queue.

Please consider rewrapping the commit message so each line uses the
full line length, with no change in wording:

We found a issue on production environment while using NVMe over RDMA,
admin_q reconnect failed forever while remote target and network is ok.
After dig into it, we found it may caused by a ABBA deadlock due to tag
allocation. In my case, the tag was hold by a keep alive request
waiting inside admin_q, as we quiesced admin_q while reset ctrl, so the
request maked as idle and will not process before reset success. As
fabric_q shares tagset with admin_q, while reconnect remote target, we
need a tag for connect command, but the only one reserved tag was held
by keep alive command which waiting inside admin_q. As a result, we
failed to reconnect admin_q forever. In order to fix this issue, I think
we should keep two reserved tags for admin queue.
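
For illustration only, the situation described above can be modelled
with a toy user-space sketch. This is not the kernel code and does not
use the blk-mq API; struct reserved_pool, pool_get() and
reconnect_can_proceed() are hypothetical names made up for this
example. It assumes, as the commit message says, that the keep alive
request holds a reserved tag and cannot complete while admin_q is
quiesced, and that the connect command needs a reserved tag from the
same shared tagset.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_RESERVED 8

/* Toy pool of reserved tags (hypothetical; not the real blk-mq tag set). */
struct reserved_pool {
	int nr_reserved;
	bool in_use[MAX_RESERVED];
};

static void pool_init(struct reserved_pool *p, int nr_reserved)
{
	assert(nr_reserved >= 0 && nr_reserved <= MAX_RESERVED);
	p->nr_reserved = nr_reserved;
	for (int i = 0; i < MAX_RESERVED; i++)
		p->in_use[i] = false;
}

/* Non-blocking allocation: returns a tag index, or -1 if none is free. */
static int pool_get(struct reserved_pool *p)
{
	for (int i = 0; i < p->nr_reserved; i++) {
		if (!p->in_use[i]) {
			p->in_use[i] = true;
			return i;
		}
	}
	return -1;
}

/*
 * Models a reset: the keep alive request already holds a reserved tag
 * and is parked on the quiesced admin queue, so it never releases the
 * tag. Returns true if the connect command can still get a tag.
 */
static bool reconnect_can_proceed(int nr_reserved)
{
	struct reserved_pool pool;

	pool_init(&pool, nr_reserved);

	/* Keep alive takes a reserved tag and then stalls. */
	if (pool_get(&pool) < 0)
		return false;

	/* Connect needs its own reserved tag from the same pool. */
	return pool_get(&pool) >= 0;
}
```

In this model reconnect_can_proceed(1) is false, since the stalled keep
alive holds the only reserved tag, while reconnect_can_proceed(2) is
true, which matches the two reserved tags proposed in the patch.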

Rest of the patch looks good and follows the discussion on V1.

Reviewed-by: Chaitanya Kulkarni <kch@...dia.com>

-ck