linux-kernel - Re: [PATCH] nvme: fix reconnection fail due to reserved tag allocation

Message-ID: <6eae3879-f9d2-4fe3-96b1-c9e2aa939264@grimberg.me>
Date: Thu, 7 Mar 2024 11:36:13 +0200
From: Sagi Grimberg <sagi@...mberg.me>
To: "brookxu.cn" <brookxu.cn@...il.com>, kbusch@...nel.org, axboe@...nel.dk,
 hch@....de
Cc: linux-nvme@...ts.infradead.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH] nvme: fix reconnection fail due to reserved tag
 allocation



On 28/02/2024 11:14, brookxu.cn wrote:
> From: Chunguang Xu <chunguang.xu@...pee.com>
>
> We found an issue in a production environment using NVMe
> over RDMA: admin_q reconnection failed forever even though the
> remote target and the network were fine. After digging into it,
> we found it may be caused by an ABBA deadlock in tag allocation.
> In my case, the tag was held by a keep-alive request waiting
> inside admin_q. Because we quiesce admin_q while resetting the
> controller, the request is marked as idle and will not be
> processed until the reset succeeds. As fabric_q shares its
> tagset with admin_q, reconnecting to the remote target needs a
> tag for the connect command, but the only reserved tag was
> held by the keep-alive command waiting inside admin_q. As a
> result, we failed to reconnect admin_q forever.
>
> To work around this issue, I think we should not retry the
> keep-alive request while the controller is reconnecting. We
> have already stopped keep-alive while resetting the controller
> and will start it again when initialization finishes, so it
> may be OK to drop it.

This is the wrong fix.
First we should note that this is a regression caused by:
ed01fee283a0 ("nvme-fabrics: only reserve a single tag")

Then, you need to restore reserving two tags for the admin
tagset.