Message-ID: <CADtkEeeiNDO87L9MwC392gEp7YhhGGxojRu8nW_epkTe-jxcyg@mail.gmail.com>
Date: Thu, 7 Mar 2024 18:32:27 +0800
From: 许春光 <brookxu.cn@...il.com>
To: Sagi Grimberg <sagi@...mberg.me>
Cc: kbusch@...nel.org, axboe@...nel.dk, hch@....de, 
	linux-nvme@...ts.infradead.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH] nvme: fix reconnection fail due to reserved tag allocation

Thanks for the review. It seems that we should revert commit
ed01fee283a0, which appears to be just a standalone 'optimization'.
If there are no objections, I will send another patch.

Thanks

Sagi Grimberg <sagi@...mberg.me> wrote on Thu, 7 Mar 2024 at 17:36:
>
>
>
> On 28/02/2024 11:14, brookxu.cn wrote:
> > From: Chunguang Xu <chunguang.xu@...pee.com>
> >
> > We found an issue in a production environment while using NVMe
> > over RDMA: admin_q reconnect failed forever even though the remote
> > target and network were fine. After digging into it, we found it
> > may be caused by an ABBA deadlock due to tag allocation. In my
> > case, the tag was held by a keep alive request waiting
> > inside admin_q. As we quiesce admin_q while resetting the ctrl,
> > the request is marked as idle and will not be processed before
> > the reset succeeds. As fabric_q shares its tagset with admin_q,
> > we need a tag for the connect command while reconnecting to the
> > remote target, but the only reserved tag was held by the keep
> > alive command waiting inside admin_q. As a result,
> > we failed to reconnect admin_q forever.
> >
> > To work around this issue, I think we should not
> > retry the keep alive request while the controller is reconnecting,
> > as we have stopped keep alive while resetting the controller
> > and will start it again when init finishes, so it may be ok
> > to drop it.
>
> This is the wrong fix.
> First we should note that this is a regression caused by:
> ed01fee283a0 ("nvme-fabrics: only reserve a single tag")
>
> Then, you need to restore reserving two tags for the admin
> tagset.
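
To illustrate the tag starvation discussed above, here is a minimal toy
model in Python. It is not kernel code: the class ReservedTags and the
function reconnect_gets_tag are hypothetical names made up for this
sketch. It assumes, as described in the report, that the keep alive
request and the connect command both draw from the same reserved-tag
pool, and that the keep alive request keeps its tag while parked in the
quiesced admin_q.

```python
class ReservedTags:
    """Toy model of the reserved-tag pool shared by admin_q and fabric_q."""

    def __init__(self, count):
        self.free = count

    def try_alloc(self):
        # Non-blocking allocation: fails when no reserved tag is free.
        if self.free == 0:
            return False
        self.free -= 1
        return True


def reconnect_gets_tag(reserved_count):
    """Return True if the connect command can still get a reserved tag."""
    pool = ReservedTags(reserved_count)
    # The keep alive request takes a reserved tag and is then parked in
    # the quiesced admin_q, so its tag is never released in this model.
    pool.try_alloc()
    # The connect command needs a reserved tag from the same pool.
    return pool.try_alloc()


print(reconnect_gets_tag(1))  # False: the single tag is stuck with keep alive
print(reconnect_gets_tag(2))  # True: a second reserved tag lets connect proceed
```

With one reserved tag the connect command can never be issued, which
matches the reconnect failure in the report; with two, as suggested in
the review, the connect command still finds a free tag.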
