Message-ID: <20B80D1B-6D11-446E-8EA2-5696FA66494C@oracle.com>
Date: Tue, 21 Oct 2025 16:32:54 +0000
From: Haakon Bugge <haakon.bugge@...cle.com>
To: Jason Gunthorpe <jgg@...pe.ca>
CC: Sean Hefty <shefty@...dia.com>, Jacob Moroni <jmoroni@...gle.com>,
    Leon Romanovsky <leon@...nel.org>,
    Vlad Dumitrescu <vdumitrescu@...dia.com>,
    Or Har-Toov <ohartoov@...dia.com>,
    Manjunath Patil <manjunath.b.patil@...cle.com>,
    OFED mailing list <linux-rdma@...r.kernel.org>,
    "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH for-next] RDMA/cm: Rate limit destroy CM ID timeout error message
> On 16 Oct 2025, at 20:01, Jason Gunthorpe <jgg@...pe.ca> wrote:
>
> On Thu, Oct 16, 2025 at 04:43:16PM +0000, Haakon Bugge wrote:
>
>> Well, I started off this thread thinking a cm_deref_id() was missing
>> somewhere, but now I am more inclined to think as you do, this is an
>> unrecoverable situation, and I should work with NVIDIA to fix it.
>
> If the VF is just stuck and not progressing QPs for whatever reason
> then yes absolutely.
We are running with MULTI_PORT_VHCA_EN=1 (i.e., one device, two ports), and I see that only one of the ports in the function gets into this situation. And yes, an mlnx ticket has been filed.
> At best all we can do is detect stuck QPs and try to recover them as I
> described.
It applies to all QPs, not only GSI MADs, and, as reported above, new QPs created from user-space run into the same situation. I tried an FLR, but the RDMA stack is not able to recover from it. So, from my perspective, only a reboot helps. In other words, unrecoverable from a SW perspective.
> How hard/costly would it be to have a tx timer watchdog on the mad
> layer send q?
Maybe Steve can answer that. But from my perspective, the "destroy CM ID timeout error" message is _the_ signature of the situation. And anyone seeing it would probably read through this thread...
> At the very least we could log a stuck MAD QP..
That won't hurt, but I do not expect all other ULPs and user-space apps to handle the case where a CQE is expected but never comes.
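To make the watchdog idea quoted above a bit more concrete, here is a minimal user-space sketch of the bookkeeping such a check might need. It is not the MAD layer's code: the struct sq_watchdog type and the sq_wd_*() helpers are hypothetical names invented for this illustration, the caller supplies the time in milliseconds, and the "stuck" test assumes that no completion within timeout_ms while work requests are outstanding means the send queue has stopped progressing. How the check would be driven periodically in the kernel is not modeled.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical model of a send-queue TX watchdog (illustration only). */
struct sq_watchdog {
	uint64_t timeout_ms;       /* allowed time without progress */
	uint64_t last_progress_ms; /* first post on an empty queue, or last completion */
	uint32_t outstanding;      /* work requests posted but not yet completed */
	bool reported;             /* log a stuck queue only once per stall */
};

void sq_wd_init(struct sq_watchdog *wd, uint64_t timeout_ms)
{
	wd->timeout_ms = timeout_ms;
	wd->last_progress_ms = 0;
	wd->outstanding = 0;
	wd->reported = false;
}

/* Call when a send work request is posted. */
void sq_wd_post(struct sq_watchdog *wd, uint64_t now_ms)
{
	/* The clock only starts when the queue goes from idle to busy. */
	if (wd->outstanding == 0)
		wd->last_progress_ms = now_ms;
	wd->outstanding++;
}

/* Call when a send completion (CQE) is reaped. */
void sq_wd_complete(struct sq_watchdog *wd, uint64_t now_ms)
{
	if (wd->outstanding == 0)
		return; /* spurious completion, nothing to account for */
	wd->outstanding--;
	/* Any completion counts as progress and re-arms the watchdog. */
	wd->last_progress_ms = now_ms;
	wd->reported = false;
}

/* Periodic check: returns true if the queue looks stuck. */
bool sq_wd_check(struct sq_watchdog *wd, uint64_t now_ms)
{
	if (wd->outstanding == 0)
		return false; /* idle queue cannot be stuck */
	if (now_ms - wd->last_progress_ms < wd->timeout_ms)
		return false;
	if (!wd->reported) {
		fprintf(stderr, "send queue stuck: %u WRs outstanding for %llu ms\n",
			(unsigned)wd->outstanding,
			(unsigned long long)(now_ms - wd->last_progress_ms));
		wd->reported = true;
	}
	return true;
}
```

The per-post and per-completion cost in this sketch is a counter update and a timestamp store. Note that it only detects and logs the stall; it does nothing for other ULPs or user-space apps waiting on a CQE that never arrives.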
Thxs, Håkon