linux-kernel - Re: [PATCH for-next] RDMA/cm: Rate limit destroy CM ID timeout error message


Message-ID: <20B80D1B-6D11-446E-8EA2-5696FA66494C@oracle.com>
Date: Tue, 21 Oct 2025 16:32:54 +0000
From: Haakon Bugge <haakon.bugge@...cle.com>
To: Jason Gunthorpe <jgg@...pe.ca>
CC: Sean Hefty <shefty@...dia.com>, Jacob Moroni <jmoroni@...gle.com>,
        Leon Romanovsky <leon@...nel.org>,
        Vlad Dumitrescu <vdumitrescu@...dia.com>,
        Or Har-Toov <ohartoov@...dia.com>,
        Manjunath Patil <manjunath.b.patil@...cle.com>,
        OFED mailing list <linux-rdma@...r.kernel.org>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH for-next] RDMA/cm: Rate limit destroy CM ID timeout error message

> On 16 Oct 2025, at 20:01, Jason Gunthorpe <jgg@...pe.ca> wrote:
> 
> On Thu, Oct 16, 2025 at 04:43:16PM +0000, Haakon Bugge wrote:
> 
>> Well, I started off this thread thinking a cm_deref_id() was missing
>> somewhere, but now I am more inclined to think as you do, this is an
>> unrecoverable situation, and I should work with NVIDIA to fix it.
> 
> If the VF is just stuck and not progressing QPs for whatever reason
> then yes absolutely.

We are running with MULTI_PORT_VHCA_EN=1 (i.e., one device, two ports), and I see that only one of the ports in the function gets into this situation. And yes, an mlnx ticket has been filed.

> At best all we can do is detect stuck QPs and try to recover them as I
> described.

It applies to all QPs, not only GSI MADs, and, as reported above, new QPs created from user-space run into the same situation. I tried an FLR, but the RDMA stack is not able to recover from it. So, from my perspective, only a reboot helps. In other words, unrecoverable from a SW perspective.

> How hard/costly would it be to have a tx timer watchdog on the mad
> layer send q?

Maybe Steve can answer that. But from my perspective, the "destroy CM ID timeout error" message is _the_ signature of the situation. And anyone seeing it would probably read through this thread...

> At the very least we could log a stuck MAD QP..

That won't hurt, but I do not expect all other ULPs and user-space apps to handle the case where a CQE is expected but never comes.

Thxs, Håkon