Message-ID: <D2E28412-CC9F-497E-BF81-2DB4A8BC1C5E@oracle.com>
Date: Thu, 25 Sep 2025 11:29:49 +0000
From: Haakon Bugge <haakon.bugge@...cle.com>
To: Jacob Moroni <jmoroni@...gle.com>
CC: Jason Gunthorpe <jgg@...pe.ca>, Leon Romanovsky <leon@...nel.org>,
	Sean Hefty <shefty@...dia.com>, Vlad Dumitrescu <vdumitrescu@...dia.com>,
	Or Har-Toov <ohartoov@...dia.com>,
	Manjunath Patil <manjunath.b.patil@...cle.com>,
	OFED mailing list <linux-rdma@...r.kernel.org>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH for-next] RDMA/cm: Rate limit destroy CM ID timeout error message
Hi Jason and Jake,
> On 16 Sep 2025, at 16:36, Jacob Moroni <jmoroni@...gle.com> wrote:
>
> Does this happen when there is a missing send completion?
>
> Asking because I remember triggering this if a device encounters an
> unrecoverable
> error/VF reset while under heavy RDMA-CM activity (like a large scale
> MPI wire-up).
>
> I assumed it was because RDMA-CM was waiting for TX completions that
> would never arrive.
>
> Of course, the unrecoverable error/VF reset without generating flush
> completions was the real
> bug in my case.
I concur. I looked at the logs before the first incident but didn't see any obscure mlx5 driver messages. Looking in between the incidents, however, I saw:
kernel: mlx5_core 0000:13:01.1 ens4f16: TX timeout detected
kernel: cm_destroy_id_wait_timeout: cm_id=00000000564a7a31 timed out. state 2 -> 0, refcnt=2
kernel: mlx5_core 0000:13:01.1 ens4f16: TX timeout on queue: 12, SQ: 0x14f2a, CQ: 0x1739, SQ Cons: 0x0 SQ Prod: 0x3c5, usecs since last trans: 30224000
kernel: cm_destroy_id_wait_timeout: cm_id=00000000b821dcda timed out. state 2 -> 0, refcnt=1
kernel: cm_destroy_id_wait_timeout: cm_id=00000000edf170fa timed out. state 2 -> 0, refcnt=1
kernel: mlx5_core 0000:13:01.1 ens4f16: EQ 0x14: Cons = 0x444670, irqn = 0x28c
Not in close temporal proximity, but a six-digit number of messages was suppressed due to the flooding.
My take is that the timeouts should be monotonically increasing from the driver up through RDMA_CM (and on to the ULPs). They are not: the mlx5e_build_nic_netdev() function sets the netdev's watchdog_timeo to 15 seconds, whereas the timeout used before calling cm_destroy_id_wait_timeout() is 10 seconds.
So, the mitigation by detecting a TX timeout from netdev has not kicked in when cm_destroy_id_wait_timeout() is called.
Thxs, Håkon