Message-ID: <D2E28412-CC9F-497E-BF81-2DB4A8BC1C5E@oracle.com>
Date: Thu, 25 Sep 2025 11:29:49 +0000
From: Haakon Bugge <haakon.bugge@...cle.com>
To: Jacob Moroni <jmoroni@...gle.com>
CC: Jason Gunthorpe <jgg@...pe.ca>, Leon Romanovsky <leon@...nel.org>,
	Sean Hefty <shefty@...dia.com>, Vlad Dumitrescu <vdumitrescu@...dia.com>,
	Or Har-Toov <ohartoov@...dia.com>,
	Manjunath Patil <manjunath.b.patil@...cle.com>,
	OFED mailing list <linux-rdma@...r.kernel.org>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH for-next] RDMA/cm: Rate limit destroy CM ID timeout error message
Hi Jason and Jake,
> On 16 Sep 2025, at 16:36, Jacob Moroni <jmoroni@...gle.com> wrote:
>
> Does this happen when there is a missing send completion?
>
> Asking because I remember triggering this if a device encounters an
> unrecoverable
> error/VF reset while under heavy RDMA-CM activity (like a large scale
> MPI wire-up).
>
> I assumed it was because RDMA-CM was waiting for TX completions that
> would never arrive.
>
> Of course, the unrecoverable error/VF reset without generating flush
> completions was the real
> bug in my case.
I concur. I looked at the logs before the first incident but didn't see any obscure mlx5 driver messages. Looking in between the incidents, however, I saw:
kernel: mlx5_core 0000:13:01.1 ens4f16: TX timeout detected
kernel: cm_destroy_id_wait_timeout: cm_id=00000000564a7a31 timed out. state 2 -> 0, refcnt=2
kernel: mlx5_core 0000:13:01.1 ens4f16: TX timeout on queue: 12, SQ: 0x14f2a, CQ: 0x1739, SQ Cons: 0x0 SQ Prod: 0x3c5, usecs since last trans: 30224000
kernel: cm_destroy_id_wait_timeout: cm_id=00000000b821dcda timed out. state 2 -> 0, refcnt=1
kernel: cm_destroy_id_wait_timeout: cm_id=00000000edf170fa timed out. state 2 -> 0, refcnt=1
kernel: mlx5_core 0000:13:01.1 ens4f16: EQ 0x14: Cons = 0x444670, irqn = 0x28c
Not in close temporal proximity, but a six-digit number of messages was suppressed due to the flooding.
My take is that the timeouts should be monotonically increasing from the driver up through RDMA_CM (and on to the ULPs). They are not: the mlx5e_build_nic_netdev() function sets the netdev's watchdog_timeo to 15 seconds, whereas the timeout used before calling cm_destroy_id_wait_timeout() is 10 seconds.
So, the mitigation by detecting a TX timeout from netdev has not kicked in when cm_destroy_id_wait_timeout() is called.
Thxs, Håkon