Message-ID: <20251016161229.GM3938986@ziepe.ca>
Date: Thu, 16 Oct 2025 13:12:29 -0300
From: Jason Gunthorpe <jgg@...pe.ca>
To: Haakon Bugge <haakon.bugge@...cle.com>
Cc: Sean Hefty <shefty@...dia.com>, Jacob Moroni <jmoroni@...gle.com>,
Leon Romanovsky <leon@...nel.org>,
Vlad Dumitrescu <vdumitrescu@...dia.com>,
Or Har-Toov <ohartoov@...dia.com>,
Manjunath Patil <manjunath.b.patil@...cle.com>,
OFED mailing list <linux-rdma@...r.kernel.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH for-next] RDMA/cm: Rate limit destroy CM ID timeout error
message
On Thu, Oct 16, 2025 at 03:25:15PM +0000, Haakon Bugge wrote:
>
>
> > On 15 Oct 2025, at 20:45, Jason Gunthorpe <jgg@...pe.ca> wrote:
> >
> > On Wed, Oct 15, 2025 at 06:34:33PM +0000, Sean Hefty wrote:
> >>>> With this hack, running cmtime with 10.000 connections in loopback,
> >>>> the "cm_destroy_id_wait_timeout: cm_id=000000007ce44ace timed out.
> >>>> state 6 -> 0, refcnt=1" messages are indeed produced. Had to kill
> >>>> cmtime because it was hanging, and then it got defunct with the
> >>>> following stack:
> >>>
> >>> Seems like a bug, it should not hang forever if a MAD is lost..
> >>
> >> The hack skipped calling ib_post_send. But the result of that is a
> >> completion is never written to the CQ.
>
>
> Which is exactly the behaviour I see when the VF gets "whacked". This is from a system without the reproducer hack. Looking at the netdev-detected TX timeout:
>
> mlx5_core 0000:af:00.2 ens4f2: TX timeout detected
> mlx5_core 0000:af:00.2 ens4f2: TX timeout on queue: 0, SQ: 0xe31ee, CQ: 0x484, SQ Cons: 0x0 SQ Prod: 0x7, usecs since last trans: 18439000
> mlx5_core 0000:af:00.2 ens4f2: EQ 0x7: Cons = 0x3ded47a, irqn = 0x197
>
> (I get tons of the like)
>
> There are two points here. The first is that all of them have "SQ Cons: 0x0", which to me implies that no TX CQE has ever been polled for any of them.
> The other point is that we do _not_ see "Recovered %d eqes on EQ
> 0x%x" (which is because mlx5_eq_poll_irq_disabled() always returns
> zero), which means that either a) no CQE has been generated by the
> HCA or b) a CQE has been generated but no corresponding EQE has been
> written to the EQ.
Lost interrupts/CQEs are an obnoxiously common bug in virtualization
environments. Be sure you are running the latest NIC firmware and that
you have all the qemu/kvm fixes.
But yes, if you hit these bugs then the QP gets effectively stuck
forever.
We don't have a stuck QP watchdog for the GMP QP, IIRC. Perhaps we
should, but I'd also argue that if you are losing interrupts for GMP
QPs then your VM platform is so broken it won't be able to run normal
RDMA applications :\
At the end of the day you must not have these "TX timeout" type
errors; they are very, very serious. Whatever bugs cause them must be
squashed.
> >> The state machine or
> >> reference counting is likely waiting for the completion, so it knows
> >> that HW is done trying to access the buffer.
> >
> > That does make sense; it has to immediately trigger the completion to
> > be accurate. A better test would be to truncate the MAD or something
> > so it can't be rx'd
>
> As argued above, I think my reproducer hack is sound and to the point.
Not quite, you are just losing CQEs. We should never lose a CQE.
Yes, perhaps your QP can become permanently stuck, and that's bad. But
the fix is to detect the stuck QP, push it through to the error state,
and drain it, generating all the flush-error CQEs without any loss.
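As a hedged illustration of that recovery path (recover_stuck_qp() is a
hypothetical helper, and this assumes the QP's CQs meet ib_drain_qp()'s
polling-context requirements):

#include <rdma/ib_verbs.h>

/* Hypothetical helper: push a wedged QP to error and flush it cleanly. */
static int recover_stuck_qp(struct ib_qp *qp)
{
	struct ib_qp_attr attr = { .qp_state = IB_QPS_ERR };
	int ret;

	/*
	 * Move the QP to the error state so the HCA stops processing and
	 * flushes all outstanding WRs (ib_drain_qp() would also do this
	 * transition on the generic path).
	 */
	ret = ib_modify_qp(qp, &attr, IB_QP_STATE);
	if (ret)
		return ret;

	/*
	 * ib_drain_qp() posts marker WRs on both queues and waits until
	 * every outstanding send/recv WR has produced a completion
	 * (flush errors included), so no CQE is lost.
	 */
	ib_drain_qp(qp);
	return 0;
}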
To better model what you are seeing, you want to do something like
randomly dropping the GMP QP doorbell ring; that will cause the QP to
get stuck in a way similar to a lost interrupt/etc.
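A hedged sketch of such a test hack, where struct my_qp, is_mad_qp, and
driver_ring_sq_doorbell() are hypothetical stand-ins for the real driver
post-send internals:

#include <linux/random.h>
#include <linux/printk.h>

/* Hypothetical driver QP, reduced to what the sketch needs. */
struct my_qp {
	bool is_mad_qp;
	/* ... real driver state ... */
};

static void driver_ring_sq_doorbell(struct my_qp *qp);	/* hypothetical */

static void maybe_ring_sq_doorbell(struct my_qp *qp)
{
	/* Test hack: drop roughly 1 in 100 doorbells on the MAD/GMP QP only. */
	if (qp->is_mad_qp && get_random_u32_below(100) == 0) {
		pr_info_ratelimited("test hack: dropping SQ doorbell\n");
		return;
	}

	driver_ring_sq_doorbell(qp);
}

Unlike skipping ib_post_send(), the WQE is still queued here but the HCA
never learns about it, which is much closer to the lost-interrupt /
lost-doorbell failure mode.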
Jason