Message-ID: <a077b829-7fed-4f72-a854-f17759867604@linux.dev>
Date: Fri, 15 Aug 2025 11:29:38 -0700
From: "Yanjun.Zhu" <yanjun.zhu@...ux.dev>
To: Daisuke Matsuda <dskmtsd@...il.com>,
Philipp Reisner <philipp.reisner@...bit.com>
Cc: Zhu Yanjun <zyjzyj2000@...il.com>, Jason Gunthorpe <jgg@...pe.ca>,
Leon Romanovsky <leon@...nel.org>, linux-rdma@...r.kernel.org,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH] rdma_rxe: call comp_handler without holding cq->cq_lock

On 8/14/25 7:07 AM, Daisuke Matsuda wrote:
> On 2025/08/14 14:33, Zhu Yanjun wrote:
>> 在 2025/8/12 8:54, Daisuke Matsuda 写道:
>>> On 2025/08/11 22:48, Zhu Yanjun wrote:
>>>> 在 2025/8/10 22:26, Philipp Reisner 写道:
>>>>> On Thu, Aug 7, 2025 at 3:09 AM Zhu Yanjun <yanjun.zhu@...ux.dev>
>>>>> wrote:
>>>>>>
>>>>>> 在 2025/8/6 5:39, Philipp Reisner 写道:
>>>>>>> Allow the comp_handler callback implementation to call ib_poll_cq().
>>>>>>> A call to ib_poll_cq() calls rxe_poll_cq() with the rdma_rxe driver.
>>>>>>> And rxe_poll_cq() locks cq->cq_lock. That leads to a spinlock deadlock.
>>>>>>>
>>>>>>> The Mellanox and Intel drivers allow a comp_handler callback
>>>>>>> implementation to call ib_poll_cq().
>>>>>>>
>>>>>>> Avoid the deadlock by calling the comp_handler callback without
>>>>>>> holding cq->cq_lock.
>>>>>>>
>>>>>>> Signed-off-by: Philipp Reisner <philipp.reisner@...bit.com>
>>>>>>
>>>>>> ERROR: test_resize_cq (tests.test_cq.CQTest.test_resize_cq)
>>>>>> Test resize CQ, start with specific value and then increase and decrease
>>>>>> ----------------------------------------------------------------------
>>>>>>
>>>>>> Traceback (most recent call last):
>>>>>>   File "/root/deb/rdma-core/tests/test_cq.py", line 135, in test_resize_cq
>>>>>>     u.poll_cq(self.client.cq)
>>>>>>   File "/root/deb/rdma-core/tests/utils.py", line 687, in poll_cq
>>>>>>     wcs = _poll_cq(cq, count, data)
>>>>>>           ^^^^^^^^^^^^^^^^^^^^^^^^^
>>>>>>   File "/root/deb/rdma-core/tests/utils.py", line 669, in _poll_cq
>>>>>>     raise PyverbsError(f'Got timeout on polling ({count} CQEs remaining)')
>>>>>> pyverbs.pyverbs_error.PyverbsError: Got timeout on polling (1 CQEs remaining)
>>>>>>
>>>>>> After I applied your patch to kernel v6.16, I got the above errors.
>>>>>>
>>>>>> Zhu Yanjun
>>>>>>
>>>>>
>>>>> Hello Zhu,
>>>>>
>>>>> When I run the test_resize_cq test in a loop (100 runs each) on the
>>>>> original code and with my patch, I get about the same failure rate.
>>>>
>>>> Add Daisuke Matsuda
>>>>
>>>> If I remember correctly, when Daisuke and I discussed the ODP
>>>> patches, we both ran tests with rxe, and from our test results it
>>>> seemed that this test_resize_cq error did not occur.
>>>
>>> Hi Zhu and Philipp,
>>>
>>> As far as I know, this error has been present for some time.
>>> It might be possible to investigate further by capturing a memory
>>> dump while the polling is stuck, but I have not had time to do that
>>> yet.
>>> At least, I can confirm that this is not a regression caused by
>>> Philipp's patch.
>>
>> Hi, Daisuke
>>
>> Thanks a lot. I’m now able to consistently reproduce this problem. I
>> have created a commit here:
>> https://github.com/zhuyj/linux/commit/8db3abc00bf49cac6ea1d5718d28c6516c94fb4e.
>>
>> After applying this commit, I ran test_resize_cq 10,000 times, and
>> the problem did not occur.
>>
>> I’m not sure if there’s a better way to fix this issue. If anyone has
>> a better solution, please share it.
>
> Hi Zhu,
>
> Thank you very much for the investigation.
>
> I agree that the issue can be worked around by adding a delay in the
> rxe completer path. However, since the issue is easily reproducible,
> introducing an explicit sleep might add unnecessary overhead. I think
> a short busy-wait would be a more desirable alternative.
>
> The intermediate change below does make the issue disappear on my
> node, but I don't think this is a complete solution. In particular, it
> appears that ibcq->event_handler() (typically
> ib_uverbs_cq_event_handler()) is not re-entrant, so simply spinning
> like this could be risky.
>
> ===
> diff --git a/drivers/infiniband/sw/rxe/rxe_comp.c b/drivers/infiniband/sw/rxe/rxe_comp.c
> index a5b2b62f596b..a10a173e53cf 100644
> --- a/drivers/infiniband/sw/rxe/rxe_comp.c
> +++ b/drivers/infiniband/sw/rxe/rxe_comp.c
> @@ -454,7 +454,7 @@ static void do_complete(struct rxe_qp *qp, struct rxe_send_wqe *wqe)
>         queue_advance_consumer(qp->sq.queue, QUEUE_TYPE_FROM_CLIENT);
>
>         if (post)
> -               rxe_cq_post(qp->scq, &cqe, 0);
> +               while (rxe_cq_post(qp->scq, &cqe, 0) == -EBUSY);
>
>         if (wqe->wr.opcode == IB_WR_SEND ||
>             wqe->wr.opcode == IB_WR_SEND_WITH_IMM ||
> ===
>
> If you agree with this direction, I can take some time in the next
> week or so to make a formal patch. Of course, you are welcome to take
> over this idea if you prefer.

Thanks for building on top of my earlier proposal and for sharing your
findings. I appreciate the effort you've put into exploring this
approach.

That said, there are a few concerns we should address before moving
forward:

1. Busy-wait duration: It is difficult to guarantee that the busy-wait
   will stay short. If it lasts too long, we may hit "CPU is locked for
   too long" warnings, which could impact system responsiveness.

2. Placement of the busy-wait: The current change retries after -EBUSY
   is returned, rather than at the exact location where the condition
   arises. This may hide the actual contention source and could
   introduce side effects in other paths.

3. Reentrancy risk: Since ibcq->event_handler() (usually
   ib_uverbs_cq_event_handler()) is not re-entrant, spinning on a path
   that invokes it could be risky and lead to subtle bugs.

I think the idea of avoiding a full sleep makes sense, but perhaps we
could look into alternative approaches, for example an adaptive delay
mechanism or handling the -EBUSY condition closer to where it
originates.
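
To make that a bit more concrete, here is a rough, untested sketch of
the kind of bounded retry I have in mind. The helper name
rxe_cq_post_bounded() and the retry cap are only placeholders for
illustration, not a proposal of the final interface:

===
/* Sketch only: bound the retry so the completer cannot spin forever
 * when rxe_cq_post() keeps returning -EBUSY. The cap below is
 * illustrative, not tuned.
 */
#define RXE_CQ_POST_RETRIES     100

static int rxe_cq_post_bounded(struct rxe_cq *cq, struct rxe_cqe *cqe,
                               int solicited)
{
        int retries = RXE_CQ_POST_RETRIES;
        int err;

        do {
                err = rxe_cq_post(cq, cqe, solicited);
                if (err != -EBUSY)
                        return err;
                cpu_relax();    /* give the consumer a chance to drain */
        } while (--retries > 0);

        return err;     /* still -EBUSY; the caller must decide what to do */
}
===

A bounded loop like this at least avoids spinning indefinitely in the
completer, but if I read the code correctly every failed attempt still
goes through the -EBUSY path in rxe_cq_post() that invokes
ibcq->event_handler(), so the re-entrancy concern above remains. That
is why I suspect handling the -EBUSY condition inside rxe_cq_post()
itself would be the cleaner direction.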

If you're interested in pursuing this further, I'd be happy to review an
updated patch. I believe this direction still has potential if we
address the above points.

Thanks again for your contribution, and I look forward to your next version.

Best regards,
Yanjun.Zhu
>
> Thanks,
> Daisuke
>
>>
>> Thanks a lot.
>> Zhu Yanjun
>>
>>>
>>> Thanks,
>>> Daisuke
>>>
>>
>