Message-ID: <a077b829-7fed-4f72-a854-f17759867604@linux.dev>
Date: Fri, 15 Aug 2025 11:29:38 -0700
From: "Yanjun.Zhu" <yanjun.zhu@...ux.dev>
To: Daisuke Matsuda <dskmtsd@...il.com>,
Philipp Reisner <philipp.reisner@...bit.com>
Cc: Zhu Yanjun <zyjzyj2000@...il.com>, Jason Gunthorpe <jgg@...pe.ca>,
Leon Romanovsky <leon@...nel.org>, linux-rdma@...r.kernel.org,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH] rdma_rxe: call comp_handler without holding cq->cq_lock

On 8/14/25 7:07 AM, Daisuke Matsuda wrote:
> On 2025/08/14 14:33, Zhu Yanjun wrote:
>> 在 2025/8/12 8:54, Daisuke Matsuda 写道:
>>> On 2025/08/11 22:48, Zhu Yanjun wrote:
>>>> 在 2025/8/10 22:26, Philipp Reisner 写道:
>>>>> On Thu, Aug 7, 2025 at 3:09 AM Zhu Yanjun <yanjun.zhu@...ux.dev>
>>>>> wrote:
>>>>>>
>>>>>> 在 2025/8/6 5:39, Philipp Reisner 写道:
>>>>>>> Allow the comp_handler callback implementation to call ib_poll_cq().
>>>>>>> A call to ib_poll_cq() calls rxe_poll_cq() with the rdma_rxe driver.
>>>>>>> And rxe_poll_cq() locks cq->cq_lock. That leads to a spinlock deadlock.
>>>>>>>
>>>>>>> The Mellanox and Intel drivers allow a comp_handler callback
>>>>>>> implementation to call ib_poll_cq().
>>>>>>>
>>>>>>> Avoid the deadlock by calling the comp_handler callback without
>>>>>>> holding cq->cq_lock.
>>>>>>>
>>>>>>> Signed-off-by: Philipp Reisner <philipp.reisner@...bit.com>
>>>>>>
>>>>>> ERROR: test_resize_cq (tests.test_cq.CQTest.test_resize_cq)
>>>>>> Test resize CQ, start with specific value and then increase and decrease
>>>>>> ----------------------------------------------------------------------
>>>>>>
>>>>>> Traceback (most recent call last):
>>>>>>   File "/root/deb/rdma-core/tests/test_cq.py", line 135, in test_resize_cq
>>>>>>     u.poll_cq(self.client.cq)
>>>>>>   File "/root/deb/rdma-core/tests/utils.py", line 687, in poll_cq
>>>>>>     wcs = _poll_cq(cq, count, data)
>>>>>>           ^^^^^^^^^^^^^^^^^^^^^^^^^
>>>>>>   File "/root/deb/rdma-core/tests/utils.py", line 669, in _poll_cq
>>>>>>     raise PyverbsError(f'Got timeout on polling ({count} CQEs remaining)')
>>>>>> pyverbs.pyverbs_error.PyverbsError: Got timeout on polling (1 CQEs remaining)
>>>>>>
>>>>>> After I applied your patch to kernel v6.16, I got the above errors.
>>>>>>
>>>>>> Zhu Yanjun
>>>>>>
>>>>>
>>>>> Hello Zhu,
>>>>>
>>>>> When I run the test_resize_cq test in a loop (100 runs each) on the
>>>>> original code and with my patch, I get about the same failure rate.
>>>>
>>>> Add Daisuke Matsuda
>>>>
>>>> If I remember correctly, when Daisuke and I discussed the ODP
>>>> patches, we both ran tests with rxe, and from our test results it
>>>> seemed that this test_resize_cq error did not occur.
>>>
>>> Hi Zhu and Philipp,
>>>
>>> As far as I know, this error has been present for some time.
>>> It might be possible to investigate further by capturing a memory
>>> dump while the polling is stuck, but I have not had time to do that
>>> yet.
>>> At least, I can confirm that this is not a regression caused by
>>> Philipp's patch.
>>
>> Hi, Daisuke
>>
>> Thanks a lot. I’m now able to consistently reproduce this problem. I
>> have created a commit here:
>> https://github.com/zhuyj/linux/commit/8db3abc00bf49cac6ea1d5718d28c6516c94fb4e.
>>
>> After applying this commit, I ran test_resize_cq 10,000 times, and
>> the problem did not occur.
>>
>> I’m not sure if there’s a better way to fix this issue. If anyone has
>> a better solution, please share it.
>
> Hi Zhu,
>
> Thank you very much for the investigation.
>
> I agree that the issue can be worked around by adding a delay in the
> rxe completer path. However, since the issue is easily reproducible,
> introducing an explicit sleep might add unnecessary overhead. I think
> a short busy-wait would be a more desirable alternative.
>
> The intermediate change below does make the issue disappear on my
> node, but I don't think this is a complete solution. In particular, it
> appears that ibcq->event_handler() (typically
> ib_uverbs_cq_event_handler()) is not re-entrant, so simply spinning
> like this could be risky.
>
> ===
> diff --git a/drivers/infiniband/sw/rxe/rxe_comp.c b/drivers/infiniband/sw/rxe/rxe_comp.c
> index a5b2b62f596b..a10a173e53cf 100644
> --- a/drivers/infiniband/sw/rxe/rxe_comp.c
> +++ b/drivers/infiniband/sw/rxe/rxe_comp.c
> @@ -454,7 +454,7 @@ static void do_complete(struct rxe_qp *qp, struct rxe_send_wqe *wqe)
>         queue_advance_consumer(qp->sq.queue, QUEUE_TYPE_FROM_CLIENT);
>
>         if (post)
> -               rxe_cq_post(qp->scq, &cqe, 0);
> +               while (rxe_cq_post(qp->scq, &cqe, 0) == -EBUSY);
>
>         if (wqe->wr.opcode == IB_WR_SEND ||
>             wqe->wr.opcode == IB_WR_SEND_WITH_IMM ||
> ===
>
> If you agree with this direction, I can take some time in the next
> week or so to make a formal patch. Of course, you are welcome to take
> over this idea if you prefer.

Thanks for building on top of my earlier proposal and for sharing your
findings. I appreciate the effort you've put into exploring this
approach.

That said, there are a few concerns we should address before moving
forward:

1. Busy-wait duration: It is difficult to guarantee that the busy-wait
   will stay short. If it lasts too long, we may hit "CPU is locked for
   too long" warnings, which could impact system responsiveness.

2. Placement of the busy-wait: The current change retries after -EBUSY
   is returned, rather than at the exact location where the condition
   arises. This may hide the actual contention source and could
   introduce side effects in other paths.

3. Reentrancy risk: Since ibcq->event_handler() (usually
   ib_uverbs_cq_event_handler()) is not re-entrant, spinning on a path
   that invokes it could be risky and lead to subtle bugs.

I think the idea of avoiding a full sleep makes sense, but perhaps we
could look into alternative approaches, for example an adaptive delay
mechanism or handling the -EBUSY condition closer to where it
originates.
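
To make that a bit more concrete, here is a rough, untested sketch of
the kind of bounded retry I have in mind. The helper name
rxe_cq_post_bounded() and the retry cap are only placeholders for
illustration, not a proposal of the final interface:

===
/* Sketch only: bound the retry so the completer cannot spin forever
 * when rxe_cq_post() keeps returning -EBUSY. The cap below is
 * illustrative, not tuned.
 */
#define RXE_CQ_POST_RETRIES     100

static int rxe_cq_post_bounded(struct rxe_cq *cq, struct rxe_cqe *cqe,
                               int solicited)
{
        int retries = RXE_CQ_POST_RETRIES;
        int err;

        do {
                err = rxe_cq_post(cq, cqe, solicited);
                if (err != -EBUSY)
                        return err;
                cpu_relax();    /* give the consumer a chance to drain */
        } while (--retries > 0);

        return err;     /* still -EBUSY; the caller must decide what to do */
}
===

A bounded loop like this at least avoids spinning indefinitely in the
completer, but if I read the code correctly every failed attempt still
goes through the -EBUSY path in rxe_cq_post() that invokes
ibcq->event_handler(), so the re-entrancy concern above remains. That
is why I suspect handling the -EBUSY condition inside rxe_cq_post()
itself would be the cleaner direction.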

If you're interested in pursuing this further, I'd be happy to review an
updated patch. I believe this direction still has potential if we
address the above points.

Thanks again for your contribution, and I look forward to your next version.

Best regards,
Yanjun.Zhu
>
> Thanks,
> Daisuke
>
>>
>> Thanks a lot.
>> Zhu Yanjun
>>
>>>
>>> Thanks,
>>> Daisuke
>>>
>>
>