Message-ID: <aLr4-V8V1ZWGMrOj@linux.alibaba.com>
Date: Fri, 5 Sep 2025 22:51:37 +0800
From: Dust Li <dust.li@...ux.alibaba.com>
To: Halil Pasic <pasic@...ux.ibm.com>
Cc: Jakub Kicinski <kuba@...nel.org>, Paolo Abeni <pabeni@...hat.com>,
Simon Horman <horms@...nel.org>,
"D. Wythe" <alibuda@...ux.alibaba.com>,
Sidraya Jayagond <sidraya@...ux.ibm.com>,
Wenjia Zhang <wenjia@...ux.ibm.com>,
Mahanta Jambigi <mjambigi@...ux.ibm.com>,
Tony Lu <tonylu@...ux.alibaba.com>,
Wen Gu <guwen@...ux.alibaba.com>, netdev@...r.kernel.org,
linux-doc@...r.kernel.org, linux-kernel@...r.kernel.org,
linux-rdma@...r.kernel.org, linux-s390@...r.kernel.org
Subject: Re: [PATCH net-next 1/2] net/smc: make wr buffer count configurable
On 2025-09-05 22:22:48, Dust Li wrote:
>On 2025-09-05 14:01:35, Halil Pasic wrote:
>>On Fri, 5 Sep 2025 11:00:59 +0200
>>Halil Pasic <pasic@...ux.ibm.com> wrote:
>>
>>> > 1. What if the two sides have different max_send_wr/max_recv_wr configurations?
>>> > IIUC, for example, if the client sets max_send_wr to 64, but the server sets
>>> > max_recv_wr to 16, the client might overflow the server's QP receive
>>> > queue, potentially causing an RNR (Receiver Not Ready) error.
>>>
>>> I don't think the 16 is spec-ed anywhere and if the client and the server
>>> need to agree on the same value it should either be speced, or a
>>> protocol mechanism for negotiating it needs to exist. So what is your
>>> take on this as an SMC maintainer?
>
>Right — I didn't realize that either until I saw this patch today :)
>But since the value has been hard-coded to 16 since day one, bumping it
>up might break things.
>
>>>
>>> I think, we have tested heterogeneous setups and didn't see any grave
>>> issues. But let me please do a follow up on this. Maybe the other
>>> maintainers can chime in as well.
>
>I'm glad to hear from others.
>
>>
>>Did some research and some thinking. Are you concerned about a
>>performance regression for e.g. 64 -> 16 compared to 16 -> 16? According
>>to my current understanding the RNR must not lead to a catastrophic
>>failure, but the RDMA/IB stack is supposed to handle that.
>
>No, it's not just a performance regression.
>If we get an RNR when going from 64 -> 16, the whole link group gets
>torn down — and all SMC connections inside it break.
>So from the user’s point of view, connections will just randomly drop
>out of nowhere.
I double-checked the code and noticed we set qp_attr.rnr_retry =
SMC_QP_RNR_RETRY = 7, which means "infinite retries".
So on an RNR NAK the QP just keeps retrying, and we never hit the fatal
retry-exceeded error that would tear the link group down.
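
For reference, the relevant bit in net/smc/smc_ib.c looks roughly like
this (abbreviated and from memory, so don't take the exact lines
literally):

	#define SMC_QP_RNR_RETRY	7	/* 7: infinite */

	int smc_ib_modify_qp_rts(struct smc_link *lnk)
	{
		struct ib_qp_attr qp_attr;

		memset(&qp_attr, 0, sizeof(qp_attr));
		qp_attr.qp_state = IB_QPS_RTS;
		/* ... timeout / retry_cnt / sq_psn setup elided ... */
		qp_attr.rnr_retry = SMC_QP_RNR_RETRY; /* 7 = retry forever on RNR NAK */
		return ib_modify_qp(lnk->roce_qp, &qp_attr,
				    IB_QP_STATE | IB_QP_TIMEOUT | IB_QP_RETRY_CNT |
				    IB_QP_SQ_PSN | IB_QP_RNR_RETRY |
				    IB_QP_MAX_QP_RD_ATOMIC);
	}

With rnr_retry = 7 the HCA just keeps resending after an RNR NAK instead
of completing the send with IB_WC_RNR_RETRY_EXC_ERR, so a 64 -> 16
mismatch only slows the sender down.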
That said, yes, it is just a performance regression after all.
So in this case I would regard it as acceptable. We can go with this.
Best regards,
Dust