Message-ID: <24a398b3-e3e5-4b0d-8ed7-cd86f3e661eb@linux.alibaba.com>
Date: Sun, 28 Sep 2025 11:05:45 +0800
From: Guangguan Wang <guangguan.wang@...ux.alibaba.com>
To: Halil Pasic <pasic@...ux.ibm.com>
Cc: Jakub Kicinski <kuba@...nel.org>, Paolo Abeni <pabeni@...hat.com>,
Simon Horman <horms@...nel.org>, "D. Wythe" <alibuda@...ux.alibaba.com>,
Dust Li <dust.li@...ux.alibaba.com>, Sidraya Jayagond
<sidraya@...ux.ibm.com>, Wenjia Zhang <wenjia@...ux.ibm.com>,
Mahanta Jambigi <mjambigi@...ux.ibm.com>, Tony Lu
<tonylu@...ux.alibaba.com>, Wen Gu <guwen@...ux.alibaba.com>,
netdev@...r.kernel.org, linux-doc@...r.kernel.org,
linux-kernel@...r.kernel.org, linux-rdma@...r.kernel.org,
linux-s390@...r.kernel.org
Subject: Re: [PATCH net-next v3 1/2] net/smc: make wr buffer count
configurable
On 2025/9/26 18:30, Halil Pasic wrote:
> On Fri, 26 Sep 2025 12:12:49 +0200
> Halil Pasic <pasic@...ux.ibm.com> wrote:
>
>> On Fri, 26 Sep 2025 10:44:00 +0800
>> Guangguan Wang <guangguan.wang@...ux.alibaba.com> wrote:
>>
>>>
>>> Notice that the ratio of smcr_max_recv_wr to smcr_max_send_wr is set to 3:1, with the
>>> intention of ensuring that the peer QP's smcr_max_recv_wr is three times the local QP's
>>> smcr_max_send_wr and the local QP's smcr_max_recv_wr is three times the peer QP's
>>> smcr_max_send_wr, rather than making the local QP's smcr_max_recv_wr three times its own
>>> smcr_max_send_wr. The purpose of this design is to guarantee sufficient receive WRs on
>>> the side to receive incoming data when peer QP doing RDMA sends. Otherwise, RNR (Receiver
>>> Not Ready) may occur, leading to poor performance(RNR will drop the packet and retransmit
>>> happens in the transport layer of the RDMA).
>
> Sorry this was sent accidentally by the virtue of unintentionally
> pressing the shortcut for send while trying to actually edit!
>
>>
>> Thank you Guangguan! I think we already had that discussion.
>
> Please have a look at this thread
> https://lore.kernel.org/all/4c5347ff-779b-48d7-8234-2aac9992f487@linux.ibm.com/
>
> I'm aware of this, but I think this problem needs to be solved on
> a different level.
>
Oh, I see. Sorry for missing the previous discussion.
BTW, the RNR counter is exposed as a sysfs file such as
'/sys/class/infiniband/mlx5_0/ports/1/hw_counters/rnr_nak_retry_err'.
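To check whether RNR actually occurs during a test run, the counter can be sampled before and after. The following is only a minimal sketch: the helper name read_rnr_counter and its parameters are made up for illustration, the device name and port are the ones from the example path above, and it assumes the file holds a single decimal integer.

```python
from pathlib import Path


def read_rnr_counter(device="mlx5_0", port=1, root="/sys/class/infiniband"):
    """Return the current rnr_nak_retry_err value for one RDMA port.

    Hypothetical helper: device and port default to the example path
    quoted in this mail; adjust them for the local RDMA device.
    """
    path = (Path(root) / device / "ports" / str(port)
            / "hw_counters" / "rnr_nak_retry_err")
    # Assumption: the sysfs file contains one decimal integer and a newline.
    return int(path.read_text().strip())
```

Calling it once before and once after a workload and subtracting the two values gives the number of RNR NAK retry errors seen during that run.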
>>>
>>> Let us guess a scenario that have multiple hosts, and the multiple hosts have different
>>> smcr_max_send_wr and smcr_max_recv_wr configurations, mesh connections between these hosts.
>>> It is difficult to ensure that the smcr_max_recv_wr/smcr_max_send_wr is 3:1 on the connected
>>> QPs between these hosts, and it may even be hard to guarantee the smcr_max_recv_wr > smcr_max_send_wr
>>> on the connected QPs between these hosts.
>>
>>
>> It is not difficult IMHO. You just leave the knobs alone and you have
> [..]
>
> It is not difficult IMHO. You just leave the knobs alone and you have
> 3:1 per default. If tuning is attempted that needs to be done carefully.
> At least with SMC-R V2 there is this whole EID business, as well so it
> is reasonable to assume that the environment can be tuned in a coherent
> fashion. E.g. whoever is calling the EID could call use smcr_max_recv_wr:=32
> and smcr_max_send_wr:=96.
>
>>>
>>> Therefore, I believe that if these values are made configurable, additional mechanisms must be
>>> in place to prevent RNR from occurring. Otherwise we need to carefully configure smcr_max_recv_wr
>>> and smcr_max_send_wr, or ensure that all hosts capable of establishing SMC-R connections are configured
>>> smcr_max_recv_wr and smcr_max_send_wr with the same values.
>>
>
> I'm in favor of adding such mechanisms on top of this. Do you have
> something particular in mind? Unfortunately I'm not knowledgeable enough
> in the area to know what mechanisms you may mean. But I guess it is
> patches welcome as always! Currently I would encourage to users
> to tune carefully.
>
AFAIK, flow control is the usual approach, and credit-based flow control may well be enough. A credit
stands for one receive WR that has been posted on the receiver and is available for use. The receiver
gains a credit each time it calls post_recv, and advertises its accumulated credits to the connected
sender at some interval. The sender keeps track of the credits advertised by its peer. Each time the
sender posts a send WR that will consume a receive WR on the receiver, it consumes one credit, provided
a credit is available. Otherwise the sender should hold the WR back and wait for the peer to advertise
more credits.
But this requires support at the SMC-R protocol level, so it can be addressed as a separate enhancement.
I do not know whether someone from Dust Li's team or from IBM is interested in picking this up.
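To make the idea concrete, here is a rough user-space model of the credit accounting described above. It is only a sketch and not SMC-R code: the Receiver and Sender classes and all their methods are hypothetical, and how the credits would be carried on the wire is left out entirely.

```python
from collections import deque


class Receiver:
    """Counts posted receive WRs that have not been advertised yet."""

    def __init__(self):
        self.unadvertised = 0

    def post_recv(self, count=1):
        # Every posted receive WR is worth one credit.
        self.unadvertised += count

    def advertise(self):
        # Hand all accumulated credits to the peer and reset the count.
        credits, self.unadvertised = self.unadvertised, 0
        return credits


class Sender:
    """Posts a send WR only while it holds a credit from the peer."""

    def __init__(self):
        self.credits = 0
        self.pending = deque()  # WRs held back while out of credits
        self.sent = []          # WRs that were actually posted

    def add_credits(self, count):
        self.credits += count
        self._flush()

    def post_send(self, wr):
        self.pending.append(wr)
        self._flush()

    def _flush(self):
        # Post queued WRs in FIFO order, one credit per WR.
        while self.pending and self.credits > 0:
            self.credits -= 1
            self.sent.append(self.pending.popleft())


rx, tx = Receiver(), Sender()
rx.post_recv(2)
tx.add_credits(rx.advertise())
for wr in ("wr0", "wr1", "wr2"):
    tx.post_send(wr)            # wr2 is held back: only two credits
held_back = list(tx.pending)

rx.post_recv(1)                 # receiver reposts one receive WR
tx.add_credits(rx.advertise())  # wr2 goes out now
```

With this accounting the sender never has more send WRs outstanding than the receiver has posted receive WRs, which is the condition that avoids RNR regardless of how smcr_max_send_wr and smcr_max_recv_wr are configured on each side.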
Regards,
Guangguan Wang