linux-kernel - Re: [PATCH net-next v3 1/2] net/smc: make wr buffer count configurable

Message-ID: <20250926123028.2130fa49.pasic@linux.ibm.com>
Date: Fri, 26 Sep 2025 12:30:28 +0200
From: Halil Pasic <pasic@...ux.ibm.com>
To: Guangguan Wang <guangguan.wang@...ux.alibaba.com>
Cc: Jakub Kicinski <kuba@...nel.org>, Paolo Abeni <pabeni@...hat.com>,
        Simon Horman <horms@...nel.org>,
        "D. Wythe" <alibuda@...ux.alibaba.com>,
        Dust Li <dust.li@...ux.alibaba.com>,
        Sidraya Jayagond <sidraya@...ux.ibm.com>,
        Wenjia Zhang <wenjia@...ux.ibm.com>,
        Mahanta Jambigi <mjambigi@...ux.ibm.com>,
        Tony Lu <tonylu@...ux.alibaba.com>, Wen Gu <guwen@...ux.alibaba.com>,
        netdev@...r.kernel.org, linux-doc@...r.kernel.org,
        linux-kernel@...r.kernel.org, linux-rdma@...r.kernel.org,
        linux-s390@...r.kernel.org, Halil Pasic <pasic@...ux.ibm.com>
Subject: Re: [PATCH net-next v3 1/2] net/smc: make wr buffer count configurable

On Fri, 26 Sep 2025 12:12:49 +0200
Halil Pasic <pasic@...ux.ibm.com> wrote:

> On Fri, 26 Sep 2025 10:44:00 +0800
> Guangguan Wang <guangguan.wang@...ux.alibaba.com> wrote:
> 
> > > +
> > > +smcr_max_send_wr - INTEGER
> > > +	So called work request buffers are SMCR link (and RDMA queue pair) level
> > > +	resources necessary for performing RDMA operations. Since up to 255
> > > +	connections can share a link group and thus also a link and the number
> > > +	of the work request buffers is decided when the link is allocated,
> > > +	depending on the workload it can a bottleneck in a sense that threads
> > > +	have to wait for work request buffers to become available. Before the
> > > +	introduction of this control the maximal number of work request buffers
> > > +	available on the send path used to be hard coded to 16. With this control
> > > +	it becomes configurable. The acceptable range is between 2 and 2048.
> > > +
> > > +	Please be aware that all the buffers need to be allocated as a physically
> > > +	continuous array in which each element is a single buffer and has the size
> > > +	of SMC_WR_BUF_SIZE (48) bytes. If the allocation fails we give up much
> > > +	like before having this control.
> > > +
> > > +	Default: 16
> > > +
> > > +smcr_max_recv_wr - INTEGER
> > > +	So called work request buffers are SMCR link (and RDMA queue pair) level
> > > +	resources necessary for performing RDMA operations. Since up to 255
> > > +	connections can share a link group and thus also a link and the number
> > > +	of the work request buffers is decided when the link is allocated,
> > > +	depending on the workload it can a bottleneck in a sense that threads
> > > +	have to wait for work request buffers to become available. Before the
> > > +	introduction of this control the maximal number of work request buffers
> > > +	available on the receive path used to be hard coded to 16. With this control
> > > +	it becomes configurable. The acceptable range is between 2 and 2048.
> > > +
> > > +	Please be aware that all the buffers need to be allocated as a physically
> > > +	continuous array in which each element is a single buffer and has the size
> > > +	of SMC_WR_BUF_SIZE (48) bytes. If the allocation fails we give up much
> > > +	like before having this control.
> > > +
> > > +	Default: 48    
> > 
> > Notice that the ratio of smcr_max_recv_wr to smcr_max_send_wr is set to 3:1, with the
> > intention of ensuring that the peer QP's smcr_max_recv_wr is three times the local QP's
> > smcr_max_send_wr and the local QP's smcr_max_recv_wr is three times the peer QP's
> > smcr_max_send_wr, rather than making the local QP's smcr_max_recv_wr three times its own
> > smcr_max_send_wr. The purpose of this design is to guarantee sufficient receive WRs on
> > the side to receive incoming data when peer QP doing RDMA sends. Otherwise, RNR (Receiver
> > Not Ready) may occur, leading to poor performance(RNR will drop the packet and retransmit
> > happens in the transport layer of the RDMA).  

Sorry, this was sent accidentally: I unintentionally pressed the
send shortcut while I was still editing!

> 
> Thank you Guangguan! I think we already had that discussion. 

Please have a look at this thread
https://lore.kernel.org/all/4c5347ff-779b-48d7-8234-2aac9992f487@linux.ibm.com/

I'm aware of this, but I think this problem needs to be solved on
a different level.

> > 
> > Let us guess a scenario that have multiple hosts, and the multiple hosts have different
> > smcr_max_send_wr and smcr_max_recv_wr configurations, mesh connections between these hosts.
> > It is difficult to ensure that the smcr_max_recv_wr/smcr_max_send_wr is 3:1 on the connected
> > QPs between these hosts, and it may even be hard to guarantee the smcr_max_recv_wr > smcr_max_send_wr
> > on the connected QPs between these hosts.  
> 
> 
> It is not difficult IMHO. You just leave the knobs alone and you have
[..]

It is not difficult IMHO. You just leave the knobs alone and you get
3:1 by default. If tuning is attempted, it needs to be done carefully.
At least with SMC-R V2 there is this whole EID business as well, so it
is reasonable to assume that the environment can be tuned in a coherent
fashion. E.g. whoever defines the EID could also call for
smcr_max_send_wr:=32 and smcr_max_recv_wr:=96.
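
To make the coherence concern concrete, here is a minimal sketch of the
cross-host check implied by the 3:1 discussion above: for each ordered
pair of hosts, the receiver's smcr_max_recv_wr is compared against three
times the sender's smcr_max_send_wr. The host names, the example values
and the helper undersized_pairs() are made up for illustration; this is
not something the kernel or the patch does.

```python
from itertools import permutations

# Hypothetical per-host settings; keys mirror the sysctl names from the patch.
hosts = {
    "a": {"smcr_max_send_wr": 32, "smcr_max_recv_wr": 96},  # tuned, 3:1 kept
    "b": {"smcr_max_send_wr": 32, "smcr_max_recv_wr": 96},  # tuned the same way
    "c": {"smcr_max_send_wr": 16, "smcr_max_recv_wr": 48},  # left at defaults
}

RATIO = 3  # recv WRs on the peer per send WR locally, as discussed in the thread


def undersized_pairs(hosts, ratio=RATIO):
    """Return (sender, receiver) pairs whose receiver has fewer than
    ratio * sender's send WRs available as receive WRs."""
    bad = []
    for sender, receiver in permutations(sorted(hosts), 2):
        send_wr = hosts[sender]["smcr_max_send_wr"]
        recv_wr = hosts[receiver]["smcr_max_recv_wr"]
        if recv_wr < ratio * send_wr:
            bad.append((sender, receiver))
    return bad


print(undersized_pairs(hosts))
```

With these made-up values only the pairs sending to the untuned host
"c" are flagged, which is the case for tuning all hosts of an
environment together rather than one at a time.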

> > 
> > Therefore, I believe that if these values are made configurable, additional mechanisms must be
> > in place to prevent RNR from occurring. Otherwise we need to carefully configure smcr_max_recv_wr
> > and smcr_max_send_wr, or ensure that all hosts capable of establishing SMC-R connections are configured
> > smcr_max_recv_wr and smcr_max_send_wr with the same values.  
> 

I'm in favor of adding such mechanisms on top of this. Do you have
something particular in mind? Unfortunately I'm not knowledgeable enough
in the area to know what mechanisms you may mean. But I guess patches
are welcome, as always! For now I would encourage users to tune
carefully.
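
One thing to keep in mind while tuning is the size of the physically
contiguous array mentioned in the documentation text quoted above. A
rough sketch of the arithmetic follows; the 48 bytes per buffer and the
2..2048 range come from that text, while the 4 KiB page size, the
power-of-two rounding and the helper contiguous_footprint() are
assumptions made for illustration, not a description of the actual
allocation path.

```python
SMC_WR_BUF_SIZE = 48  # bytes per buffer, per the quoted documentation
PAGE_SIZE = 4096      # assumption: 4 KiB pages


def contiguous_footprint(wr_count, page_size=PAGE_SIZE):
    """Return (bytes, pages, order) for an array of wr_count buffers.

    'order' is the smallest n such that 2**n pages hold the array,
    assuming a power-of-two page allocation (an assumption).
    """
    if not 2 <= wr_count <= 2048:
        raise ValueError("wr_count outside the documented range 2..2048")
    size = wr_count * SMC_WR_BUF_SIZE
    pages = -(-size // page_size)       # ceiling division
    order = (pages - 1).bit_length()    # 0 for a single page
    return size, pages, order


for count in (16, 48, 2048):
    print(count, contiguous_footprint(count))
```

Under these assumptions the defaults fit in a single page, while the
maximum of 2048 buffers needs 98304 bytes, i.e. 24 pages.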

Sorry about that half-baked answer before.

Regards,
Halil