Message-ID: <2aced457-5f1e-4c1a-b5ea-035240f73aaf@linux.alibaba.com>
Date: Thu, 25 Sep 2025 11:48:46 +0800
From: Guangguan Wang <guangguan.wang@...ux.alibaba.com>
To: Halil Pasic <pasic@...ux.ibm.com>
Cc: Dust Li <dust.li@...ux.alibaba.com>, Jakub Kicinski <kuba@...nel.org>,
 Paolo Abeni <pabeni@...hat.com>, Simon Horman <horms@...nel.org>,
 "D. Wythe" <alibuda@...ux.alibaba.com>,
 Sidraya Jayagond <sidraya@...ux.ibm.com>, Wenjia Zhang
 <wenjia@...ux.ibm.com>, Mahanta Jambigi <mjambigi@...ux.ibm.com>,
 Tony Lu <tonylu@...ux.alibaba.com>, Wen Gu <guwen@...ux.alibaba.com>,
 netdev@...r.kernel.org, linux-doc@...r.kernel.org,
 linux-kernel@...r.kernel.org, linux-rdma@...r.kernel.org,
 linux-s390@...r.kernel.org
Subject: Re: [PATCH net-next v2 1/2] net/smc: make wr buffer count
 configurable



On 2025/9/24 17:50, Halil Pasic wrote:
> On Wed, 24 Sep 2025 11:13:05 +0800
> Guangguan Wang <guangguan.wang@...ux.alibaba.com> wrote:
> 
>> On 2025/9/19 22:55, Halil Pasic wrote:
>>> On Tue, 9 Sep 2025 12:18:50 +0200
>>> Halil Pasic <pasic@...ux.ibm.com> wrote:
>>>
>>>
>>> Can maybe Wen Gu and  Guangguan Wang chime in. From what I read
>>> link->wr_rx_buflen can be either SMC_WR_BUF_SIZE that is 48 in which
>>> case it does not matter, or SMC_WR_BUF_V2_SIZE that is 8192, if
>>> !smc_link_shared_v2_rxbuf(lnk) i.e. max_recv_sge == 1. So we talk
>>> about roughly a factor of 170 here. For a large pref_recv_wr the
>>> back-off logic is still there to save us, but I really would not say that
>>> this is how this is intended to work.
>>>   
>>
>> Hi Halil,
>>
>> I think the root cause of the problem this patchset tries to solve is a mismatch
>> between SMC_WR_BUF_CNT and the max_conns per lgr (whose value is 255). Furthermore,
>> I believe that 255 is not an optimal value for max_conns per lgr: too few
>> connections waste memory, and too many connections lead to I/O queuing within a
>> single QP (every WR posted via post_send to a single QP is initiated and completed in sequence).
>>
>> We actually identified this problem long ago. In the Alibaba Cloud Linux distribution, we have
>> changed SMC_WR_BUF_CNT to 64 and reduced max_conns per lgr to 32 (for SMC-R V2.1). This
>> configuration has worked well under various workloads for a long time.
>>
>> SMC-R V2.1 already supports negotiation of the max_conns per lgr. Simply changing the value of
>> the macro SMC_CONN_PER_LGR_PREFER influences the negotiation result. But SMC-R V1.0 and SMC-R
>> V2.0 do not support negotiation of the max_conns per lgr.
>> I think it is better to reduce SMC_CONN_PER_LGR_PREFER for SMC-R V2.1. But for SMC-R V1.0 and
>> SMC-R V2.0, I do not have a good idea.
>>
> 
> I agree, the number of WR buffers and the max number of connections per
> lgr can and should be tuned in concert.
> 
>>> Maybe not supporting V2 on devices with max_recv_sge == 1 is a better choice,
>>> assuming that a maximal V2 LLC msg needs to fit each and every receive
>>> WR buffer. Which seems to be the case based on 27ef6a9981fe ("net/smc:
>>> support SMC-R V2 for rdma devices with max_recv_sge equals to 1").
>>>  
>>
>> For an RDMA device whose max_recv_sge is 1, as mentioned in the commit log of the related patch,
>> it is better to support it than to fall back with SMC_CLC_DECL_INTERR, as the SMC_CLC_DECL_INTERR
>> fallback is not a fast fallback and may heavily reduce the efficiency of the connection setup
>> on both the server and the client side.
> 
> I mean another possible mitigation of the problem could be the following:
> if there is a device in the mix with max_recv_sge < 2, then don't propose/
> accept SMC-R V2.
> 
> Do you know how prevalent and relevant are max_recv_sge < 2 RDMA
> devices, and how likely is it that somebody would like to use SMC-R with
> such devices?
> 

eRDMA in Alibaba Cloud has max_recv_sge < 2, and it is the RDMA device we are primarily focusing on.
eRDMA is preferably used with SMC-R V2.1. Would it be possible to keep such devices supported in SMC-R V2.1 but not in V2.0?
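
To make the sizes discussed above concrete, here is a rough sketch of the arithmetic only. The two buffer sizes (48 and 8192) and the buffer counts (64 and 256) are the numbers quoted in this thread; the macro names WR_BUF_SIZE/WR_BUF_V2_SIZE and the helpers wr_rx_footprint() and wr_rx_size_factor() are made up for illustration and are not kernel code. It also assumes the buffer count translates one-to-one into receive buffers, which may not match the actual implementation.

```c
#include <stddef.h>

/* Sizes as quoted in this thread, not read from kernel headers. */
#define WR_BUF_SIZE    48u    /* value quoted for SMC_WR_BUF_SIZE */
#define WR_BUF_V2_SIZE 8192u  /* value quoted for SMC_WR_BUF_V2_SIZE (max_recv_sge == 1 case) */

/*
 * Hypothetical helper: bytes needed for `cnt` receive WR buffers of
 * `buflen` bytes each, assuming a single array of fixed-size slots.
 */
static size_t wr_rx_footprint(size_t cnt, size_t buflen)
{
	return cnt * buflen;
}

/* Hypothetical helper: how many times larger a V2-sized slot is (integer division). */
static unsigned int wr_rx_size_factor(void)
{
	return WR_BUF_V2_SIZE / WR_BUF_SIZE;
}
```

With these numbers, 64 buffers need 3072 bytes at the small size but 524288 bytes at the V2 size, and 256 buffers at the V2 size need 2097152 bytes (2 MiB) in one allocation.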

>>
>>  
>>> For me the best course of action seems to be to send a V3 using
>>> link->wr_rx_buflen. I'm really not that knowledgeable about RDMA or
>>> the SMC-R protocol, but I'm happy to be part of the discussion on this
>>> matter.
>>>
>>> Regards,
>>> Halil  
>>
>> And a tiny suggestion for the risk you mentioned in commit log
>> ("Addressing this by simply bumping SMC_WR_BUF_CNT to 256 was deemed
>> risky, because the large-ish physically continuous allocation could fail
>> and lead to TCP fall-backs."). Non-physically contiguous allocation (vmalloc/vzalloc, etc.)
>> could also be supported for wr buffers. The SMC-R snd_buf and rmb already support non-physically
>> contiguous memory when sysctl_smcr_buf_type is set to SMCR_VIRT_CONT_BUFS or SMCR_MIXED_BUFS.
>> They can serve as an example of using non-physically contiguous memory.
>>
> 
> I think we can put this on the list of possible enhancements. I would
> prefer not to add this to the scope of this series. But I would be happy to
> see this happen. I don't know if somebody from Alibaba, or maybe
> Mahanta or Sid, would like to pick this up as an enhancement on top.
> 
> Thank you very much for your comments!
> 
> Regards,
> Halil 

