Message-ID: <20260209075338.GA61095@j66a10360.sqa.eu95>
Date: Mon, 9 Feb 2026 15:53:38 +0800
From: "D. Wythe" <alibuda@...ux.alibaba.com>
To: Mahanta Jambigi <mjambigi@...ux.ibm.com>
Cc: "D. Wythe" <alibuda@...ux.alibaba.com>,
	"David S. Miller" <davem@...emloft.net>,
	Dust Li <dust.li@...ux.alibaba.com>,
	Eric Dumazet <edumazet@...gle.com>,
	Jakub Kicinski <kuba@...nel.org>, Paolo Abeni <pabeni@...hat.com>,
	Sidraya Jayagond <sidraya@...ux.ibm.com>,
	Wenjia Zhang <wenjia@...ux.ibm.com>,
	Simon Horman <horms@...nel.org>, Tony Lu <tonylu@...ux.alibaba.com>,
	Wen Gu <guwen@...ux.alibaba.com>, linux-kernel@...r.kernel.org,
	linux-rdma@...r.kernel.org, linux-s390@...r.kernel.org,
	netdev@...r.kernel.org, oliver.yang@...ux.alibaba.com,
	pasic@...ux.ibm.com
Subject: Re: [PATCH RFC net-next] net/smc: transition to RDMA core CQ pooling

On Fri, Feb 06, 2026 at 04:58:23PM +0530, Mahanta Jambigi wrote:
> 
> 
> On 02/02/26 3:18 pm, D. Wythe wrote:
> > The current SMC-R implementation relies on global per-device CQs
> > and manual polling within tasklets, which introduces severe
> > scalability bottlenecks due to global lock contention and tasklet
> > scheduling overhead, resulting in poor performance as concurrency
> > increases.
> > 
> > Refactor the completion handling to utilize the ib_cqe API and
> > standard RDMA core CQ pooling. This transition provides several key
> > advantages:
> > 
> > 1. Multi-CQ: Shift from a single shared per-device CQ to multiple
> > link-specific CQs via the CQ pool. This allows completion processing
> > to be parallelized across multiple CPU cores, effectively eliminating
> > the global CQ bottleneck.
> > 
> > 2. Leverage DIM: Utilizing the standard CQ pool with IB_POLL_SOFTIRQ
> > enables Dynamic Interrupt Moderation from the RDMA core, optimizing
> > interrupt frequency and reducing CPU load under high pressure.
> > 
> > 3. O(1) Context Retrieval: Replaces the expensive wr_id based lookup
> > logic (e.g., smc_wr_tx_find_pending_index) with direct context retrieval
> > using container_of() on the embedded ib_cqe.
> > 
> > 4. Code Simplification: This refactoring results in a reduction of
> > ~150 lines of code. It removes redundant sequence tracking, complex lookup
> > helpers, and manual CQ management, significantly improving maintainability.
> > 
> > Performance Test: redis-benchmark with max 32 connections per QP
> > Data format: Requests Per Second (RPS), Percentage in brackets
> > represents the gain/loss compared to TCP.
> > 
> > | Clients | TCP      | SMC (original)      | SMC (cq_pool)       |
> > |---------|----------|---------------------|---------------------|
> > | c = 1   | 24449    | 31172  (+27%)       | 34039  (+39%)       |
> > | c = 2   | 46420    | 53216  (+14%)       | 64391  (+38%)       |
> > | c = 16  | 159673   | 83668  (-48%)  <--  | 216947 (+36%)       |
> > | c = 32  | 164956   | 97631  (-41%)  <--  | 249376 (+51%)       |
> > | c = 64  | 166322   | 118192 (-29%)  <--  | 249488 (+50%)       |
> > | c = 128 | 167700   | 121497 (-27%)  <--  | 249480 (+48%)       |
> > | c = 256 | 175021   | 146109 (-16%)  <--  | 240384 (+37%)       |
> > | c = 512 | 168987   | 101479 (-40%)  <--  | 226634 (+34%)       |
> > 
> > The results demonstrate that this optimization effectively resolves the
> > scalability bottleneck, with RPS increasing by over 110% at c=64
> > compared to the original implementation.
> 
> I applied your patch to the latest kernel(6.19-rc8) & saw below
> Performance results:
> 
> 1) In my evaluation, I ran several *uperf* based workloads using a
> request/response (RR) pattern, and I observed performance *degradation*
> ranging from *4%* to *59%*, depending on the specific read/write sizes
> used. For example, with a TCP RR workload using 50 parallel clients
> (nprocs=50) sending a 200‑byte request and reading a 1000‑byte response
> over a 60‑second run, I measured approximately 59% degradation compared
> to SMC‑R original performance.
>

The only setting I changed was net.smc.smcr_max_conns_per_lgr = 32; all
other parameters were left at their default values. redis-benchmark is
itself a classic request/response (RR) workload, so my results are at
odds with the degradation you measured. Since I am unable to reproduce
your results, it would be very helpful if you could share the specific
test configuration so that I can analyze it.
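
For anyone following the thread, the O(1) context retrieval from item 3
of the quoted description can be sketched in plain userspace C. This is
only an illustration, not code from the patch: the demo_* types and
functions are hypothetical stand-ins, and the real struct ib_cqe
callback in <rdma/ib_verbs.h> also takes a struct ib_cq * argument,
which is omitted here for brevity.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace stand-in for the kernel's container_of() macro. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct demo_wc;

/* Simplified stand-in for struct ib_cqe (hypothetical, reduced). */
struct demo_cqe {
	void (*done)(struct demo_wc *wc);
};

/* Simplified stand-in for struct ib_wc: the completion carries a
 * pointer to the cqe rather than an opaque wr_id to look up. */
struct demo_wc {
	struct demo_cqe *wr_cqe;
	int status;
};

/* Hypothetical per-send context with the cqe embedded in it. */
struct demo_tx_pend {
	uint32_t idx;
	int completed;
	struct demo_cqe cqe;
};

/* Completion handler: recover the owning context directly from the
 * embedded cqe, with no scan over a table of pending sends. */
void demo_tx_done(struct demo_wc *wc)
{
	struct demo_tx_pend *pend;

	assert(wc && wc->wr_cqe);
	pend = container_of(wc->wr_cqe, struct demo_tx_pend, cqe);
	pend->completed = 1;
}

/* Mimics the dispatch step of a CQ poller: call the cqe's callback. */
void demo_dispatch(struct demo_wc *wc)
{
	wc->wr_cqe->done(wc);
}
```

A caller would set pend->cqe.done = demo_tx_done when posting, point the
completion's wr_cqe at &pend->cqe, and let demo_dispatch() find the
context. The cost is independent of how many sends are outstanding,
unlike a search keyed on wr_id.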

Thanks,
D. Wythe
