Message-ID: <phuigc2himixvyaxydukgupqy2oxpobj6qo4m4hb6vsr5qenfd@7q4ct2c5gjdq>
Date: Thu, 29 May 2025 11:11:44 +0000
From: Dragos Tatulea <dtatulea@...dia.com>
To: Stanislav Fomichev <stfomichev@...il.com>, 
	Mina Almasry <almasrymina@...gle.com>
Cc: Tariq Toukan <tariqt@...dia.com>, 
	"David S. Miller" <davem@...emloft.net>, Jakub Kicinski <kuba@...nel.org>, 
	Paolo Abeni <pabeni@...hat.com>, Eric Dumazet <edumazet@...gle.com>, 
	Andrew Lunn <andrew+netdev@...n.ch>, Saeed Mahameed <saeedm@...dia.com>, 
	Leon Romanovsky <leon@...nel.org>, Richard Cochran <richardcochran@...il.com>, 
	Alexei Starovoitov <ast@...nel.org>, Daniel Borkmann <daniel@...earbox.net>, 
	Jesper Dangaard Brouer <hawk@...nel.org>, John Fastabend <john.fastabend@...il.com>, 
	netdev@...r.kernel.org, linux-rdma@...r.kernel.org, linux-kernel@...r.kernel.org, 
	bpf@...r.kernel.org, Moshe Shemesh <moshe@...dia.com>, Mark Bloch <mbloch@...dia.com>, 
	Gal Pressman <gal@...dia.com>, Cosmin Ratiu <cratiu@...dia.com>
Subject: Re: [PATCH net-next V2 00/11] net/mlx5e: Add support for devmem and
 io_uring TCP zero-copy

On Wed, May 28, 2025 at 04:04:18PM -0700, Stanislav Fomichev wrote:
> On 05/28, Mina Almasry wrote:
> > On Wed, May 28, 2025 at 8:45 AM Stanislav Fomichev <stfomichev@...il.com> wrote:
> > >
> > > On 05/28, Dragos Tatulea wrote:
> > > > On Tue, May 27, 2025 at 09:05:49AM -0700, Stanislav Fomichev wrote:
> > > > > On 05/23, Tariq Toukan wrote:
> > > > > > This series from the team adds support for TCP zero-copy RX with devmem
> > > > > > and io_uring for ConnectX7 NICs and above. For performance reasons and
> > > > > > simplicity, HW-GRO will also be turned on when header-data split mode is
> > > > > > on.
> > > > > >
> > > > > > Find more details below.
> > > > > >
> > > > > > Regards,
> > > > > > Tariq
> > > > > >
> > > > > > Performance
> > > > > > ===========
> > > > > >
> > > > > > Test setup:
> > > > > >
> > > > > > * CPU: Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz (single NUMA)
> > > > > > * NIC: ConnectX7
> > > > > > * Benchmarking tool: kperf [1]
> > > > > > * Single TCP flow
> > > > > > * Test duration: 60s
> > > > > >
> > > > > > With application thread and interrupts pinned to the *same* core:
> > > > > >
> > > > > > |------+-----------+----------|
> > > > > > | MTU  | epoll     | io_uring |
> > > > > > |------+-----------+----------|
> > > > > > | 1500 | 61.6 Gbps | 114 Gbps |
> > > > > > | 4096 | 69.3 Gbps | 151 Gbps |
> > > > > > | 9000 | 67.8 Gbps | 187 Gbps |
> > > > > > |------+-----------+----------|
> > > > > >
> > > > > > The CPU usage for io_uring is 95%.
> > > > > >
> > > > > > Reproduction steps for io_uring:
> > > > > >
> > > > > > server --no-daemon -a 2001:db8::1 --no-memcmp --iou --iou_sendzc \
> > > > > >         --iou_zcrx --iou_dev_name eth2 --iou_zcrx_queue_id 2
> > > > > >
> > > > > > server --no-daemon -a 2001:db8::2 --no-memcmp --iou --iou_sendzc
> > > > > >
> > > > > > client --src 2001:db8::2 --dst 2001:db8::1 \
> > > > > >         --msg-zerocopy -t 60 --cpu-min=2 --cpu-max=2
> > > > > >
> > > > > > Patch overview
> > > > > > ==============
> > > > > >
> > > > > > First, a netmem API for skb_can_coalesce is added to the core to be able
> > > > > > to do skb fragment coalescing on netmems.
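
For illustration, such a check could be modeled on the existing
skb_can_coalesce(), comparing netmem_ref handles instead of struct page
pointers. This is only a sketch of the idea (hypothetical name, using the
skb_frag_netmem() accessor); the actual helper added by the series may
differ:

#include <linux/skbuff.h>
#include <net/netmem.h>

static inline bool coalesce_check_sketch(struct sk_buff *skb, int i,
					 netmem_ref netmem, int off)
{
	if (i) {
		const skb_frag_t *frag = &skb_shinfo(skb)->frags[i - 1];

		/* Coalesce only if the new chunk continues the last frag. */
		return netmem == skb_frag_netmem(frag) &&
		       off == skb_frag_off(frag) + skb_frag_size(frag);
	}
	return false;
}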
> > > > > >
> > > > > > The next patches introduce some cleanups in the internal SHAMPO code and
> > > > > > improvements to the HW-GRO capability checks in FW.
> > > > > >
> > > > > > A separate page_pool is introduced for headers. Ethtool stats are added
> > > > > > as well.
> > > > > >
> > > > > > Then the driver is converted to use the netmem API and to allow support
> > > > > > for an unreadable netmem page pool.
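
As a rough sketch of what allowing unreadable netmem means at page_pool
creation time (flag and field names are from the core page_pool API as I
understand it; the values and the surrounding driver code are made up, not
mlx5e's):

#include <net/page_pool/types.h>
#include <net/page_pool/helpers.h>

static struct page_pool *rx_pool_create_sketch(struct net_device *netdev,
					       struct device *dma_dev,
					       int queue_idx,
					       unsigned int pool_size)
{
	struct page_pool_params pp = {
		.order		= 0,
		.pool_size	= pool_size,
		.nid		= NUMA_NO_NODE,
		.dev		= dma_dev,
		.dma_dir	= DMA_FROM_DEVICE,
		/* Allow a devmem/io_uring memory provider bound to this RX
		 * queue to hand out unreadable net_iov-backed netmem.
		 */
		.flags		= PP_FLAG_DMA_MAP | PP_FLAG_ALLOW_UNREADABLE_NETMEM,
		.netdev		= netdev,
		.queue_idx	= queue_idx,
	};

	return page_pool_create(&pp);
}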
> > > > > >
> > > > > > The queue management ops are implemented.
> > > > > >
> > > > > > Finally, the tcp-data-split ring parameter is exposed.
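
That is the knob toggled with "ethtool -G <dev> tcp-data-split on". On the
driver side, reporting it boils down to filling the kernel ring params; a
minimal sketch (the real mlx5e callback and its private state look
different):

#include <linux/ethtool.h>
#include <linux/netdevice.h>

static void get_ringparam_sketch(struct net_device *dev,
				 struct ethtool_ringparam *param,
				 struct kernel_ethtool_ringparam *kparam,
				 struct netlink_ext_ack *extack)
{
	/* In a real driver this would come from the device's HDS state. */
	bool hds_enabled = true;

	kparam->tcp_data_split = hds_enabled ?
				 ETHTOOL_TCP_DATA_SPLIT_ENABLED :
				 ETHTOOL_TCP_DATA_SPLIT_DISABLED;
}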
> > > > > >
> > > > > > Changelog
> > > > > > =========
> > > > > >
> > > > > > Changes from v1 [0]:
> > > > > > - Added support for skb_can_coalesce_netmem().
> > > > > > - Avoid netmem_to_page() casts in the driver.
> > > > > > - Fixed code to abide by the 80 char limit, with some exceptions to avoid
> > > > > >   code churn.
> > > > >
> > > > > Since there are going to be 2-3 weeks of closed net-next, can you
> > > > > also add a patch for the TX side? It should be trivial (skip DMA unmap
> > > > > for niovs in TX completions plus netdev->netmem_tx=1).
> > > > >
> > > > Seems indeed trivial. We will add it.
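
For reference, the suggested TX-side change would roughly amount to the
following in the completion path. This is only an illustration (the
structure is made up, not mlx5e's), using the skb_frag_is_net_iov() helper
from skbuff.h:

#include <linux/skbuff.h>
#include <linux/dma-mapping.h>

static void tx_unmap_frag_sketch(struct device *dev, const skb_frag_t *frag,
				 dma_addr_t dma, u32 len)
{
	/* Frags backed by net_iov belong to the dma-buf binding, which owns
	 * the DMA mapping, so there is nothing to unmap here.
	 */
	if (skb_frag_is_net_iov(frag))
		return;

	dma_unmap_page(dev, dma, len, DMA_TO_DEVICE);
}

Opting in is then just setting netdev->netmem_tx at probe time, as
mentioned above.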
> > > >
> > > > > And, btw, what about the issue that Cosmin raised in [0]? Is it addressed
> > > > > in this series?
> > > > >
> > > > > 0: https://lore.kernel.org/netdev/9322c3c4826ed1072ddc9a2103cc641060665864.camel@nvidia.com/
> > > > We wanted to fix this afterwards, as it requires changing a more subtle
> > > > part of the code that replenishes pages. This needs more thinking and
> > > > testing.
> > >
> > > Thanks! For my understanding: does the issue occur only during the initial
> > > queue refill? Or will the same problem happen any time there is a burst
> > > of traffic that might exhaust all RX descriptors?
> > >
> > 
> > Minor: a burst in traffic likely won't reproduce this case; I'm sure
> > mlx5 can drive the hardware to line rate consistently. It's more that,
> > if the machine is under extreme memory pressure, I think
> > page_pool_alloc_pages() and friends may return -ENOMEM, which reproduces
> > the same edge case as the dma-buf being extremely small, which also
> > makes page_pool_alloc_netmems() return -ENOMEM.
> 
> What I want to understand is whether the kernel/driver will oops when the dmabuf
> runs out of buffers after initial setup. Either a traffic burst and/or userspace
> being slow on refill - it doesn't matter.
There is no oops, but the queue can't handle more traffic: it can't
allocate new buffers, and it can't release the old ones either.

AFAIU from Cosmin, the condition happened during the initial queue fill,
when there are no buffers to be released for the current WQE.
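
To illustrate the kind of handling the replenish path needs (purely a
sketch with a made-up structure, not the actual mlx5e WQE refill code, and
assuming the current page_pool_alloc_netmems() API): when the pool is backed
by a tiny dma-buf, or the host is under memory pressure, allocation can fail
and the loop has to post a partial fill and retry later instead of stalling
the queue for good:

#include <linux/gfp.h>
#include <net/page_pool/helpers.h>

static int rq_refill_sketch(struct page_pool *pool, netmem_ref *slots,
			    int need)
{
	int i;

	for (i = 0; i < need; i++) {
		/* Returns a null netmem_ref when the provider (or the host)
		 * is out of buffers.
		 */
		netmem_ref netmem = page_pool_alloc_netmems(pool, GFP_ATOMIC);

		if (!netmem)
			break;
		slots[i] = netmem;
	}

	/* The caller posts the i buffers it got and reschedules the refill
	 * for the remainder instead of waiting forever for a full batch.
	 */
	return i;
}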

Thanks,
Dragos
