Message-ID: <CAHS8izMyhMFA5DwBmHNJpEfPLE6xUmA453V+tF4pdWAenbrV3w@mail.gmail.com>
Date: Mon, 28 Jul 2025 11:54:12 -0700
From: Mina Almasry <almasrymina@...gle.com>
To: Pavel Begunkov <asml.silence@...il.com>
Cc: Jakub Kicinski <kuba@...nel.org>, netdev@...r.kernel.org, io-uring@...r.kernel.org, 
	Eric Dumazet <edumazet@...gle.com>, Willem de Bruijn <willemb@...gle.com>, 
	Paolo Abeni <pabeni@...hat.com>, andrew+netdev@...n.ch, horms@...nel.org, 
	davem@...emloft.net, sdf@...ichev.me, dw@...idwei.uk, 
	michael.chan@...adcom.com, dtatulea@...dia.com, ap420073@...il.com
Subject: Re: [RFC v1 00/22] Large rx buffer support for zcrx

On Mon, Jul 28, 2025 at 4:03 AM Pavel Begunkov <asml.silence@...il.com> wrote:
>
> This series implements large rx buffer support for io_uring/zcrx on
> top of Jakub's queue configuration changes, but it can also be used
> by other memory providers. Large rx buffers can be drastically
> beneficial with high-end hw-gro enabled cards that can coalesce traffic
> into larger pages, reducing the number of frags traversing the network
> stack and resulting in larger contiguous chunks of data for
> userspace. Benchmarks showed up to ~30% improvement in CPU util.
>

Very exciting.

I have not yet had a chance to look thoroughly, but even so I have a
few high-level questions/concerns. Maybe you already have answers to
them that can make my life a bit easier as I take a closer look.

- I'm a bit confused that you're not making changes to the core net
stack to support non-PAGE_SIZE netmems. From a quick glance, it seems
that there are potentially a ton of places in the net stack that
assume PAGE_SIZE:

cd net
ack "PAGE_SIZE|PAGE_SHIFT" | wc -l
468

Are we sure none of these places assuming PAGE_SIZE or PAGE_SHIFT are
concerning?

- You're not adding a field in the net_iov that tells us how big the
net_iov is. It seems to me you're configuring the driver to set the rx
buffer size, then assuming all the pp allocations are of that size,
then assuming in the zcrx code that all the net_iovs are of that size.
I think a few problems can arise:

(a) what happens if the rx buffer size is re-configured? Does the
io_uring zcrx instance get recreated as well?
(b) what happens with skb coalescing? skb coalescing is already a bit
of a mess. We don't allow coalescing unreadable and readable skbs, but
we do allow coalescing devmem and zcrx skbs, which I'm guessing could
already lead to some bugs. AFAICT, as of this patch series we may
allow coalescing of skbs containing netmems of different sizes, but so
far the zcrx code assumes the size is constant across all the netmems
it gets, which I'm not sure is always true?

For all these reasons I had assumed that we'd need space in the
net_iov that tells us its size: net_iov->size.

And then netmem_size(netmem) would replace all the PAGE_SIZE
assumptions in the net stack, and then we'd disallow coalescing of
skbs with different-sized netmems (else we need to handle them
correctly per the netmem_size).

-- 
Thanks,
Mina
