Message-ID: <qhi7uuq52irirmviv3xex6h5tc4w4x6kcjwhqh735un3kpcx5x@2phgy3mnmg4p>
Date: Thu, 6 Nov 2025 17:25:43 +0000
From: Dragos Tatulea <dtatulea@...dia.com>
To: Mina Almasry <almasrymina@...gle.com>, 
	Jakub Kicinski <kuba@...nel.org>
Cc: netdev@...r.kernel.org, linux-kernel@...r.kernel.org, 
	Joshua Washington <joshwash@...gle.com>, Harshitha Ramamurthy <hramamurthy@...gle.com>, 
	Andrew Lunn <andrew+netdev@...n.ch>, "David S. Miller" <davem@...emloft.net>, 
	Eric Dumazet <edumazet@...gle.com>, Paolo Abeni <pabeni@...hat.com>, 
	Jesper Dangaard Brouer <hawk@...nel.org>, Ilias Apalodimas <ilias.apalodimas@...aro.org>, 
	Simon Horman <horms@...nel.org>, Willem de Bruijn <willemb@...gle.com>, ziweixiao@...gle.com, 
	Vedant Mathur <vedantmathur@...gle.com>
Subject: Re: [PATCH net v1 2/2] gve: use max allowed ring size for ZC
 page_pools

On Wed, Nov 05, 2025 at 06:56:46PM -0800, Mina Almasry wrote:
> On Wed, Nov 5, 2025 at 6:22 PM Jakub Kicinski <kuba@...nel.org> wrote:
> >
> > On Wed, 5 Nov 2025 17:56:10 -0800 Mina Almasry wrote:
> > > On Wed, Nov 5, 2025 at 5:11 PM Jakub Kicinski <kuba@...nel.org> wrote:
> > > > On Wed,  5 Nov 2025 20:07:58 +0000 Mina Almasry wrote:
> > > > > NCCL workloads with NCCL_P2P_PXN_LEVEL=2 or 1 are very slow with the
> > > > > current gve devmem tcp configuration.
> > > >
> > > > Hardcoding the ring size because some other attribute makes you think
> > > > that a specific application is running is rather unclean IMO..
> > >
> > > I did not see it this way tbh. I am thinking for devmem tcp to be as
> > > robust as possible to the burstiness of frag frees, we need a bit of a
> > > generous ring size. The specific application I'm referring to is just
> > > an example of how this could happen.
> > >
> > > I was thinking maybe binding->dma_buf->size / net_iov_size (so that
> > > the ring is large enough to hold every single netmem if need be) would
> > > be the upper bound, but in practice increasing to the current max
> > > allowed was good enough, so I'm trying that.
> >
> > Increasing cache sizes to the max seems very hacky at best.
> > The underlying implementation uses genpool and doesn't even
> > bother to do batching.
> >
> 
> OK, my bad. I tried to think through downsides of arbitrarily
> increasing the ring size in a ZC scenario where the underlying memory
> is pre-pinned and allocated anyway, and I couldn't think of any, but I
> won't argue the point any further.
> 
I see a similar issue with io_uring as well: with a 9K MTU and a 4K ring
size there are ~1% allocation errors during a simple zcrx test.

mlx5 calculates 16K pages and the io_uring zcrx buffer matches that size
exactly (16K * 4K). Increasing the buffer doesn't help because the pool
size stays at what the driver asked for (and the internal pool limit
caps it anyway). Even worse: eventually ENOSPC is returned to the
application. But maybe that error has a different fix.
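
Rough numbers for context (back-of-the-envelope only, assuming ~3
order-0 pages per 9K frame):

	16384 pages * 4096 B        = 64 MiB pre-pinned zcrx area
	4096 ring entries * 3 pages = ~12K pages in flight

which leaves little slack for bursty frag frees and lines up with the
~1% allocation failures above.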

Adapting the pool size to the io_uring buffer size works very well. The
allocation errors are gone and performance is improved.

AFAIU, a page_pool with underlying pre-allocated memory is not really a
cache, so it is useful to be able to adapt its size to the capacity
reserved by the application.
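
To make that concrete (a sketch only, the helper name and signature are
made up): a provider-backed pool could derive its size from the memory
the application pre-pinned rather than from the driver's ring/MTU
estimate, e.g.:

	/* Hypothetical helper, not existing code: size a provider-backed
	 * page_pool from the pre-pinned area instead of treating it as a
	 * cache sized by the driver's ring length.
	 */
	static u32 pp_size_from_provider(size_t pinned_bytes)
	{
		return pinned_bytes >> PAGE_SHIFT; /* one entry per page */
	}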

Maybe one could argue that the zcrx example from liburing could also be
improved. But one thing is certain: having the application align its
buffer size to the page_pool size that the driver calculates from ring
size and MTU is a hassle. If the application provides a large enough
buffer, things should "just work".

> > > > Do you want me to respin the per-ring config series? Or you can take it over.
> > > > IDK where the buffer size config is after recent discussion but IIUC
> > > > it will not drag in my config infra so it shouldn't conflict.
> > >
> > > You mean this one? "[RFC net-next 00/22] net: per-queue rx-buf-len
> > > configuration"
> > >
> > > I don't see the connection between rx-buf-len and the ring size,
> > > unless you're thinking about some netlink-configurable way to
> > > configure the pp->ring size?
> >
> > The latter. We usually have the opposite problem - drivers configure
> > the cache way too large for any practical production needs and waste
> > memory.
> >
> 
> Sounds good, does this sound like roughly what we're looking for here?
> I'm thinking configuring pp->ring size could be simpler than
> rx-buf-len because it's really all used by core, so maybe not
> something we need to bubble all the way down to the driver, so
> something like:
> 
> - We add a new field, netdev_rx_queue[->mp_params?]->pp_ring_size.
> - We add a netlink api to configure the above.
> - When a pp is being created, we check
> netdev_rx_queue[->mp_params]->pp_ring_size, if it's set, then it
> overrides the driver-provided value.
> 
And you would do the last item in page_pool_init() when mp_ops and
pp_ring_size are set?
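
Something like this is what I have in mind (hypothetical, using the
mp_params field you proposed above; how the pool gets at its rx queue
is hand-waved):

	/* In page_pool_init(), after the driver-provided pool_size is read: */
	if (pool->mp_ops && rxq && rxq->mp_params.pp_ring_size)
		ring_qsize = rxq->mp_params.pp_ring_size;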

> Does that make sense?
It does to me. 

I would add that in this case the internal page_pool limit should not
apply at all. Maybe you already thought about this, but I didn't see it
mentioned.
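
I.e. something along these lines (the limit name is made up, meaning
whatever the internal cap in page_pool_init() is called):

	/* Only enforce the internal cap for regular, cache-style pools. */
	if (!pool->mp_ops && ring_qsize > PP_RING_SIZE_LIMIT)
		return -E2BIG;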

> I don't immediately see why the driver needs to
> be told the pp_ring_size via the queue API, as it's really needed by
> the pp anyway, no? Or am I missing something?
It doesn't. The only corner case to take care of is when the pool_size
is below what the driver asked for.
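
Probably just a clamp, e.g. (field names as above, hypothetical):

	/* Never go below what the driver itself asked for. */
	ring_qsize = max_t(u32, ring_qsize, pool->p.pool_size);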

Thanks,
Dragos
