Message-ID: <20251106171833.72fe18a9@kernel.org>
Date: Thu, 6 Nov 2025 17:18:33 -0800
From: Jakub Kicinski <kuba@...nel.org>
To: Dragos Tatulea <dtatulea@...dia.com>
Cc: Mina Almasry <almasrymina@...gle.com>, netdev@...r.kernel.org,
 linux-kernel@...r.kernel.org, Joshua Washington <joshwash@...gle.com>,
 Harshitha Ramamurthy <hramamurthy@...gle.com>, Andrew Lunn
 <andrew+netdev@...n.ch>, "David S. Miller" <davem@...emloft.net>, Eric
 Dumazet <edumazet@...gle.com>, Paolo Abeni <pabeni@...hat.com>, Jesper
 Dangaard Brouer <hawk@...nel.org>, Ilias Apalodimas
 <ilias.apalodimas@...aro.org>, Simon Horman <horms@...nel.org>, Willem de
 Bruijn <willemb@...gle.com>, ziweixiao@...gle.com, Vedant Mathur
 <vedantmathur@...gle.com>
Subject: Re: [PATCH net v1 2/2] gve: use max allowed ring size for ZC
 page_pools

On Thu, 6 Nov 2025 17:25:43 +0000 Dragos Tatulea wrote:
> On Wed, Nov 05, 2025 at 06:56:46PM -0800, Mina Almasry wrote:
> > On Wed, Nov 5, 2025 at 6:22 PM Jakub Kicinski <kuba@...nel.org> wrote:  
> > > Increasing cache sizes to the max seems very hacky at best.
> > > The underlying implementation uses genpool and doesn't even
> > > bother to do batching.
> > 
> > OK, my bad. I tried to think through downsides of arbitrarily
> > increasing the ring size in a ZC scenario where the underlying memory
> > is pre-pinned and allocated anyway, and I couldn't think of any, but I
> > won't argue the point any further.
> >   
> I see a similar issue with io_uring as well: for a 9K MTU with 4K ring
> size there are ~1% allocation errors during a simple zcrx test.
> 
> mlx5 calculates 16K pages and the io_uring zcrx buffer matches that
> size exactly (16K * 4K). Increasing the buffer doesn't help because the
> pool size is still what the driver asked for (plus the internal pool
> limit). Even worse: eventually ENOSPC is returned to the application.
> But maybe this error has a different fix.

Hm, yes, did you trace it all the way to where it comes from?
The page pool itself does not return ENOSPC anywhere AFAICT. If the cache
is full we free the page back to the provider via .release_netmem.
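
For reference, the return path I have in mind looks roughly like this
(simplified sketch, not verbatim net/core/page_pool.c; the ptr_ring and
.release_netmem pieces are the real ones, the helper name is made up):

#include <linux/ptr_ring.h>
#include <net/page_pool/types.h>
#include <net/page_pool/memory_provider.h>

/* Sketch: returning a ZC netmem to a pool backed by a memory provider. */
static void pp_return_netmem_sketch(struct page_pool *pool, netmem_ref netmem)
{
	/* Try to recycle into the pool's ptr_ring first. */
	if (!ptr_ring_produce_bh(&pool->ring, (__force void *)netmem))
		return;

	/* Ring full: hand the buffer back to the memory provider
	 * (io_uring zcrx, devmem, ...) instead of reporting an error,
	 * so other page pools attached to the same provider can pick
	 * it up again.
	 */
	if (pool->mp_ops)
		pool->mp_ops->release_netmem(pool, netmem);
}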

> Adapting the pool size to the io_uring buffer size works very well. The
> allocation errors are gone and performance is improved.
> 
> AFAIU, a page_pool with underlying pre-allocated memory is not really a
> cache. So it is useful to be able to adapt to the capacity reserved by
> the application.
> 
> Maybe one could argue that the zcrx example from liburing could also be
> improved. But one thing is sure: aligning the buffer size to the
> page_pool size calculated by the driver based on ring size and MTU
> is a hassle. If the application provides a large enough buffer, things
> should "just work".

Yes, there should be no ENOSPC. I think io_uring is more thorough
in handling the corner cases, so what you're describing is more of
a concern.

Keep in mind that we expect multiple page pools from one provider.
We want the pages to flow back to the MP level so other PPs can grab
them.

> > > The latter. We usually have the opposite problem - drivers configure
> > > the cache way too large for any practical production needs and waste
> > > memory.
> > 
> > Sounds good, does this sound roughly like what we're looking for here?
> > I'm thinking configuring pp->ring size could be simpler than
> > rx-buf-len because it's really all used by the core, so maybe not
> > something we need to bubble all the way down to the driver. Something
> > like:
> > 
> > - We add a new field, netdev_rx_queue[->mp_params?]->pp_ring_size.

My series was extending dev->cfg / dev->cfg_pending to also have
per-queue params. So the user config goes there, not in the queue
struct.
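
Purely to illustrate the shape (made-up names, not the actual structs
from that series):

#include <linux/types.h>

/* Hypothetical sketch: per-queue user config hanging off the device-level
 * config rather than off struct netdev_rx_queue.
 */
struct netdev_queue_config_sketch {
	u32	pp_ring_size;	/* 0 == unset, keep the driver's value */
};

struct netdev_config_sketch {
	/* ... existing device-wide settings ... */
	struct netdev_queue_config_sketch	*queues; /* one per RX queue */
};

The netlink set path would presumably write into the pending copy
(dev->cfg_pending), with the value only consumed when the page pool is
(re)created.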

> > - We add a netlink api to configure the above.
> > - When a pp is being created, we check
> > netdev_rx_queue[->mp_params]->pp_ring_size, if it's set, then it
> > overrides the driver-provided value.
> 
> And you would do the last item in page_pool_init() when mp_ops and
> pp_ring_size are set?

Yes.
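
Something along these lines (untested sketch; where exactly the user
value is read from depends on the config series, so treat the lookup
helper as hypothetical):

#include <linux/netdevice.h>
#include <net/page_pool/types.h>

/* Hypothetical helper: would return the per-queue value set via the new
 * netlink API (e.g. from dev->cfg_pending), 0 if not configured.
 */
static u32 netdev_user_pp_ring_size(struct net_device *dev, unsigned int rxq)
{
	return 0;
}

/* Untested sketch: let user config override the driver-chosen pool size
 * when a memory provider is bound. page_pool_init() would call this
 * before sizing the ptr_ring.
 */
static void page_pool_apply_user_ring_size(struct page_pool *pool)
{
	u32 user;

	if (!pool->mp_ops || !pool->slow.netdev)
		return;

	user = netdev_user_pp_ring_size(pool->slow.netdev,
					pool->slow.queue_idx);
	if (user)
		pool->p.pool_size = user;
}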

> > Does that make sense?  
> It does to me. 
> 
> I would add that for this case the page_pool limit should not apply at
> all anymore. Maybe you already thought about this, but I didn't see it
> mentioned.
> 
> > I don't immediately see why the driver needs to
> > be told the pp_ring_size via the queue API, as it's really needed by
> > the pp anyway, no? Or am I missing something?  
> It doesn't. The only corner case to take care of is when the pool_size
> is below what the driver asked for.

Hm, I can't think why a driver would care. Of course if someone sets the
cache size to a tiny value and enables IOMMU=strict they shouldn't
expect great performance.

If a driver appears that has a minimum, I think we can plumb the
checking through later.
