Message-ID: <CAHS8izP0y1t4LU3nBj4h=3zw126dMtMNHUiXASuqDNyVuyhFYQ@mail.gmail.com>
Date: Wed, 5 Nov 2025 18:56:46 -0800
From: Mina Almasry <almasrymina@...gle.com>
To: Jakub Kicinski <kuba@...nel.org>
Cc: netdev@...r.kernel.org, linux-kernel@...r.kernel.org,
Joshua Washington <joshwash@...gle.com>, Harshitha Ramamurthy <hramamurthy@...gle.com>,
Andrew Lunn <andrew+netdev@...n.ch>, "David S. Miller" <davem@...emloft.net>,
Eric Dumazet <edumazet@...gle.com>, Paolo Abeni <pabeni@...hat.com>,
Jesper Dangaard Brouer <hawk@...nel.org>, Ilias Apalodimas <ilias.apalodimas@...aro.org>,
Simon Horman <horms@...nel.org>, Willem de Bruijn <willemb@...gle.com>, ziweixiao@...gle.com,
Vedant Mathur <vedantmathur@...gle.com>
Subject: Re: [PATCH net v1 2/2] gve: use max allowed ring size for ZC page_pools

On Wed, Nov 5, 2025 at 6:22 PM Jakub Kicinski <kuba@...nel.org> wrote:
>
> On Wed, 5 Nov 2025 17:56:10 -0800 Mina Almasry wrote:
> > On Wed, Nov 5, 2025 at 5:11 PM Jakub Kicinski <kuba@...nel.org> wrote:
> > > On Wed, 5 Nov 2025 20:07:58 +0000 Mina Almasry wrote:
> > > > NCCL workloads with NCCL_P2P_PXN_LEVEL=2 or 1 are very slow with the
> > > > current gve devmem tcp configuration.
> > >
> > > Hardcoding the ring size because some other attribute makes you think
> > > that a specific application is running is rather unclean IMO..
> >
> > I did not see it this way tbh. I am thinking for devmem tcp to be as
> > robust as possible to the burstiness of frag frees, we need a bit of a
> > generous ring size. The specific application I'm referring to is just
> > an example of how this could happen.
> >
> > I was thinking maybe binding->dma_buf->size / net_iov_size (so that
> > the ring is large enough to hold every single netmem if need be) would
> > be the upper bound, but in practice increasing to the current max
> > allowed was good enough, so I'm trying that.
>
> Increasing cache sizes to the max seems very hacky at best.
> The underlying implementation uses genpool and doesn't even
> bother to do batching.
>
OK, my bad. I tried to think through the downsides of arbitrarily
increasing the ring size in a ZC scenario where the underlying memory
is pre-pinned and allocated anyway, and I couldn't think of any. But I
won't argue the point any further.

> > > Do you want me to respin the per-ring config series? Or you can take it over.
> > > IDK where the buffer size config is after recent discussion but IIUC
> > > it will not drag in my config infra so it shouldn't conflict.
> >
> > You mean this one? "[RFC net-next 00/22] net: per-queue rx-buf-len
> > configuration"
> >
> > I don't see the connection between rx-buf-len and the ring size,
> > unless you're thinking about some netlink-configurable way to
> > configure the pp->ring size?
>
> The latter. We usually have the opposite problem - drivers configure
> the cache way too large for any practical production needs and waste
> memory.
>
Sounds good. Does this sound roughly like what we're looking for here?
I'm thinking configuring the pp->ring size could be simpler than
rx-buf-len, because it's consumed entirely by core, so it may not be
something we need to bubble all the way down to the driver. Something
like:
- We add a new field, netdev_rx_queue[->mp_params?]->pp_ring_size.
- We add a netlink API to configure the above.
- When a pp is being created, we check
netdev_rx_queue[->mp_params]->pp_ring_size; if it's set, it overrides
the driver-provided value (rough sketch below).
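
To make that concrete, here is a rough, completely untested sketch of
what I have in mind -- the helper name, the exact field placement
(mp_params vs. netdev_rx_queue directly), and the hook point in
page_pool_init() are all just guesses at this stage:

/* Sketch only; assumes <net/netdev_rx_queue.h> and
 * <net/page_pool/types.h>. New per-queue override for the pp ring
 * size; existing mp fields elided.
 */
struct pp_memory_provider_params {
        /* ... existing mp fields ... */
        u32 pp_ring_size;       /* 0 == unset, keep the driver-provided size */
};

/* Hypothetical helper called while the pp is being created: look up the
 * rx queue the pool is bound to and return any netlink-configured ring
 * size.
 */
static u32 page_pool_cfg_ring_size(const struct page_pool_params *params)
{
        struct netdev_rx_queue *rxq;

        if (!params->netdev)
                return 0;       /* pool not bound to an rx queue */

        rxq = __netif_get_rx_queue(params->netdev, params->queue_idx);
        return rxq->mp_params.pp_ring_size;
}

/* ... and in page_pool_init(), before the ptr_ring is sized, something
 * like:
 *
 *      ring_size = page_pool_cfg_ring_size(params);
 *      if (ring_size)
 *              pool->p.pool_size = ring_size;  -- override driver value
 */

(The netlink op from the second bullet would then only ever write
pp_ring_size on the rx queue; nothing driver-facing.)
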
Does that make sense? I don't immediately see why the driver needs to
be told the pp_ring_size via the queue API, since it's really only
needed by the pp itself, no? Or am I missing something?
--
Thanks,
Mina