Message-ID: <CAHS8izNTtQrRfMCpiK0tN2CadgD7v5pQk9vG9pHB-FkicXGbjw@mail.gmail.com>
Date: Wed, 5 Nov 2025 14:46:40 -0800
From: Mina Almasry <almasrymina@...gle.com>
To: Harshitha Ramamurthy <hramamurthy@...gle.com>
Cc: netdev@...r.kernel.org, linux-kernel@...r.kernel.org, 
	Joshua Washington <joshwash@...gle.com>, Andrew Lunn <andrew+netdev@...n.ch>, 
	"David S. Miller" <davem@...emloft.net>, Eric Dumazet <edumazet@...gle.com>, 
	Jakub Kicinski <kuba@...nel.org>, Paolo Abeni <pabeni@...hat.com>, 
	Jesper Dangaard Brouer <hawk@...nel.org>, Ilias Apalodimas <ilias.apalodimas@...aro.org>, 
	Simon Horman <horms@...nel.org>, Willem de Bruijn <willemb@...gle.com>, ziweixiao@...gle.com, 
	Vedant Mathur <vedantmathur@...gle.com>
Subject: Re: [PATCH net v1 2/2] gve: use max allowed ring size for ZC page_pools

On Wed, Nov 5, 2025 at 2:15 PM Harshitha Ramamurthy
<hramamurthy@...gle.com> wrote:
>
> On Wed, Nov 5, 2025 at 12:08 PM Mina Almasry <almasrymina@...gle.com> wrote:
> >
> > NCCL workloads with NCCL_P2P_PXN_LEVEL=2 or 1 are very slow with the
> > current gve devmem tcp configuration.
> >
> > Root causing showed that this particular workload results in a very
> > bursty pattern of devmem allocations and frees, exhausting the page_pool
> > ring buffer. This has two effects: sock_devmem_dontneed takes up to 5ms
> > to free a batch of 128 netmems, because each free fails to find an
> > available entry in the pp->ring and falls all the way back to the (slow)
> > gen_pool; and gve_alloc_buffer hits bursts of successive allocations
> > which also don't find entries in the pp->ring (not dontneed'd yet,
> > presumably), with each allocation taking up to 100us and slowing down
> > the napi poll loop.
> >
> > From there, I suspect the slowness of the napi poll loop results in
> > the rx buffers not being processed in time and in packet drops visible
> > in tcpdump. The total sum of all this badness results in this workload
> > running at around 0.5 GB/s, when the expected perf is around 12 GB/s.
> >
> > This entire behavior can be avoided by increasing the pp->ring size to
> > the max allowed, 16384, which lets the pp absorb the bursty alloc/frees
> > of this particular workload. AFAICT there should be no negative side
> > effect of arbitrarily increasing the pp->ring size in this manner for
> > ZC configs - the memory is preallocated and pinned by the memory
> > provider anyway.
> >
> > Tested by running AllToAll PXN=2 workload. Before:
> >
> > Avg bus bandwidth    : 0.434191
> >
> > After:
> >
> > Avg bus bandwidth    : 12.5494
> >
> > Note that there is more we can do to optimize this path, such as bulk
> > netmem dontneeds, bulk netmem pp refills, and possibly taking a page
> > from the iouring zcrx playbook and replacing the gen_pool with a simpler
> > fixed-size array based allocator, but this seems sufficient to fix these
> > critical workloads.
> >
> > With thanks to Willem and Eric for helping root cause this,
> >
> > Cc: ziweixiao@...gle.com
> > Fixes: 62d7f40503bc ("gve: support unreadable netmem")
> > Reported-by: Vedant Mathur <vedantmathur@...gle.com>
> > Signed-off-by: Mina Almasry <almasrymina@...gle.com>
> > ---
> >  drivers/net/ethernet/google/gve/gve_buffer_mgmt_dqo.c | 4 ++++
> >  1 file changed, 4 insertions(+)
> >
> > diff --git a/drivers/net/ethernet/google/gve/gve_buffer_mgmt_dqo.c b/drivers/net/ethernet/google/gve/gve_buffer_mgmt_dqo.c
> > index 0e2b703c673a..f63ffdd3b3ba 100644
> > --- a/drivers/net/ethernet/google/gve/gve_buffer_mgmt_dqo.c
> > +++ b/drivers/net/ethernet/google/gve/gve_buffer_mgmt_dqo.c
> > @@ -8,6 +8,8 @@
> >  #include "gve.h"
> >  #include "gve_utils.h"
> >
> > +#include "net/netdev_queues.h"
> > +
> >  int gve_buf_ref_cnt(struct gve_rx_buf_state_dqo *bs)
> >  {
> >         return page_count(bs->page_info.page) - bs->page_info.pagecnt_bias;
> > @@ -263,6 +265,8 @@ struct page_pool *gve_rx_create_page_pool(struct gve_priv *priv,
> >         if (priv->header_split_enabled) {
> >                 pp.flags |= PP_FLAG_ALLOW_UNREADABLE_NETMEM;
> >                 pp.queue_idx = rx->q_num;
> > +               if (netif_rxq_has_unreadable_mp(priv->dev, rx->q_num))
> > +                       pp.pool_size = PAGE_POOL_MAX_RING_SIZE;
> >         }
>
> Would it make sense to also wrap setting of the
> PP_FLAG_ALLOW_UNREADABLE_NETMEM under the
> netif_rxq_has_unreadable_mp() helper?

Eh, maybe. PP_FLAG_ALLOW_UNREADABLE_NETMEM is supposed to say 'the
driver supports unreadable netmem, core can enable unreadable netmem
if it wants'. The hope was that the driver would never need to
understand whether it's actually configured for readable or
unreadable.

But then we found that for reasons like this, the driver sometimes
wants to know if it's about to be configured for unreadable, hence
netif_rxq_has_unreadable_mp was added.

I would say this bit is correct as written, but let me know if you disagree.
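
To illustrate the distinction, the hunk above roughly ends up as below
(sketch with commentary, not verbatim driver code; the surrounding gve
context is paraphrased):

	if (priv->header_split_enabled) {
		/* Capability: the driver can cope with unreadable netmem
		 * if core decides to bind an unreadable memory provider.
		 */
		pp.flags |= PP_FLAG_ALLOW_UNREADABLE_NETMEM;
		pp.queue_idx = rx->q_num;

		/* State: this rx queue is actually bound to an unreadable
		 * memory provider, so size the ring for the preallocated,
		 * pinned provider memory.
		 */
		if (netif_rxq_has_unreadable_mp(priv->dev, rx->q_num))
			pp.pool_size = PAGE_POOL_MAX_RING_SIZE;
	}

i.e. the flag is a blanket "we support it", while the helper answers
"is it in use on this queue" - two different questions, which is the
distinction described above.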

-- 
Thanks,
Mina
