Message-ID: <CAEAWyHc4zxC2wKjbO5C8TL6B8Exm6sYQTMxdOihh0PFjFMTkrg@mail.gmail.com>
Date: Wed, 5 Nov 2025 14:15:23 -0800
From: Harshitha Ramamurthy <hramamurthy@...gle.com>
To: Mina Almasry <almasrymina@...gle.com>
Cc: netdev@...r.kernel.org, linux-kernel@...r.kernel.org, 
	Joshua Washington <joshwash@...gle.com>, Andrew Lunn <andrew+netdev@...n.ch>, 
	"David S. Miller" <davem@...emloft.net>, Eric Dumazet <edumazet@...gle.com>, 
	Jakub Kicinski <kuba@...nel.org>, Paolo Abeni <pabeni@...hat.com>, 
	Jesper Dangaard Brouer <hawk@...nel.org>, Ilias Apalodimas <ilias.apalodimas@...aro.org>, 
	Simon Horman <horms@...nel.org>, Willem de Bruijn <willemb@...gle.com>, ziweixiao@...gle.com, 
	Vedant Mathur <vedantmathur@...gle.com>
Subject: Re: [PATCH net v1 2/2] gve: use max allowed ring size for ZC page_pools

On Wed, Nov 5, 2025 at 12:08 PM Mina Almasry <almasrymina@...gle.com> wrote:
>
> NCCL workloads with NCCL_P2P_PXN_LEVEL=2 or 1 are very slow with the
> current gve devmem tcp configuration.
>
> Root-causing showed that this particular workload produces a very
> bursty pattern of devmem allocations and frees, exhausting the
> page_pool ring buffer. As a result, sock_devmem_dontneed takes up to
> 5ms to free a batch of 128 netmems, since each free does not find an
> available entry in the pp->ring and falls all the way down to the
> (slow) gen_pool. Likewise, gve_alloc_buffer runs into bursts of
> successive allocations that also find no entries in the pp->ring
> (presumably not dontneed'd yet), each taking up to 100us and slowing
> down the napi poll loop.
>
> From there, I suspect, the slow napi poll loop means the rx buffers
> are not processed in time, leading to the packet drops detected by
> tcpdump. The sum of all this badness leaves this workload running at
> around 0.5 GB/s, when expected perf is around 12 GB/s.
>
> This entire behavior can be avoided by increasing the pp->ring size
> to the max allowed, 16384. That makes the pp able to absorb the
> bursty alloc/frees of this particular workload. AFAICT there should
> be no negative side effect to arbitrarily increasing the pp->ring
> size in this manner for ZC configs - the memory is preallocated and
> pinned by the memory provider anyway.
>
> Tested by running AllToAll PXN=2 workload. Before:
>
> Avg bus bandwidth    : 0.434191
>
> After:
>
> Avg bus bandwidth    : 12.5494
>
> Note that there is more we can do to optimize this path, such as bulk
> netmem dontneeds, bulk netmem pp refills, and possibly taking a page
> from the io_uring zcrx playbook and replacing the gen_pool with a
> simpler fixed-size array-based allocator, but this seems sufficient
> to fix these critical workloads.
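
On the fixed-size array idea: even a plain stack of free entries would
avoid the gen_pool search. Something along these lines, purely
illustrative, all names hypothetical:

	/* Hypothetical fixed-size free stack replacing the gen_pool:
	 * pop on alloc, push on free, both O(1) with no searching.
	 * The caller preallocates 'items' and never pushes more
	 * entries than were popped, so push needs no bounds check.
	 */
	struct gve_netmem_stack {
		netmem_ref *items;	/* preallocated array of free netmems */
		u32 count;		/* number of free entries on the stack */
	};

	static netmem_ref gve_stack_pop(struct gve_netmem_stack *s)
	{
		return s->count ? s->items[--s->count] : 0;
	}

	static void gve_stack_push(struct gve_netmem_stack *s, netmem_ref nm)
	{
		s->items[s->count++] = nm;
	}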
>
> With thanks to Willem and Eric for helping root cause this,
>
> Cc: ziweixiao@...gle.com
> Fixes: 62d7f40503bc ("gve: support unreadable netmem")
> Reported-by: Vedant Mathur <vedantmathur@...gle.com>
> Signed-off-by: Mina Almasry <almasrymina@...gle.com>
> ---
>  drivers/net/ethernet/google/gve/gve_buffer_mgmt_dqo.c | 4 ++++
>  1 file changed, 4 insertions(+)
>
> diff --git a/drivers/net/ethernet/google/gve/gve_buffer_mgmt_dqo.c b/drivers/net/ethernet/google/gve/gve_buffer_mgmt_dqo.c
> index 0e2b703c673a..f63ffdd3b3ba 100644
> --- a/drivers/net/ethernet/google/gve/gve_buffer_mgmt_dqo.c
> +++ b/drivers/net/ethernet/google/gve/gve_buffer_mgmt_dqo.c
> @@ -8,6 +8,8 @@
>  #include "gve.h"
>  #include "gve_utils.h"
>
> +#include <net/netdev_queues.h>
> +
>  int gve_buf_ref_cnt(struct gve_rx_buf_state_dqo *bs)
>  {
>         return page_count(bs->page_info.page) - bs->page_info.pagecnt_bias;
> @@ -263,6 +265,8 @@ struct page_pool *gve_rx_create_page_pool(struct gve_priv *priv,
>         if (priv->header_split_enabled) {
>                 pp.flags |= PP_FLAG_ALLOW_UNREADABLE_NETMEM;
>                 pp.queue_idx = rx->q_num;
> +               if (netif_rxq_has_unreadable_mp(priv->dev, rx->q_num))
> +                       pp.pool_size = PAGE_POOL_MAX_RING_SIZE;
>         }

Would it also make sense to wrap the setting of
PP_FLAG_ALLOW_UNREADABLE_NETMEM under the same
netif_rxq_has_unreadable_mp() check?
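
Something like this - untested, and assuming the flag is only wanted
once an unreadable memory provider is actually bound to the queue:

	if (priv->header_split_enabled) {
		pp.queue_idx = rx->q_num;
		if (netif_rxq_has_unreadable_mp(priv->dev, rx->q_num)) {
			/* Advertise unreadable netmem support and size
			 * the ring for the bursty ZC alloc/free pattern
			 * only when the queue has an unreadable mp bound.
			 */
			pp.flags |= PP_FLAG_ALLOW_UNREADABLE_NETMEM;
			pp.pool_size = PAGE_POOL_MAX_RING_SIZE;
		}
	}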
>
>         return page_pool_create(&pp);
> --
> 2.51.2.1026.g39e6a42477-goog
>
