Message-ID: <20251105200801.178381-2-almasrymina@google.com>
Date: Wed,  5 Nov 2025 20:07:58 +0000
From: Mina Almasry <almasrymina@...gle.com>
To: netdev@...r.kernel.org, linux-kernel@...r.kernel.org
Cc: Mina Almasry <almasrymina@...gle.com>, Joshua Washington <joshwash@...gle.com>, 
	Harshitha Ramamurthy <hramamurthy@...gle.com>, Andrew Lunn <andrew+netdev@...n.ch>, 
	"David S. Miller" <davem@...emloft.net>, Eric Dumazet <edumazet@...gle.com>, 
	Jakub Kicinski <kuba@...nel.org>, Paolo Abeni <pabeni@...hat.com>, 
	Jesper Dangaard Brouer <hawk@...nel.org>, Ilias Apalodimas <ilias.apalodimas@...aro.org>, 
	Simon Horman <horms@...nel.org>, Willem de Bruijn <willemb@...gle.com>, ziweixiao@...gle.com, 
	Vedant Mathur <vedantmathur@...gle.com>
Subject: [PATCH net v1 2/2] gve: use max allowed ring size for ZC page_pools

NCCL workloads with NCCL_P2P_PXN_LEVEL=2 or 1 are very slow with the
current gve devmem tcp configuration.

Root causing showed that this particular workload produces a very
bursty pattern of devmem allocations and frees, which exhausts the
page_pool ring buffer. As a result, sock_devmem_dontneed takes up to
5ms to free a batch of 128 netmems, because each free fails to find an
available entry in the pp->ring and falls all the way back to the
(slow) gen_pool. On the alloc side, gve_alloc_buffer runs into bursts
of successive allocations that also find no entries in the pp->ring
(presumably not dontneed'd yet), with each allocation taking up to
100us and slowing down the napi poll loop.
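
To make the failure mode concrete, here is a minimal userspace model
of the behavior described above - not the kernel page_pool code, and
all names are made up. A bounded recycle ring sits between alloc and
free; a burst of frees overflows it and spills to the slow path, and
the following burst of allocs then finds the ring drained and hits
the slow path as well:

/*
 * Minimal model: RING_SIZE plays the role of the pp->ring capacity.
 * ring_put() is the fast-path free, ring_get() the fast-path alloc;
 * a return of 0/NULL means the caller has to take the slow path
 * (the gen_pool in the real driver).
 */
#include <stdio.h>

#define RING_SIZE 128

static void *ring[RING_SIZE];
static int head, tail, count;

static int ring_put(void *buf)
{
        if (count == RING_SIZE)
                return 0;       /* ring full: free falls to slow path */
        ring[head] = buf;
        head = (head + 1) % RING_SIZE;
        count++;
        return 1;
}

static void *ring_get(void)
{
        void *buf;

        if (!count)
                return NULL;    /* ring empty: alloc falls to slow path */
        buf = ring[tail];
        tail = (tail + 1) % RING_SIZE;
        count--;
        return buf;
}

int main(void)
{
        int i, slow_frees = 0, slow_allocs = 0;

        /* bursty free: everything past RING_SIZE entries is slow */
        for (i = 0; i < 4096; i++)
                if (!ring_put((void *)1L))
                        slow_frees++;

        /* bursty alloc: the ring drains after RING_SIZE entries */
        for (i = 0; i < 4096; i++)
                if (!ring_get())
                        slow_allocs++;

        printf("slow frees: %d, slow allocs: %d\n",
               slow_frees, slow_allocs);
        return 0;
}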

From there, the slowness of the napi poll loop results, I suspect, in
the rx buffers not being processed in time, and in packet drops
detected by tcpdump. The sum of all this badness is that the workload
runs at around 0.5 GB/s, when the expected perf is around 12 GB/s.

This entire behavior can be avoided by increasing the pp->ring size to
the max allowed 16384, which makes the pp able to absorb the bursty
allocs/frees of this particular workload. AFAICT there should be no
negative side effect of increasing the pp->ring size in this manner
for ZC configs - the memory is preallocated and pinned by the memory
provider anyway.

Tested by running AllToAll PXN=2 workload. Before:

Avg bus bandwidth    : 0.434191

After:

Avg bus bandwidth    : 12.5494

Note that there is more we can do to optimize this path, such as bulk
netmem dontneeds, bulk netmem pp refills, and possibly taking a page
from the io_uring zcrx playbook and replacing the gen_pool with a
simpler fixed-size array based allocator (a sketch of that idea
follows below), but this seems sufficient to fix these critical
workloads.
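
As a hedged sketch of that fixed-size array allocator idea - again a
userspace model with made-up names, not actual kernel code - the free
slots can be kept as a stack of indices, making both alloc and free
O(1) with no gen_pool tree walk:

/*
 * Made-up names throughout; NR_BUFS mirrors the 16384 max ring size.
 * The free list is a stack of buffer indices: buf_alloc() pops,
 * buf_free() pushes.
 */
#include <stdio.h>

#define NR_BUFS 16384

static unsigned int freelist[NR_BUFS];
static unsigned int nr_free;

static void allocator_init(void)
{
        unsigned int i;

        for (i = 0; i < NR_BUFS; i++)
                freelist[i] = i;
        nr_free = NR_BUFS;
}

static int buf_alloc(void)
{
        if (!nr_free)
                return -1;      /* pool exhausted */
        return freelist[--nr_free];
}

static void buf_free(unsigned int idx)
{
        freelist[nr_free++] = idx;
}

int main(void)
{
        int a, b;

        allocator_init();
        a = buf_alloc();
        b = buf_alloc();
        buf_free(a);
        buf_free(b);
        printf("got %d and %d back, %u slots free\n", a, b, nr_free);
        return 0;
}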

With thanks to Willem and Eric for helping root cause this.

Cc: ziweixiao@...gle.com
Fixes: 62d7f40503bc ("gve: support unreadable netmem")
Reported-by: Vedant Mathur <vedantmathur@...gle.com>
Signed-off-by: Mina Almasry <almasrymina@...gle.com>
---
 drivers/net/ethernet/google/gve/gve_buffer_mgmt_dqo.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/net/ethernet/google/gve/gve_buffer_mgmt_dqo.c b/drivers/net/ethernet/google/gve/gve_buffer_mgmt_dqo.c
index 0e2b703c673a..f63ffdd3b3ba 100644
--- a/drivers/net/ethernet/google/gve/gve_buffer_mgmt_dqo.c
+++ b/drivers/net/ethernet/google/gve/gve_buffer_mgmt_dqo.c
@@ -8,6 +8,8 @@
 #include "gve.h"
 #include "gve_utils.h"
 
+#include <net/netdev_queues.h>
+
 int gve_buf_ref_cnt(struct gve_rx_buf_state_dqo *bs)
 {
 	return page_count(bs->page_info.page) - bs->page_info.pagecnt_bias;
@@ -263,6 +265,8 @@ struct page_pool *gve_rx_create_page_pool(struct gve_priv *priv,
 	if (priv->header_split_enabled) {
 		pp.flags |= PP_FLAG_ALLOW_UNREADABLE_NETMEM;
 		pp.queue_idx = rx->q_num;
+		if (netif_rxq_has_unreadable_mp(priv->dev, rx->q_num))
+			pp.pool_size = PAGE_POOL_MAX_RING_SIZE;
 	}
 
 	return page_pool_create(&pp);
-- 
2.51.2.1026.g39e6a42477-goog

