Message-ID: <CAHS8izPrf1b_H_FNu2JtnMVpaD6SwHGvg6bC=Fjd4zfp=-pd6w@mail.gmail.com>
Date: Wed, 3 Sep 2025 13:20:57 -0700
From: Mina Almasry <almasrymina@...gle.com>
To: Bobby Eshleman <bobbyeshleman@...il.com>
Cc: "David S. Miller" <davem@...emloft.net>, Eric Dumazet <edumazet@...gle.com>,
Jakub Kicinski <kuba@...nel.org>, Paolo Abeni <pabeni@...hat.com>, Simon Horman <horms@...nel.org>,
Kuniyuki Iwashima <kuniyu@...gle.com>, Willem de Bruijn <willemb@...gle.com>,
Neal Cardwell <ncardwell@...gle.com>, David Ahern <dsahern@...nel.org>, netdev@...r.kernel.org,
linux-kernel@...r.kernel.org, Stanislav Fomichev <sdf@...ichev.me>,
Bobby Eshleman <bobbyeshleman@...a.com>
Subject: Re: [PATCH net-next 2/2] net: devmem: use niov array for token management
On Tue, Sep 2, 2025 at 2:36 PM Bobby Eshleman <bobbyeshleman@...il.com> wrote:
>
> From: Bobby Eshleman <bobbyeshleman@...a.com>
>
> Improve CPU performance of devmem token management by using page offsets
> as dmabuf tokens and using them for direct array access lookups instead
> of xarray lookups. Consequently, the xarray can be removed. The result
> is an average 5% reduction in CPU cycles spent by devmem RX user
> threads.
>
Great!
> This patch changes the meaning of tokens. Tokens previously referred to
> unique fragments of pages. In this patch tokens instead represent
> references to pages, not fragments. Because of this, multiple tokens
> may refer to the same page and so have identical value (e.g., two small
> fragments may coexist on the same page). The token and offset pair that
> the user receives uniquely identifies fragments if needed. This assumes
> that the user is not attempting to sort / uniq the token list using
> tokens alone.
>
> A new restriction is added to the implementation: devmem RX sockets
> cannot switch dmabuf bindings. In practice, this is a symptom of invalid
> configuration as a flow would have to be steered to a different queue or
> device where there is a different binding, which is generally bad for
> TCP flows.
Please do not assume that configurations you don't use or care about
are invalid. Reconfiguring flow steering while a flow is active
currently works as intended; this is a regression that needs to be
resolved. But more importantly, looking at your code, I don't think
this is a restriction you need to introduce?
> This restriction is necessary because the 32-bit dmabuf token
> does not have enough bits to represent both the pages in a large dmabuf
> and also a binding or dmabuf ID. For example, a system with 8 NICs and
> 32 queues requires 8 bits for a binding / queue ID (8 NICs * 32 queues
> == 256 queues total == 2^8), which leaves only 24 bits for dmabuf pages
> (2^24 * 4096 / (1<<30) == 64GB). This is insufficient for the device and
> queue numbers on many current systems or systems that may need larger
> GPU dmabufs (as for hard limits, my current H100 has 80GB GPU memory per
> device).
>
> Using kperf[1] with 4 flows and workers, this patch improves receive
> worker CPU util by ~4.9% with slightly better throughput.
>
> Before, mean cpu util for rx workers ~83.6%:
>
> Average: CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
> Average: 4 2.30 0.00 79.43 0.00 0.65 0.21 0.00 0.00 0.00 17.41
> Average: 5 2.27 0.00 80.40 0.00 0.45 0.21 0.00 0.00 0.00 16.67
> Average: 6 2.28 0.00 80.47 0.00 0.46 0.25 0.00 0.00 0.00 16.54
> Average: 7 2.42 0.00 82.05 0.00 0.46 0.21 0.00 0.00 0.00 14.86
>
> After, mean cpu util % for rx workers ~78.7%:
>
> Average: CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
> Average: 4 2.61 0.00 73.31 0.00 0.76 0.11 0.00 0.00 0.00 23.20
> Average: 5 2.95 0.00 74.24 0.00 0.66 0.22 0.00 0.00 0.00 21.94
> Average: 6 2.81 0.00 73.38 0.00 0.97 0.11 0.00 0.00 0.00 22.73
> Average: 7 3.05 0.00 78.76 0.00 0.76 0.11 0.00 0.00 0.00 17.32
>
I think effectively all you're doing in this patch is replacing the
xarray with a regular array, right? I'm surprised an xarray accounts
for 5% of CPU utilization. I wonder if you had debug configs turned on
during these experiments. Can you perf trace what about the xarray is
taking so long? I wonder if we're just using xarrays improperly (maybe
hitting constant resizing slow paths or something), and a similar
improvement could be had by adjusting the xarray flags or some such.
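FWIW, something like this should show where the cycles go (sketch;
adjust the CPU list to wherever your rx workers are pinned):

    # record call graphs on the rx worker CPUs during a kperf run
    perf record -g -C 4-7 -- sleep 10
    # then look at where the xa_*/xas_* symbols land
    perf report --sort symbol

If the xas_* expansion/allocation helpers dominate, that would point
at the resizing slowpath.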
> Mean throughput improves, but falls within a standard deviation (~45GB/s
> for 4 flows on a 50GB/s NIC, one hop).
>
> This patch adds an array of atomics for counting the tokens returned to
> the user for a given page. There is a 4-byte atomic per page in the
> dmabuf per socket. Given a 2GB dmabuf, this array is 2MB.
>
> [1]: https://github.com/facebookexperimental/kperf
>
> Signed-off-by: Bobby Eshleman <bobbyeshleman@...a.com>
> ---
> include/net/sock.h | 5 ++-
> net/core/devmem.c | 17 ++++----
> net/core/devmem.h | 2 +-
> net/core/sock.c | 24 +++++++----
> net/ipv4/tcp.c | 107 +++++++++++++++--------------------------------
> net/ipv4/tcp_ipv4.c | 40 +++++++++++++++---
> net/ipv4/tcp_minisocks.c | 2 -
> 7 files changed, 99 insertions(+), 98 deletions(-)
>
> diff --git a/include/net/sock.h b/include/net/sock.h
> index 1e7f124871d2..70c97880229d 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -573,7 +573,10 @@ struct sock {
> #endif
> struct rcu_head sk_rcu;
> netns_tracker ns_tracker;
> - struct xarray sk_user_frags;
> + struct {
> + struct net_devmem_dmabuf_binding *binding;
> + atomic_t *urefs;
> + } sk_user_frags;
>
AFAIU, if you made sk_user_frags an array of (uref, binding) tuples
instead of just an array of urefs, then you could remove the
single-binding restriction.
Although, I wonder what happens if the socket receives netmems at the
same index on 2 different dmabufs. At that point I assume the wrong
uref gets incremented? :(
One way or another the single-binding restriction needs to be removed
I think. It's regressing a UAPI that currently works.
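Roughly what I have in mind, as an untested sketch (struct/field names
made up):

    /* one entry per token (page offset), replacing the bare urefs array */
    struct sk_devmem_uref {
            struct net_devmem_dmabuf_binding *binding; /* binding the token came from */
            atomic_t uref; /* tokens outstanding to userspace for this page */
    };

recvmsg would record the binding per token, and dontneed would look it
up from there instead of from a single per-socket pointer. It still
doesn't disambiguate the same index outstanding on 2 bindings at once,
though.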
> #if IS_ENABLED(CONFIG_PROVE_LOCKING) && IS_ENABLED(CONFIG_MODULES)
> struct module *sk_owner;
> diff --git a/net/core/devmem.c b/net/core/devmem.c
> index b4c570d4f37a..50e92dcf5bf1 100644
> --- a/net/core/devmem.c
> +++ b/net/core/devmem.c
> @@ -187,6 +187,7 @@ net_devmem_bind_dmabuf(struct net_device *dev,
> struct dma_buf *dmabuf;
> unsigned int sg_idx, i;
> unsigned long virtual;
> + gfp_t flags;
> int err;
>
> if (!dma_dev) {
> @@ -230,14 +231,14 @@ net_devmem_bind_dmabuf(struct net_device *dev,
> goto err_detach;
> }
>
> - if (direction == DMA_TO_DEVICE) {
> - binding->vec = kvmalloc_array(dmabuf->size / PAGE_SIZE,
> - sizeof(struct net_iov *),
> - GFP_KERNEL);
> - if (!binding->vec) {
> - err = -ENOMEM;
> - goto err_unmap;
> - }
> + flags = (direction == DMA_FROM_DEVICE) ? __GFP_ZERO : 0;
> +
Why not pass __GFP_ZERO unconditionally?
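i.e. just:

    binding->vec = kvmalloc_array(dmabuf->size / PAGE_SIZE,
                                  sizeof(struct net_iov *),
                                  GFP_KERNEL | __GFP_ZERO);

The TX path shouldn't care that the array starts zeroed, and you drop
the branch.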
> + binding->vec = kvmalloc_array(dmabuf->size / PAGE_SIZE,
> + sizeof(struct net_iov *),
> + GFP_KERNEL | flags);
> + if (!binding->vec) {
> + err = -ENOMEM;
> + goto err_unmap;
> }
>
> /* For simplicity we expect to make PAGE_SIZE allocations, but the
> diff --git a/net/core/devmem.h b/net/core/devmem.h
> index 2ada54fb63d7..d4eb28d079bb 100644
> --- a/net/core/devmem.h
> +++ b/net/core/devmem.h
> @@ -61,7 +61,7 @@ struct net_devmem_dmabuf_binding {
>
> /* Array of net_iov pointers for this binding, sorted by virtual
> * address. This array is convenient to map the virtual addresses to
> - * net_iovs in the TX path.
> + * net_iovs.
> */
> struct net_iov **vec;
>
> diff --git a/net/core/sock.c b/net/core/sock.c
> index 9a8290fcc35d..3a5cb4e10519 100644
> --- a/net/core/sock.c
> +++ b/net/core/sock.c
> @@ -87,6 +87,7 @@
>
> #include <linux/unaligned.h>
> #include <linux/capability.h>
> +#include <linux/dma-buf.h>
> #include <linux/errno.h>
> #include <linux/errqueue.h>
> #include <linux/types.h>
> @@ -151,6 +152,7 @@
> #include <uapi/linux/pidfd.h>
>
> #include "dev.h"
> +#include "devmem.h"
>
> static DEFINE_MUTEX(proto_list_mutex);
> static LIST_HEAD(proto_list);
> @@ -1100,32 +1102,40 @@ sock_devmem_dontneed(struct sock *sk, sockptr_t optval, unsigned int optlen)
> return -EFAULT;
> }
>
> - xa_lock_bh(&sk->sk_user_frags);
> for (i = 0; i < num_tokens; i++) {
> for (j = 0; j < tokens[i].token_count; j++) {
> + struct net_iov *niov;
> + unsigned int token;
> + netmem_ref netmem;
> +
> + token = tokens[i].token_start + j;
> + if (WARN_ONCE(token >= sk->sk_user_frags.binding->dmabuf->size / PAGE_SIZE,
> + "invalid token passed from user"))
> + break;
WARNs on invalid user behavior are a non-starter AFAIU. For one,
syzbot trivially reproduces them and files a bug. Please remove all of
them. pr_err may be acceptable for extremely bad errors. Invalid user
input is not worthy of WARN or pr_err.
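i.e. the plain check is fine, just without the splat:

    if (token >= sk->sk_user_frags.binding->dmabuf->size / PAGE_SIZE)
            break; /* bad token from userspace, not a kernel bug */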
> +
> if (++num_frags > MAX_DONTNEED_FRAGS)
> goto frag_limit_reached;
> -
> - netmem_ref netmem = (__force netmem_ref)__xa_erase(
> - &sk->sk_user_frags, tokens[i].token_start + j);
> + niov = sk->sk_user_frags.binding->vec[token];
> + netmem = net_iov_to_netmem(niov);
So token is the index into both sk->sk_user_frags.urefs and
sk->sk_user_frags.binding->vec?
An xarray is a resizable array, so AFAIU what you're doing here,
abstractly, is replacing a resizable array with an array of max size,
no? (I didn't read too closely yet, I may be missing something.)
Which makes me think that, either due to a bug or due to the specifics
of your setup, the xarray is unreasonably expensive. Without having
investigated the details, I wonder if we're constantly hitting a
resizing slowpath in the xarray code; I think this needs some
investigation.
>
> if (!netmem || WARN_ON_ONCE(!netmem_is_net_iov(netmem)))
> continue;
>
> + if (WARN_ONCE(atomic_dec_if_positive(&sk->sk_user_frags.urefs[token])
> + < 0, "user released token too many times"))
Here and everywhere, please remove the WARNs for weird user behavior.
> + continue;
> +
> netmems[netmem_num++] = netmem;
> if (netmem_num == ARRAY_SIZE(netmems)) {
> - xa_unlock_bh(&sk->sk_user_frags);
> for (k = 0; k < netmem_num; k++)
> WARN_ON_ONCE(!napi_pp_put_page(netmems[k]));
> netmem_num = 0;
> - xa_lock_bh(&sk->sk_user_frags);
You removed the locking, but it's not clear to me why we don't need it
anymore. What's stopping 2 dontneeds from freeing at the same time?
I'm guessing it's that the urefs are atomic, so we don't need any
extra sync?
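If that is the reasoning, it deserves a comment in the code. My
reading of the dec path, for what it's worth:

    /* Two racing dontneeds on the same token: only one caller can move
     * urefs[token] from 1 to 0; the loser sees -1 and skips the put, so
     * the niov is never over-released.
     */
    if (atomic_dec_if_positive(&sk->sk_user_frags.urefs[token]) < 0)
            continue;

But please confirm and spell it out.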
> }
> ret++;
> }
> }
>
> frag_limit_reached:
> - xa_unlock_bh(&sk->sk_user_frags);
> for (k = 0; k < netmem_num; k++)
> WARN_ON_ONCE(!napi_pp_put_page(netmems[k]));
>
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index 40b774b4f587..585b50fa8c00 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -261,6 +261,7 @@
> #include <linux/memblock.h>
> #include <linux/highmem.h>
> #include <linux/cache.h>
> +#include <linux/dma-buf.h>
> #include <linux/err.h>
> #include <linux/time.h>
> #include <linux/slab.h>
> @@ -475,7 +476,8 @@ void tcp_init_sock(struct sock *sk)
>
> set_bit(SOCK_SUPPORT_ZC, &sk->sk_socket->flags);
> sk_sockets_allocated_inc(sk);
> - xa_init_flags(&sk->sk_user_frags, XA_FLAGS_ALLOC1);
> + sk->sk_user_frags.binding = NULL;
> + sk->sk_user_frags.urefs = NULL;
> }
> EXPORT_IPV6_MOD(tcp_init_sock);
>
> @@ -2386,68 +2388,6 @@ static int tcp_inq_hint(struct sock *sk)
> return inq;
> }
>
> -/* batch __xa_alloc() calls and reduce xa_lock()/xa_unlock() overhead. */
> -struct tcp_xa_pool {
> - u8 max; /* max <= MAX_SKB_FRAGS */
> - u8 idx; /* idx <= max */
> - __u32 tokens[MAX_SKB_FRAGS];
> - netmem_ref netmems[MAX_SKB_FRAGS];
> -};
> -
> -static void tcp_xa_pool_commit_locked(struct sock *sk, struct tcp_xa_pool *p)
> -{
> - int i;
> -
> - /* Commit part that has been copied to user space. */
> - for (i = 0; i < p->idx; i++)
> - __xa_cmpxchg(&sk->sk_user_frags, p->tokens[i], XA_ZERO_ENTRY,
> - (__force void *)p->netmems[i], GFP_KERNEL);
> - /* Rollback what has been pre-allocated and is no longer needed. */
> - for (; i < p->max; i++)
> - __xa_erase(&sk->sk_user_frags, p->tokens[i]);
> -
> - p->max = 0;
> - p->idx = 0;
> -}
> -
> -static void tcp_xa_pool_commit(struct sock *sk, struct tcp_xa_pool *p)
> -{
> - if (!p->max)
> - return;
> -
> - xa_lock_bh(&sk->sk_user_frags);
> -
> - tcp_xa_pool_commit_locked(sk, p);
> -
> - xa_unlock_bh(&sk->sk_user_frags);
> -}
> -
> -static int tcp_xa_pool_refill(struct sock *sk, struct tcp_xa_pool *p,
> - unsigned int max_frags)
> -{
> - int err, k;
> -
> - if (p->idx < p->max)
> - return 0;
> -
> - xa_lock_bh(&sk->sk_user_frags);
> -
> - tcp_xa_pool_commit_locked(sk, p);
> -
> - for (k = 0; k < max_frags; k++) {
> - err = __xa_alloc(&sk->sk_user_frags, &p->tokens[k],
> - XA_ZERO_ENTRY, xa_limit_31b, GFP_KERNEL);
> - if (err)
> - break;
> - }
> -
> - xa_unlock_bh(&sk->sk_user_frags);
> -
> - p->max = k;
> - p->idx = 0;
> - return k ? 0 : err;
> -}
> -
> /* On error, returns the -errno. On success, returns number of bytes sent to the
> * user. May not consume all of @remaining_len.
> */
> @@ -2456,14 +2396,11 @@ static int tcp_recvmsg_dmabuf(struct sock *sk, const struct sk_buff *skb,
> int remaining_len)
> {
> struct dmabuf_cmsg dmabuf_cmsg = { 0 };
> - struct tcp_xa_pool tcp_xa_pool;
> unsigned int start;
> int i, copy, n;
> int sent = 0;
> int err = 0;
>
> - tcp_xa_pool.max = 0;
> - tcp_xa_pool.idx = 0;
> do {
> start = skb_headlen(skb);
>
> @@ -2510,8 +2447,11 @@ static int tcp_recvmsg_dmabuf(struct sock *sk, const struct sk_buff *skb,
> */
> for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
> skb_frag_t *frag = &skb_shinfo(skb)->frags[i];
> + struct net_devmem_dmabuf_binding *binding;
> struct net_iov *niov;
> u64 frag_offset;
> + size_t size;
> + u32 token;
> int end;
>
> /* !skb_frags_readable() should indicate that ALL the
> @@ -2544,13 +2484,35 @@ static int tcp_recvmsg_dmabuf(struct sock *sk, const struct sk_buff *skb,
> start;
> dmabuf_cmsg.frag_offset = frag_offset;
> dmabuf_cmsg.frag_size = copy;
> - err = tcp_xa_pool_refill(sk, &tcp_xa_pool,
> - skb_shinfo(skb)->nr_frags - i);
> - if (err)
> +
> + binding = net_devmem_iov_binding(niov);
> +
> + if (!sk->sk_user_frags.binding) {
> + sk->sk_user_frags.binding = binding;
> +
> + size = binding->dmabuf->size / PAGE_SIZE;
> + sk->sk_user_frags.urefs = kzalloc(size,
> + GFP_KERNEL);
> + if (!sk->sk_user_frags.urefs) {
> + sk->sk_user_frags.binding = NULL;
> + err = -ENOMEM;
> + goto out;
> + }
> +
> + net_devmem_dmabuf_binding_get(binding);
It's not clear to me why we need to get the binding. AFAIR the way it
works is that we grab a reference on the net_iov, which guarantees
that the associated pp stays alive, which in turn guarantees that the
binding stays alive, so we don't need to get the binding for every
frag.
> + }
> +
> + if (WARN_ONCE(sk->sk_user_frags.binding != binding,
> + "binding changed for devmem socket")) {
Remove the WARN_ONCE. It's not reasonable to splat in the kernel log
because the user reconfigured flow steering, I think.
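i.e. just return the error:

    if (sk->sk_user_frags.binding != binding) {
            err = -EFAULT;
            goto out;
    }

(whether -EFAULT is the right errno for "binding changed" is a
separate question).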
> + err = -EFAULT;
> goto out;
> + }
> +
> + token = net_iov_virtual_addr(niov) >> PAGE_SHIFT;
> + binding->vec[token] = niov;
I don't think you can do this? I thought vec[token] was already ==
niov. binding->vec should be initialized when the binding is
initialized, not re-written on every recvmsg, no?
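i.e. fill vec[] once at bind time, something like (sketch; I haven't
checked the exact iteration in net_devmem_bind_dmabuf()):

    /* for each niov created for this binding: */
    binding->vec[net_iov_virtual_addr(niov) >> PAGE_SHIFT] = niov;

and then recvmsg only ever reads it.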
> + dmabuf_cmsg.frag_token = token;
>
> /* Will perform the exchange later */
> - dmabuf_cmsg.frag_token = tcp_xa_pool.tokens[tcp_xa_pool.idx];
> dmabuf_cmsg.dmabuf_id = net_devmem_iov_binding_id(niov);
>
> offset += copy;
> @@ -2563,8 +2525,9 @@ static int tcp_recvmsg_dmabuf(struct sock *sk, const struct sk_buff *skb,
> if (err)
> goto out;
>
> + atomic_inc(&sk->sk_user_frags.urefs[token]);
> +
> atomic_long_inc(&niov->pp_ref_count);
> - tcp_xa_pool.netmems[tcp_xa_pool.idx++] = skb_frag_netmem(frag);
>
> sent += copy;
>
> @@ -2574,7 +2537,6 @@ static int tcp_recvmsg_dmabuf(struct sock *sk, const struct sk_buff *skb,
> start = end;
> }
>
> - tcp_xa_pool_commit(sk, &tcp_xa_pool);
> if (!remaining_len)
> goto out;
>
> @@ -2592,7 +2554,6 @@ static int tcp_recvmsg_dmabuf(struct sock *sk, const struct sk_buff *skb,
> }
>
> out:
> - tcp_xa_pool_commit(sk, &tcp_xa_pool);
> if (!sent)
> sent = err;
>
> diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> index 1e58a8a9ff7a..bdcb8cc003af 100644
> --- a/net/ipv4/tcp_ipv4.c
> +++ b/net/ipv4/tcp_ipv4.c
> @@ -87,6 +87,9 @@
> #include <crypto/hash.h>
> #include <linux/scatterlist.h>
>
> +#include <linux/dma-buf.h>
> +#include "../core/devmem.h"
> +
> #include <trace/events/tcp.h>
>
> #ifdef CONFIG_TCP_MD5SIG
> @@ -2529,11 +2532,38 @@ static void tcp_md5sig_info_free_rcu(struct rcu_head *head)
> static void tcp_release_user_frags(struct sock *sk)
> {
> #ifdef CONFIG_PAGE_POOL
> - unsigned long index;
> - void *netmem;
> + struct net_devmem_dmabuf_binding *binding;
> + struct net_iov *niov;
> + unsigned int token;
> + netmem_ref netmem;
> +
> + if (!sk->sk_user_frags.urefs)
> + return;
> +
> + binding = sk->sk_user_frags.binding;
> + if (!binding || !binding->vec)
> + return;
> +
> + for (token = 0; token < binding->dmabuf->size / PAGE_SIZE; token++) {
> + niov = binding->vec[token];
> +
> + /* never used by recvmsg() */
> + if (!niov)
> + continue;
> +
> + if (!net_is_devmem_iov(niov))
> + continue;
> +
> + netmem = net_iov_to_netmem(niov);
>
> - xa_for_each(&sk->sk_user_frags, index, netmem)
> - WARN_ON_ONCE(!napi_pp_put_page((__force netmem_ref)netmem));
> + while (atomic_dec_return(&sk->sk_user_frags.urefs[token]) >= 0)
> + WARN_ON_ONCE(!napi_pp_put_page(netmem));
> + }
> +
> + net_devmem_dmabuf_binding_put(binding);
> + sk->sk_user_frags.binding = NULL;
> + kvfree(sk->sk_user_frags.urefs);
> + sk->sk_user_frags.urefs = NULL;
> #endif
> }
>
> @@ -2543,8 +2573,6 @@ void tcp_v4_destroy_sock(struct sock *sk)
>
> tcp_release_user_frags(sk);
>
> - xa_destroy(&sk->sk_user_frags);
> -
> trace_tcp_destroy_sock(sk);
>
> tcp_clear_xmit_timers(sk);
> diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
> index d1c9e4088646..4e8ea73daab7 100644
> --- a/net/ipv4/tcp_minisocks.c
> +++ b/net/ipv4/tcp_minisocks.c
> @@ -639,8 +639,6 @@ struct sock *tcp_create_openreq_child(const struct sock *sk,
>
> __TCP_INC_STATS(sock_net(sk), TCP_MIB_PASSIVEOPENS);
>
> - xa_init_flags(&newsk->sk_user_frags, XA_FLAGS_ALLOC1);
> -
> return newsk;
> }
> EXPORT_SYMBOL(tcp_create_openreq_child);
>
> --
> 2.47.3
>
--
Thanks,
Mina