Message-ID: <aSCV1ihk4pZoBDeI@mini-arch>
Date: Fri, 21 Nov 2025 08:39:50 -0800
From: Stanislav Fomichev <stfomichev@...il.com>
To: Bobby Eshleman <bobbyeshleman@...il.com>
Cc: "David S. Miller" <davem@...emloft.net>,
Eric Dumazet <edumazet@...gle.com>,
Jakub Kicinski <kuba@...nel.org>, Paolo Abeni <pabeni@...hat.com>,
Simon Horman <horms@...nel.org>,
Kuniyuki Iwashima <kuniyu@...gle.com>,
Willem de Bruijn <willemb@...gle.com>,
Neal Cardwell <ncardwell@...gle.com>,
David Ahern <dsahern@...nel.org>, Arnd Bergmann <arnd@...db.de>,
Jonathan Corbet <corbet@....net>,
Andrew Lunn <andrew+netdev@...n.ch>, Shuah Khan <shuah@...nel.org>,
Donald Hunter <donald.hunter@...il.com>,
Mina Almasry <almasrymina@...gle.com>, netdev@...r.kernel.org,
linux-kernel@...r.kernel.org, linux-arch@...r.kernel.org,
linux-doc@...r.kernel.org, linux-kselftest@...r.kernel.org,
Stanislav Fomichev <sdf@...ichev.me>,
Bobby Eshleman <bobbyeshleman@...a.com>
Subject: Re: [PATCH net-next v7 3/5] net: devmem: implement autorelease token
management
On 11/19, Bobby Eshleman wrote:
> From: Bobby Eshleman <bobbyeshleman@...a.com>
>
> Add support for autorelease toggling of tokens using a static branch to
> control system-wide behavior. This allows applications to choose between
> two memory management modes:
>
> 1. Autorelease on: Leaked tokens are automatically released when the
> socket closes.
>
> 2. Autorelease off: Leaked tokens are released during dmabuf unbind.
>
> The autorelease mode is requested via the NETDEV_A_DMABUF_AUTORELEASE
> attribute of the NETDEV_CMD_BIND_RX message. Having separate modes per
> binding is disallowed and is rejected by netlink. The system will be
> "locked" into the mode that the first binding is set to. It can only be
> changed again once there are zero bindings on the system.
>
> Disabling autorelease offers ~13% improvement in CPU utilization.
>
> Static branching is used to limit the system to one mode or the other.
>
> Signed-off-by: Bobby Eshleman <bobbyeshleman@...a.com>
> ---
> Changes in v7:
> - implement autorelease with static branch (Stan)
> - use netlink instead of sockopt (Stan)
> - merge uAPI and implementation patches into one patch (seemed less
> confusing)
>
> Changes in v6:
> - remove sk_devmem_info.autorelease, using binding->autorelease instead
> - move binding->autorelease check to outside of
> net_devmem_dmabuf_binding_put_urefs() (Mina)
> - remove overly defensive net_is_devmem_iov() (Mina)
> - add comment about multiple urefs mapping to a single netmem ref (Mina)
> - remove overly defense netmem NULL and netmem_is_net_iov checks (Mina)
> - use niov without casting back and forth with netmem (Mina)
> - move the autorelease flag from per-binding to per-socket (Mina)
> - remove the batching logic in sock_devmem_dontneed_manual_release()
> (Mina)
> - move autorelease check inside tcp_xa_pool_commit() (Mina)
> - remove single-binding restriction for autorelease mode (Mina)
> - unbind always checks for leaked urefs
>
> Changes in v5:
> - remove unused variables
> - introduce autorelease flag, preparing for future patch toggle new
> behavior
>
> Changes in v3:
> - make urefs per-binding instead of per-socket, reducing memory
> footprint
> - fallback to cleaning up references in dmabuf unbind if socket leaked
> tokens
> - drop ethtool patch
>
> Changes in v2:
> - always use GFP_ZERO for binding->vec (Mina)
> - remove WARN for changed binding (Mina)
> - remove extraneous binding ref get (Mina)
> - remove WARNs on invalid user input (Mina)
> - pre-assign niovs in binding->vec for RX case (Mina)
> - use atomic_set(, 0) to initialize sk_user_frags.urefs
> - fix length of alloc for urefs
> ---
> Documentation/netlink/specs/netdev.yaml | 12 ++++
> include/net/netmem.h | 1 +
> include/net/sock.h | 7 +-
> include/uapi/linux/netdev.h | 1 +
> net/core/devmem.c | 109 +++++++++++++++++++++++++++-----
> net/core/devmem.h | 11 +++-
> net/core/netdev-genl-gen.c | 5 +-
> net/core/netdev-genl.c | 13 +++-
> net/core/sock.c | 57 +++++++++++++++--
> net/ipv4/tcp.c | 78 ++++++++++++++++++-----
> net/ipv4/tcp_ipv4.c | 13 +++-
> net/ipv4/tcp_minisocks.c | 3 +-
> tools/include/uapi/linux/netdev.h | 1 +
> 13 files changed, 262 insertions(+), 49 deletions(-)
>
> diff --git a/Documentation/netlink/specs/netdev.yaml b/Documentation/netlink/specs/netdev.yaml
> index 82bf5cb2617d..913fccca4c4e 100644
> --- a/Documentation/netlink/specs/netdev.yaml
> +++ b/Documentation/netlink/specs/netdev.yaml
> @@ -562,6 +562,17 @@ attribute-sets:
> type: u32
> checks:
> min: 1
> + -
> + name: autorelease
> + doc: |
> + Token autorelease mode. If true (1), leaked tokens are automatically
> + released when the socket closes. If false (0), leaked tokens are only
> + released when the dmabuf is unbound. Once a binding is created with a
> + specific mode, all subsequent bindings system-wide must use the same
> + mode.
> +
> + Optional. Defaults to false if not specified.
> + type: u8
>
> operations:
> list:
> @@ -767,6 +778,7 @@ operations:
> - ifindex
> - fd
> - queues
> + - autorelease
> reply:
> attributes:
> - id
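Side note, mostly to double-check my reading of the new uAPI: I assume the
generated YNL user code grows a matching setter for this attribute, so an
ncdevmem-style application would opt in along these lines. Rough sketch
only; netdev_bind_rx_req_set_autorelease() is the setter name I'd expect
the generator to emit, the rest mirrors the existing bind-rx helpers from
netdev-user.h:

    /* ys is a ynl_sock for the netdev family, queues/n as in ncdevmem */
    static int bind_rx_autorelease(struct ynl_sock *ys, int ifindex,
                                   int dmabuf_fd,
                                   struct netdev_queue_id *queues,
                                   unsigned int n)
    {
            struct netdev_bind_rx_req *req;
            struct netdev_bind_rx_rsp *rsp;
            int id;

            req = netdev_bind_rx_req_alloc();
            if (!req)
                    return -1;

            netdev_bind_rx_req_set_ifindex(req, ifindex);
            netdev_bind_rx_req_set_fd(req, dmabuf_fd);
            __netdev_bind_rx_req_set_queues(req, queues, n);
            /* assumed generated setter for the new attribute */
            netdev_bind_rx_req_set_autorelease(req, 1);

            rsp = netdev_bind_rx(ys, req);
            netdev_bind_rx_req_free(req);
            if (!rsp)
                    return -1; /* e.g. mode mismatch with an existing binding */

            id = rsp->id;
            netdev_bind_rx_rsp_free(rsp);
            return id;
    }
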
> diff --git a/include/net/netmem.h b/include/net/netmem.h
> index 9e10f4ac50c3..80d2263ba4ed 100644
> --- a/include/net/netmem.h
> +++ b/include/net/netmem.h
> @@ -112,6 +112,7 @@ struct net_iov {
> };
> struct net_iov_area *owner;
> enum net_iov_type type;
> + atomic_t uref;
> };
>
> struct net_iov_area {
> diff --git a/include/net/sock.h b/include/net/sock.h
> index a5f36ea9d46f..797b21148945 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -350,7 +350,7 @@ struct sk_filter;
> * @sk_scm_rights: flagged by SO_PASSRIGHTS to recv SCM_RIGHTS
> * @sk_scm_unused: unused flags for scm_recv()
> * @ns_tracker: tracker for netns reference
> - * @sk_user_frags: xarray of pages the user is holding a reference on.
> + * @sk_devmem_info: the devmem binding information for the socket
> * @sk_owner: reference to the real owner of the socket that calls
> * sock_lock_init_class_and_name().
> */
> @@ -579,7 +579,10 @@ struct sock {
> struct numa_drop_counters *sk_drop_counters;
> struct rcu_head sk_rcu;
> netns_tracker ns_tracker;
> - struct xarray sk_user_frags;
> + struct {
> + struct xarray frags;
> + struct net_devmem_dmabuf_binding *binding;
> + } sk_devmem_info;
>
> #if IS_ENABLED(CONFIG_PROVE_LOCKING) && IS_ENABLED(CONFIG_MODULES)
> struct module *sk_owner;
> diff --git a/include/uapi/linux/netdev.h b/include/uapi/linux/netdev.h
> index 048c8de1a130..dff0be8223a4 100644
> --- a/include/uapi/linux/netdev.h
> +++ b/include/uapi/linux/netdev.h
> @@ -206,6 +206,7 @@ enum {
> NETDEV_A_DMABUF_QUEUES,
> NETDEV_A_DMABUF_FD,
> NETDEV_A_DMABUF_ID,
> + NETDEV_A_DMABUF_AUTORELEASE,
>
> __NETDEV_A_DMABUF_MAX,
> NETDEV_A_DMABUF_MAX = (__NETDEV_A_DMABUF_MAX - 1)
> diff --git a/net/core/devmem.c b/net/core/devmem.c
> index 4dee2666dd07..bba21c6cb195 100644
> --- a/net/core/devmem.c
> +++ b/net/core/devmem.c
> @@ -11,6 +11,7 @@
> #include <linux/genalloc.h>
> #include <linux/mm.h>
> #include <linux/netdevice.h>
> +#include <linux/skbuff_ref.h>
> #include <linux/types.h>
> #include <net/netdev_queues.h>
> #include <net/netdev_rx_queue.h>
> @@ -28,6 +29,17 @@
>
> static DEFINE_XARRAY_FLAGS(net_devmem_dmabuf_bindings, XA_FLAGS_ALLOC1);
>
> +/* Static key to lock down autorelease to a single mode on a system. When
> + * enabled: autorelease mode (leaked tokens released on socket close). When
> + * disabled: manual mode (leaked tokens released on dmabuf unbind). Once the
> + * first binding is created, the mode is locked system-wide and can not be
> + * changed until the system has zero bindings again.
> + *
> + * Protected by xa_lock of net_devmem_dmabuf_bindings.
> + */
> +DEFINE_STATIC_KEY_FALSE(tcp_devmem_ar_key);
> +EXPORT_SYMBOL(tcp_devmem_ar_key);
> +
> static const struct memory_provider_ops dmabuf_devmem_ops;
>
> bool net_is_devmem_iov(struct net_iov *niov)
> @@ -116,6 +128,24 @@ void net_devmem_free_dmabuf(struct net_iov *niov)
> gen_pool_free(binding->chunk_pool, dma_addr, PAGE_SIZE);
> }
>
> +static void
> +net_devmem_dmabuf_binding_put_urefs(struct net_devmem_dmabuf_binding *binding)
> +{
> + int i;
> +
> + for (i = 0; i < binding->dmabuf->size / PAGE_SIZE; i++) {
> + struct net_iov *niov;
> + netmem_ref netmem;
> +
> + niov = binding->vec[i];
> + netmem = net_iov_to_netmem(niov);
> +
> + /* Multiple urefs map to only a single netmem ref. */
> + if (atomic_xchg(&niov->uref, 0) > 0)
> + WARN_ON_ONCE(!napi_pp_put_page(netmem));
> + }
> +}
> +
> void net_devmem_unbind_dmabuf(struct net_devmem_dmabuf_binding *binding)
> {
> struct netdev_rx_queue *rxq;
> @@ -143,6 +173,10 @@ void net_devmem_unbind_dmabuf(struct net_devmem_dmabuf_binding *binding)
> __net_mp_close_rxq(binding->dev, rxq_idx, &mp_params);
> }
>
> + /* Clean up any lingering urefs from sockets that had autorelease
> + * disabled.
> + */
> + net_devmem_dmabuf_binding_put_urefs(binding);
> net_devmem_dmabuf_binding_put(binding);
> }
>
> @@ -179,8 +213,10 @@ struct net_devmem_dmabuf_binding *
> net_devmem_bind_dmabuf(struct net_device *dev,
> struct device *dma_dev,
> enum dma_data_direction direction,
> - unsigned int dmabuf_fd, struct netdev_nl_sock *priv,
> - struct netlink_ext_ack *extack)
> + unsigned int dmabuf_fd,
> + struct netdev_nl_sock *priv,
> + struct netlink_ext_ack *extack,
> + bool autorelease)
> {
> struct net_devmem_dmabuf_binding *binding;
> static u32 id_alloc_next;
> @@ -231,14 +267,13 @@ net_devmem_bind_dmabuf(struct net_device *dev,
> goto err_detach;
> }
>
> - if (direction == DMA_TO_DEVICE) {
> - binding->vec = kvmalloc_array(dmabuf->size / PAGE_SIZE,
> - sizeof(struct net_iov *),
> - GFP_KERNEL);
> - if (!binding->vec) {
> - err = -ENOMEM;
> - goto err_unmap;
> - }
> + /* Used by TX and also by RX when socket has autorelease disabled */
> + binding->vec = kvmalloc_array(dmabuf->size / PAGE_SIZE,
> + sizeof(struct net_iov *),
> + GFP_KERNEL | __GFP_ZERO);
> + if (!binding->vec) {
> + err = -ENOMEM;
> + goto err_unmap;
> }
>
> /* For simplicity we expect to make PAGE_SIZE allocations, but the
> @@ -292,25 +327,67 @@ net_devmem_bind_dmabuf(struct net_device *dev,
> niov = &owner->area.niovs[i];
> niov->type = NET_IOV_DMABUF;
> niov->owner = &owner->area;
> + atomic_set(&niov->uref, 0);
> page_pool_set_dma_addr_netmem(net_iov_to_netmem(niov),
> net_devmem_get_dma_addr(niov));
> - if (direction == DMA_TO_DEVICE)
> - binding->vec[owner->area.base_virtual / PAGE_SIZE + i] = niov;
> + binding->vec[owner->area.base_virtual / PAGE_SIZE + i] = niov;
> }
>
> virtual += len;
> }
>
> - err = xa_alloc_cyclic(&net_devmem_dmabuf_bindings, &binding->id,
> - binding, xa_limit_32b, &id_alloc_next,
> - GFP_KERNEL);
> + /* Enforce system-wide autorelease mode consistency for RX bindings.
> + * TX bindings don't use autorelease (always false) since tokens aren't
> + * leaked in TX path. Only RX bindings must all have the same
> + * autorelease mode, never mixed.
> + *
> + * We use the xarray's lock to atomically check xa_empty() and toggle
> + * the static key, avoiding the race where two new bindings may try to
> + * set the static key to different states.
> + */
> + xa_lock(&net_devmem_dmabuf_bindings);
> +
> + if (direction == DMA_FROM_DEVICE) {
> + if (!xa_empty(&net_devmem_dmabuf_bindings)) {
> + bool mode;
> +
> + mode = static_key_enabled(&tcp_devmem_ar_key);
> +
> + /* When bindings exist, enforce that the mode does not
> + * change.
> + */
> + if (mode != autorelease) {
> + NL_SET_ERR_MSG_FMT(extack,
> + "System already configured with autorelease=%d",
> + mode);
> + err = -EINVAL;
> + goto err_unlock_xa;
> + }
> + } else {
> + /* First binding sets the mode for all subsequent
> + * bindings.
> + */
> + if (autorelease)
> + static_branch_enable(&tcp_devmem_ar_key);
[..]
> + else
> + static_branch_disable(&tcp_devmem_ar_key);
nit: I don't think this is needed? DEFINE_STATIC_KEY_FALSE should already
have the key disabled by default, IIUC.
I was also expecting to see something in net_devmem_unbind_dmabuf() to undo
the static_branch_enable(&tcp_devmem_ar_key) and bring the system back to
the default state once the last binding goes away, but I can't find it.
Am I missing something?
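Something roughly like this is what I had in mind, right after the binding
is erased from net_devmem_dmabuf_bindings (completely untested sketch; the
exact placement/locking would need some care, since static_branch_disable()
can sleep and therefore can't sit directly under the xa_lock):

        if (xa_empty(&net_devmem_dmabuf_bindings) &&
            static_key_enabled(&tcp_devmem_ar_key))
                static_branch_disable(&tcp_devmem_ar_key);
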
The rest looks good to me, thanks! Let's see if Mina sees any problems
with this approach. IMO, it's a bit easier to reason about two separate
code paths now.