Message-ID: <aWUgNd6nOzZY3JCJ@devvm11784.nha0.facebook.com>
Date: Mon, 12 Jan 2026 08:24:21 -0800
From: Bobby Eshleman <bobbyeshleman@...il.com>
To: Mina Almasry <almasrymina@...gle.com>
Cc: "David S. Miller" <davem@...emloft.net>,
Eric Dumazet <edumazet@...gle.com>,
Jakub Kicinski <kuba@...nel.org>, Paolo Abeni <pabeni@...hat.com>,
Simon Horman <horms@...nel.org>,
Kuniyuki Iwashima <kuniyu@...gle.com>,
Willem de Bruijn <willemb@...gle.com>,
Neal Cardwell <ncardwell@...gle.com>,
David Ahern <dsahern@...nel.org>, Arnd Bergmann <arnd@...db.de>,
Jonathan Corbet <corbet@....net>,
Andrew Lunn <andrew+netdev@...n.ch>, Shuah Khan <shuah@...nel.org>,
Donald Hunter <donald.hunter@...il.com>,
Stanislav Fomichev <sdf@...ichev.me>, netdev@...r.kernel.org,
linux-kernel@...r.kernel.org, linux-arch@...r.kernel.org,
linux-doc@...r.kernel.org, linux-kselftest@...r.kernel.org,
asml.silence@...il.com, matttbe@...nel.org, skhawaja@...gle.com,
Bobby Eshleman <bobbyeshleman@...a.com>
Subject: Re: [PATCH net-next v9 3/5] net: devmem: implement autorelease token
management
On Sun, Jan 11, 2026 at 11:12:19AM -0800, Mina Almasry wrote:
> On Fri, Jan 9, 2026 at 6:19 PM Bobby Eshleman <bobbyeshleman@...il.com> wrote:
> >
> > From: Bobby Eshleman <bobbyeshleman@...a.com>
> >
> > Add support for autorelease toggling of tokens using a static branch to
> > control system-wide behavior. This allows applications to choose between
> > two memory management modes:
> >
> > 1. Autorelease on: Leaked tokens are automatically released when the
> > socket closes.
> >
> > 2. Autorelease off: Leaked tokens are released during dmabuf unbind.
> >
> > The autorelease mode is requested via the NETDEV_A_DMABUF_AUTORELEASE
> > attribute of the NETDEV_CMD_BIND_RX message. Having separate modes per
> > binding is disallowed and is rejected by netlink. The system will be
> > "locked" into the mode that the first binding is set to. It can only be
> > changed again once there are zero bindings on the system.
> >
> > Disabling autorelease offers ~13% improvement in CPU utilization.
> >
> > Static branching is used to limit the system to one mode or the other.
> >
> > Signed-off-by: Bobby Eshleman <bobbyeshleman@...a.com>
> > ---
> > Changes in v9:
> > - Add missing stub for net_devmem_dmabuf_binding_get() when NET_DEVMEM=n
> > - Add wrapper around tcp_devmem_ar_key accesses so that it may be
> > stubbed out when NET_DEVMEM=n
> > - only dec rx binding count for rx bindings in free (v8 did not exclude
> > TX bindings)
> >
> > Changes in v8:
> > - Only reset static key when bindings go to zero, defaulting back to
> > disabled (Stan).
> > - Fix bad usage of xarray spinlock for sleepy static branch switching,
> > use mutex instead.
> > - Access pp_ref_count via niov->desc instead of niov directly.
> > - Move reset of static key to __net_devmem_dmabuf_binding_free() so that
> > the static key can not be changed while there are outstanding tokens
> > (free is only called when reference count reaches zero).
> > - Add net_devmem_dmabuf_rx_bindings_count because tokens may be active
> > even after xa_erase(), so static key changes must wait until all
> > RX bindings are finally freed (not just when xarray is empty). A
> > counter is a simple way to track this.
> > - socket takes reference on the binding, to avoid use-after-free on
> > sk_devmem_info.binding in the case that user releases all tokens,
> > unbinds, then issues SO_DEVMEM_DONTNEED again (with bad token).
> > - removed some comments that were unnecessary
> >
> > Changes in v7:
> > - implement autorelease with static branch (Stan)
> > - use netlink instead of sockopt (Stan)
> > - merge uAPI and implementation patches into one patch (seemed less
> > confusing)
> >
> > Changes in v6:
> > - remove sk_devmem_info.autorelease, using binding->autorelease instead
> > - move binding->autorelease check to outside of
> > net_devmem_dmabuf_binding_put_urefs() (Mina)
> > - remove overly defensive net_is_devmem_iov() (Mina)
> > - add comment about multiple urefs mapping to a single netmem ref (Mina)
> > - remove overly defensive netmem NULL and netmem_is_net_iov checks (Mina)
> > - use niov without casting back and forth with netmem (Mina)
> > - move the autorelease flag from per-binding to per-socket (Mina)
> > - remove the batching logic in sock_devmem_dontneed_manual_release()
> > (Mina)
> > - move autorelease check inside tcp_xa_pool_commit() (Mina)
> > - remove single-binding restriction for autorelease mode (Mina)
> > - unbind always checks for leaked urefs
> >
> > Changes in v5:
> > - remove unused variables
> > - introduce autorelease flag, preparing for a future patch to toggle the
> >   new behavior
> >
> > Changes in v3:
> > - make urefs per-binding instead of per-socket, reducing memory
> > footprint
> > - fallback to cleaning up references in dmabuf unbind if socket leaked
> > tokens
> > - drop ethtool patch
> >
> > Changes in v2:
> > - always use GFP_ZERO for binding->vec (Mina)
> > - remove WARN for changed binding (Mina)
> > - remove extraneous binding ref get (Mina)
> > - remove WARNs on invalid user input (Mina)
> > - pre-assign niovs in binding->vec for RX case (Mina)
> > - use atomic_set(, 0) to initialize sk_user_frags.urefs
> > - fix length of alloc for urefs
> > ---
> > Documentation/netlink/specs/netdev.yaml | 12 ++++
> > include/net/netmem.h | 1 +
> > include/net/sock.h | 7 ++-
> > include/uapi/linux/netdev.h | 1 +
> > net/core/devmem.c | 104 ++++++++++++++++++++++++++++----
> > net/core/devmem.h | 27 ++++++++-
> > net/core/netdev-genl-gen.c | 5 +-
> > net/core/netdev-genl.c | 10 ++-
> > net/core/sock.c | 57 +++++++++++++++--
> > net/ipv4/tcp.c | 76 ++++++++++++++++++-----
> > net/ipv4/tcp_ipv4.c | 11 +++-
> > net/ipv4/tcp_minisocks.c | 3 +-
> > tools/include/uapi/linux/netdev.h | 1 +
> > 13 files changed, 269 insertions(+), 46 deletions(-)
> >
> > diff --git a/Documentation/netlink/specs/netdev.yaml b/Documentation/netlink/specs/netdev.yaml
> > index 596c306ce52b..7cbe9e7b9ee5 100644
> > --- a/Documentation/netlink/specs/netdev.yaml
> > +++ b/Documentation/netlink/specs/netdev.yaml
> > @@ -562,6 +562,17 @@ attribute-sets:
> > type: u32
> > checks:
> > min: 1
> > + -
> > + name: autorelease
> > + doc: |
> > + Token autorelease mode. If true (1), leaked tokens are automatically
> > + released when the socket closes. If false (0), leaked tokens are only
> > + released when the dmabuf is unbound. Once a binding is created with a
> > + specific mode, all subsequent bindings system-wide must use the same
> > + mode.
> > +
> > + Optional. Defaults to false if not specified.
> > + type: u8
> >
> > operations:
> > list:
> > @@ -769,6 +780,7 @@ operations:
> > - ifindex
> > - fd
> > - queues
> > + - autorelease
> > reply:
> > attributes:
> > - id
> > diff --git a/include/net/netmem.h b/include/net/netmem.h
> > index 9e10f4ac50c3..80d2263ba4ed 100644
> > --- a/include/net/netmem.h
> > +++ b/include/net/netmem.h
> > @@ -112,6 +112,7 @@ struct net_iov {
> > };
> > struct net_iov_area *owner;
> > enum net_iov_type type;
> > + atomic_t uref;
> > };
> >
> > struct net_iov_area {
> > diff --git a/include/net/sock.h b/include/net/sock.h
> > index aafe8bdb2c0f..9d3d5bde15e9 100644
> > --- a/include/net/sock.h
> > +++ b/include/net/sock.h
> > @@ -352,7 +352,7 @@ struct sk_filter;
> > * @sk_scm_rights: flagged by SO_PASSRIGHTS to recv SCM_RIGHTS
> > * @sk_scm_unused: unused flags for scm_recv()
> > * @ns_tracker: tracker for netns reference
> > - * @sk_user_frags: xarray of pages the user is holding a reference on.
> > + * @sk_devmem_info: the devmem binding information for the socket
> > * @sk_owner: reference to the real owner of the socket that calls
> > * sock_lock_init_class_and_name().
> > */
> > @@ -584,7 +584,10 @@ struct sock {
> > struct numa_drop_counters *sk_drop_counters;
> > struct rcu_head sk_rcu;
> > netns_tracker ns_tracker;
> > - struct xarray sk_user_frags;
> > + struct {
> > + struct xarray frags;
> > + struct net_devmem_dmabuf_binding *binding;
> > + } sk_devmem_info;
> >
> > #if IS_ENABLED(CONFIG_PROVE_LOCKING) && IS_ENABLED(CONFIG_MODULES)
> > struct module *sk_owner;
> > diff --git a/include/uapi/linux/netdev.h b/include/uapi/linux/netdev.h
> > index e0b579a1df4f..1e5c209cb998 100644
> > --- a/include/uapi/linux/netdev.h
> > +++ b/include/uapi/linux/netdev.h
> > @@ -207,6 +207,7 @@ enum {
> > NETDEV_A_DMABUF_QUEUES,
> > NETDEV_A_DMABUF_FD,
> > NETDEV_A_DMABUF_ID,
> > + NETDEV_A_DMABUF_AUTORELEASE,
> >
> > __NETDEV_A_DMABUF_MAX,
> > NETDEV_A_DMABUF_MAX = (__NETDEV_A_DMABUF_MAX - 1)
> > diff --git a/net/core/devmem.c b/net/core/devmem.c
> > index 05a9a9e7abb9..05c16df657c7 100644
> > --- a/net/core/devmem.c
> > +++ b/net/core/devmem.c
> > @@ -11,6 +11,7 @@
> > #include <linux/genalloc.h>
> > #include <linux/mm.h>
> > #include <linux/netdevice.h>
> > +#include <linux/skbuff_ref.h>
> > #include <linux/types.h>
> > #include <net/netdev_queues.h>
> > #include <net/netdev_rx_queue.h>
> > @@ -28,6 +29,19 @@
> >
> > static DEFINE_XARRAY_FLAGS(net_devmem_dmabuf_bindings, XA_FLAGS_ALLOC1);
> >
> > +/* If the user unbinds before releasing all tokens, the static key must not
> > + * change until all tokens have been released (to avoid calling the wrong
> > + * SO_DEVMEM_DONTNEED handler). We prevent this by making static key changes
> > + * and binding alloc/free atomic with regards to each other, using the
> > + * devmem_ar_lock. This works because binding free does not occur until all of
> > + * the outstanding token's references on the binding are dropped.
> > + */
> > +static DEFINE_MUTEX(devmem_ar_lock);
> > +
> > +DEFINE_STATIC_KEY_FALSE(tcp_devmem_ar_key);
> > +EXPORT_SYMBOL(tcp_devmem_ar_key);
> > +static int net_devmem_dmabuf_rx_bindings_count;
> > +
> > static const struct memory_provider_ops dmabuf_devmem_ops;
> >
> > bool net_is_devmem_iov(struct net_iov *niov)
> > @@ -60,6 +74,14 @@ void __net_devmem_dmabuf_binding_free(struct work_struct *wq)
> >
> > size_t size, avail;
> >
> > + if (binding->direction == DMA_FROM_DEVICE) {
> > + mutex_lock(&devmem_ar_lock);
> > + net_devmem_dmabuf_rx_bindings_count--;
> > + if (net_devmem_dmabuf_rx_bindings_count == 0)
> > + static_branch_disable(&tcp_devmem_ar_key);
> > + mutex_unlock(&devmem_ar_lock);
> > + }
> > +
>
> I find this logic with devmem_ar_lock and
> net_devmem_dmabuf_rx_bindings_count a bit complicated. I wonder if we
> can do another simplification here? Can we have it such that the first
> binding sets the system in autorelease on or autorelease off mode, and
> all future bindings maintain this state? We already don't support
> autorelease on/off mix.
I think that would greatly simplify things. We would still need a lock
to make the static branch change and the first binding's mode selection
atomic WRT each other, but the other parts (like the one above) can be
removed.
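
Rough sketch of what I'm imagining (completely untested; it keeps
tcp_devmem_ar_key and devmem_ar_lock from this version, drops the RX
bindings counter, and adds a made-up one-shot devmem_ar_mode_locked
latch just for illustration):

static bool devmem_ar_mode_locked;

static int net_devmem_set_autorelease_mode(bool autorelease,
					   struct netlink_ext_ack *extack)
{
	int err = 0;

	mutex_lock(&devmem_ar_lock);
	if (devmem_ar_mode_locked) {
		/* The first RX binding already picked the mode;
		 * later bindings must match it.
		 */
		if (static_key_enabled(&tcp_devmem_ar_key) != autorelease) {
			NL_SET_ERR_MSG_FMT(extack,
					   "System already configured with autorelease=%d",
					   static_key_enabled(&tcp_devmem_ar_key));
			err = -EINVAL;
		}
	} else {
		/* First RX binding sets the mode for good. */
		if (autorelease)
			static_branch_enable(&tcp_devmem_ar_key);
		devmem_ar_mode_locked = true;
	}
	mutex_unlock(&devmem_ar_lock);

	return err;
}

With that, __net_devmem_dmabuf_binding_free() would not need to touch
the static key or a bindings counter at all.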
>
>
> > gen_pool_for_each_chunk(binding->chunk_pool,
> > net_devmem_dmabuf_free_chunk_owner, NULL);
> >
> > @@ -116,6 +138,24 @@ void net_devmem_free_dmabuf(struct net_iov *niov)
> > gen_pool_free(binding->chunk_pool, dma_addr, PAGE_SIZE);
> > }
> >
> > +static void
> > +net_devmem_dmabuf_binding_put_urefs(struct net_devmem_dmabuf_binding *binding)
> > +{
> > + int i;
> > +
> > + for (i = 0; i < binding->dmabuf->size / PAGE_SIZE; i++) {
> > + struct net_iov *niov;
> > + netmem_ref netmem;
> > +
> > + niov = binding->vec[i];
> > + netmem = net_iov_to_netmem(niov);
> > +
> > + /* Multiple urefs map to only a single netmem ref. */
> > + if (atomic_xchg(&niov->uref, 0) > 0)
> > + WARN_ON_ONCE(!napi_pp_put_page(netmem));
> > + }
> > +}
> > +
> > void net_devmem_unbind_dmabuf(struct net_devmem_dmabuf_binding *binding)
> > {
> > struct netdev_rx_queue *rxq;
> > @@ -143,6 +183,7 @@ void net_devmem_unbind_dmabuf(struct net_devmem_dmabuf_binding *binding)
> > __net_mp_close_rxq(binding->dev, rxq_idx, &mp_params);
> > }
> >
> > + net_devmem_dmabuf_binding_put_urefs(binding);
>
> Sigh, I think what you're trying to do here is very complicated. You
> need to think about this scenario:
>
> 1. User binds a dmabuf and opens an autorelease=off socket.
> 2. Data arrives on this socket and sits in the receive queue;
> recvmsg() has not been called yet by the user.
> 3. User unbinds the dma-buf; netmems are still in the receive queue.
> 4. User calls recvmsg() on the socket, which obtains a uref on
> the netmems in the receive queue.
> 5. User closes the socket.
>
> With autorelease=on, this works, because the binding remains alive
> until step 5 (even though it's unbound from the queue,
> ..._binding_free has not been called yet) and step 5 cleans up all
> references, even if the binding is unbound but alive.
>
> Calling net_devmem_dmabuf_binding_put_urefs here is weird.
> Autorelease=off implies the user must clean their urefs themselves,
> but we have this here in the unbind path, and it doesn't even
> guarantee that the urefs are free at this point because it may race
> with a recvmsg.
>
> Should we delete this uref cleanup here, and enforce that
> autorelease=off means that the user cleans up the references (the
> kernel never cleans them up on unbind or socket close)? The dontneed
> path needs to work whether the binding is active or unbound.
>
I agree, I think we can do away with the "unbind drops references" idea.
A counterargument could be that it lets one process interfere with
another, but in fact that is already possible with autorelease=on by
never issuing dontneed and starving the other process of tokens.
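
FWIW, from the application side, autorelease=off would then always mean
the app hands every token back itself via the existing SO_DEVMEM_DONTNEED
/ struct dmabuf_token uAPI; roughly (sketch only; sock_fd and dmabuf_cmsg
stand in for whatever the app already has from recvmsg()):

	struct dmabuf_token tok = {
		.token_start = dmabuf_cmsg->frag_token, /* from the cmsg */
		.token_count = 1,
	};
	int ret;

	ret = setsockopt(sock_fd, SOL_SOCKET, SO_DEVMEM_DONTNEED,
			 &tok, sizeof(tok));
	if (ret != 1) /* the sockopt returns how many frags it released */
		fprintf(stderr, "failed to release token %u\n", tok.token_start);

If the app exits without doing that, the kernel would simply never
reclaim those urefs on its behalf.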
> > net_devmem_dmabuf_binding_put(binding);
> > }
> >
> > @@ -179,8 +220,10 @@ struct net_devmem_dmabuf_binding *
> > net_devmem_bind_dmabuf(struct net_device *dev,
> > struct device *dma_dev,
> > enum dma_data_direction direction,
> > - unsigned int dmabuf_fd, struct netdev_nl_sock *priv,
> > - struct netlink_ext_ack *extack)
> > + unsigned int dmabuf_fd,
> > + struct netdev_nl_sock *priv,
> > + struct netlink_ext_ack *extack,
> > + bool autorelease)
> > {
> > struct net_devmem_dmabuf_binding *binding;
> > static u32 id_alloc_next;
> > @@ -231,14 +274,12 @@ net_devmem_bind_dmabuf(struct net_device *dev,
> > goto err_detach;
> > }
> >
> > - if (direction == DMA_TO_DEVICE) {
> > - binding->vec = kvmalloc_array(dmabuf->size / PAGE_SIZE,
> > - sizeof(struct net_iov *),
> > - GFP_KERNEL);
> > - if (!binding->vec) {
> > - err = -ENOMEM;
> > - goto err_unmap;
> > - }
> > + binding->vec = kvmalloc_array(dmabuf->size / PAGE_SIZE,
> > + sizeof(struct net_iov *),
> > + GFP_KERNEL | __GFP_ZERO);
> > + if (!binding->vec) {
> > + err = -ENOMEM;
> > + goto err_unmap;
> > }
> >
> > /* For simplicity we expect to make PAGE_SIZE allocations, but the
> > @@ -292,25 +333,62 @@ net_devmem_bind_dmabuf(struct net_device *dev,
> > niov = &owner->area.niovs[i];
> > niov->type = NET_IOV_DMABUF;
> > niov->owner = &owner->area;
> > + atomic_set(&niov->uref, 0);
> > page_pool_set_dma_addr_netmem(net_iov_to_netmem(niov),
> > net_devmem_get_dma_addr(niov));
> > - if (direction == DMA_TO_DEVICE)
> > - binding->vec[owner->area.base_virtual / PAGE_SIZE + i] = niov;
> > + binding->vec[owner->area.base_virtual / PAGE_SIZE + i] = niov;
> > }
> >
> > virtual += len;
> > }
> >
> > + mutex_lock(&devmem_ar_lock);
> > +
> > + if (direction == DMA_FROM_DEVICE) {
> > + if (net_devmem_dmabuf_rx_bindings_count > 0) {
> > + bool mode;
> > +
> > + mode = static_key_enabled(&tcp_devmem_ar_key);
> > +
> > + /* When bindings exist, enforce that the mode does not
> > + * change.
> > + */
> > + if (mode != autorelease) {
> > + NL_SET_ERR_MSG_FMT(extack,
> > + "System already configured with autorelease=%d",
> > + mode);
> > + err = -EINVAL;
> > + goto err_unlock_mutex;
> > + }
> > + } else if (autorelease) {
> > + /* First binding with autorelease enabled sets the
> > + * mode. If autorelease is false, the key is already
> > + * disabled by default so no action is needed.
> > + */
> > + static_branch_enable(&tcp_devmem_ar_key);
> > + }
> > +
> > + net_devmem_dmabuf_rx_bindings_count++;
> > + }
> > +
> > err = xa_alloc_cyclic(&net_devmem_dmabuf_bindings, &binding->id,
> > binding, xa_limit_32b, &id_alloc_next,
> > GFP_KERNEL);
> > if (err < 0)
> > - goto err_free_chunks;
> > + goto err_dec_binding_count;
> > +
> > + mutex_unlock(&devmem_ar_lock);
> >
> > list_add(&binding->list, &priv->bindings);
> >
> > return binding;
> >
> > +err_dec_binding_count:
> > + if (direction == DMA_FROM_DEVICE)
> > + net_devmem_dmabuf_rx_bindings_count--;
> > +
> > +err_unlock_mutex:
> > + mutex_unlock(&devmem_ar_lock);
> > err_free_chunks:
> > gen_pool_for_each_chunk(binding->chunk_pool,
> > net_devmem_dmabuf_free_chunk_owner, NULL);
> > diff --git a/net/core/devmem.h b/net/core/devmem.h
> > index 1ea6228e4f40..8c586f30e371 100644
> > --- a/net/core/devmem.h
> > +++ b/net/core/devmem.h
> > @@ -12,9 +12,13 @@
> >
> > #include <net/netmem.h>
> > #include <net/netdev_netlink.h>
> > +#include <linux/jump_label.h>
> >
> > struct netlink_ext_ack;
> >
> > +/* static key for TCP devmem autorelease */
> > +extern struct static_key_false tcp_devmem_ar_key;
> > +
> > struct net_devmem_dmabuf_binding {
> > struct dma_buf *dmabuf;
> > struct dma_buf_attachment *attachment;
> > @@ -61,7 +65,7 @@ struct net_devmem_dmabuf_binding {
> >
> > /* Array of net_iov pointers for this binding, sorted by virtual
> > * address. This array is convenient to map the virtual addresses to
> > - * net_iovs in the TX path.
> > + * net_iovs.
> > */
> > struct net_iov **vec;
> >
> > @@ -88,7 +92,7 @@ net_devmem_bind_dmabuf(struct net_device *dev,
> > struct device *dma_dev,
> > enum dma_data_direction direction,
> > unsigned int dmabuf_fd, struct netdev_nl_sock *priv,
> > - struct netlink_ext_ack *extack);
> > + struct netlink_ext_ack *extack, bool autorelease);
> > struct net_devmem_dmabuf_binding *net_devmem_lookup_dmabuf(u32 id);
> > void net_devmem_unbind_dmabuf(struct net_devmem_dmabuf_binding *binding);
> > int net_devmem_bind_dmabuf_to_queue(struct net_device *dev, u32 rxq_idx,
> > @@ -138,6 +142,11 @@ net_devmem_dmabuf_binding_put(struct net_devmem_dmabuf_binding *binding)
> > schedule_work(&binding->unbind_w);
> > }
> >
> > +static inline bool net_devmem_autorelease_enabled(void)
> > +{
> > + return static_branch_unlikely(&tcp_devmem_ar_key);
> > +}
> > +
> > void net_devmem_get_net_iov(struct net_iov *niov);
> > void net_devmem_put_net_iov(struct net_iov *niov);
> >
> > @@ -155,6 +164,12 @@ net_devmem_get_niov_at(struct net_devmem_dmabuf_binding *binding, size_t addr,
> > #else
> > struct net_devmem_dmabuf_binding;
> >
> > +static inline bool
> > +net_devmem_dmabuf_binding_get(struct net_devmem_dmabuf_binding *binding)
> > +{
> > + return false;
> > +}
> > +
> > static inline void
> > net_devmem_dmabuf_binding_put(struct net_devmem_dmabuf_binding *binding)
> > {
> > @@ -174,7 +189,8 @@ net_devmem_bind_dmabuf(struct net_device *dev,
> > enum dma_data_direction direction,
> > unsigned int dmabuf_fd,
> > struct netdev_nl_sock *priv,
> > - struct netlink_ext_ack *extack)
> > + struct netlink_ext_ack *extack,
> > + bool autorelease)
> > {
> > return ERR_PTR(-EOPNOTSUPP);
> > }
> > @@ -241,6 +257,11 @@ net_devmem_iov_binding(const struct net_iov *niov)
> > {
> > return NULL;
> > }
> > +
> > +static inline bool net_devmem_autorelease_enabled(void)
> > +{
> > + return false;
> > +}
> > #endif
> >
> > #endif /* _NET_DEVMEM_H */
> > diff --git a/net/core/netdev-genl-gen.c b/net/core/netdev-genl-gen.c
> > index ba673e81716f..01b7765e11ec 100644
> > --- a/net/core/netdev-genl-gen.c
> > +++ b/net/core/netdev-genl-gen.c
> > @@ -86,10 +86,11 @@ static const struct nla_policy netdev_qstats_get_nl_policy[NETDEV_A_QSTATS_SCOPE
> > };
> >
> > /* NETDEV_CMD_BIND_RX - do */
> > -static const struct nla_policy netdev_bind_rx_nl_policy[NETDEV_A_DMABUF_FD + 1] = {
> > +static const struct nla_policy netdev_bind_rx_nl_policy[NETDEV_A_DMABUF_AUTORELEASE + 1] = {
> > [NETDEV_A_DMABUF_IFINDEX] = NLA_POLICY_MIN(NLA_U32, 1),
> > [NETDEV_A_DMABUF_FD] = { .type = NLA_U32, },
> > [NETDEV_A_DMABUF_QUEUES] = NLA_POLICY_NESTED(netdev_queue_id_nl_policy),
> > + [NETDEV_A_DMABUF_AUTORELEASE] = { .type = NLA_U8, },
> > };
> >
> > /* NETDEV_CMD_NAPI_SET - do */
> > @@ -188,7 +189,7 @@ static const struct genl_split_ops netdev_nl_ops[] = {
> > .cmd = NETDEV_CMD_BIND_RX,
> > .doit = netdev_nl_bind_rx_doit,
> > .policy = netdev_bind_rx_nl_policy,
> > - .maxattr = NETDEV_A_DMABUF_FD,
> > + .maxattr = NETDEV_A_DMABUF_AUTORELEASE,
> > .flags = GENL_ADMIN_PERM | GENL_CMD_CAP_DO,
> > },
> > {
> > diff --git a/net/core/netdev-genl.c b/net/core/netdev-genl.c
> > index 470fabbeacd9..c742bb34865e 100644
> > --- a/net/core/netdev-genl.c
> > +++ b/net/core/netdev-genl.c
> > @@ -939,6 +939,7 @@ int netdev_nl_bind_rx_doit(struct sk_buff *skb, struct genl_info *info)
> > struct netdev_nl_sock *priv;
> > struct net_device *netdev;
> > unsigned long *rxq_bitmap;
> > + bool autorelease = false;
> > struct device *dma_dev;
> > struct sk_buff *rsp;
> > int err = 0;
> > @@ -952,6 +953,10 @@ int netdev_nl_bind_rx_doit(struct sk_buff *skb, struct genl_info *info)
> > ifindex = nla_get_u32(info->attrs[NETDEV_A_DEV_IFINDEX]);
> > dmabuf_fd = nla_get_u32(info->attrs[NETDEV_A_DMABUF_FD]);
> >
> > + if (info->attrs[NETDEV_A_DMABUF_AUTORELEASE])
> > + autorelease =
> > + !!nla_get_u8(info->attrs[NETDEV_A_DMABUF_AUTORELEASE]);
> > +
> > priv = genl_sk_priv_get(&netdev_nl_family, NETLINK_CB(skb).sk);
> > if (IS_ERR(priv))
> > return PTR_ERR(priv);
> > @@ -1002,7 +1007,8 @@ int netdev_nl_bind_rx_doit(struct sk_buff *skb, struct genl_info *info)
> > }
> >
> > binding = net_devmem_bind_dmabuf(netdev, dma_dev, DMA_FROM_DEVICE,
> > - dmabuf_fd, priv, info->extack);
> > + dmabuf_fd, priv, info->extack,
> > + autorelease);
> > if (IS_ERR(binding)) {
> > err = PTR_ERR(binding);
> > goto err_rxq_bitmap;
> > @@ -1097,7 +1103,7 @@ int netdev_nl_bind_tx_doit(struct sk_buff *skb, struct genl_info *info)
> >
> > dma_dev = netdev_queue_get_dma_dev(netdev, 0);
> > binding = net_devmem_bind_dmabuf(netdev, dma_dev, DMA_TO_DEVICE,
> > - dmabuf_fd, priv, info->extack);
> > + dmabuf_fd, priv, info->extack, false);
> > if (IS_ERR(binding)) {
> > err = PTR_ERR(binding);
> > goto err_unlock_netdev;
> > diff --git a/net/core/sock.c b/net/core/sock.c
> > index f6526f43aa6e..6355c2ccfb8a 100644
> > --- a/net/core/sock.c
> > +++ b/net/core/sock.c
> > @@ -87,6 +87,7 @@
> >
> > #include <linux/unaligned.h>
> > #include <linux/capability.h>
> > +#include <linux/dma-buf.h>
> > #include <linux/errno.h>
> > #include <linux/errqueue.h>
> > #include <linux/types.h>
> > @@ -151,6 +152,7 @@
> > #include <uapi/linux/pidfd.h>
> >
> > #include "dev.h"
> > +#include "devmem.h"
> >
> > static DEFINE_MUTEX(proto_list_mutex);
> > static LIST_HEAD(proto_list);
> > @@ -1081,6 +1083,44 @@ static int sock_reserve_memory(struct sock *sk, int bytes)
> > #define MAX_DONTNEED_TOKENS 128
> > #define MAX_DONTNEED_FRAGS 1024
> >
> > +static noinline_for_stack int
> > +sock_devmem_dontneed_manual_release(struct sock *sk,
> > + struct dmabuf_token *tokens,
> > + unsigned int num_tokens)
> > +{
> > + struct net_iov *niov;
> > + unsigned int i, j;
> > + netmem_ref netmem;
> > + unsigned int token;
> > + int num_frags = 0;
> > + int ret = 0;
> > +
> > + if (!sk->sk_devmem_info.binding)
> > + return -EINVAL;
> > +
> > + for (i = 0; i < num_tokens; i++) {
> > + for (j = 0; j < tokens[i].token_count; j++) {
> > + size_t size = sk->sk_devmem_info.binding->dmabuf->size;
> > +
> > + token = tokens[i].token_start + j;
> > + if (token >= size / PAGE_SIZE)
> > + break;
> > +
> > + if (++num_frags > MAX_DONTNEED_FRAGS)
> > + return ret;
> > +
> > + niov = sk->sk_devmem_info.binding->vec[token];
> > + if (atomic_dec_and_test(&niov->uref)) {
> > + netmem = net_iov_to_netmem(niov);
> > + WARN_ON_ONCE(!napi_pp_put_page(netmem));
> > + }
> > + ret++;
> > + }
> > + }
> > +
> > + return ret;
> > +}
> > +
> > static noinline_for_stack int
> > sock_devmem_dontneed_autorelease(struct sock *sk, struct dmabuf_token *tokens,
> > unsigned int num_tokens)
> > @@ -1089,32 +1129,33 @@ sock_devmem_dontneed_autorelease(struct sock *sk, struct dmabuf_token *tokens,
> > int ret = 0, num_frags = 0;
> > netmem_ref netmems[16];
> >
> > - xa_lock_bh(&sk->sk_user_frags);
> > + xa_lock_bh(&sk->sk_devmem_info.frags);
> > for (i = 0; i < num_tokens; i++) {
> > for (j = 0; j < tokens[i].token_count; j++) {
> > if (++num_frags > MAX_DONTNEED_FRAGS)
> > goto frag_limit_reached;
> >
> > netmem_ref netmem = (__force netmem_ref)__xa_erase(
> > - &sk->sk_user_frags, tokens[i].token_start + j);
> > + &sk->sk_devmem_info.frags,
> > + tokens[i].token_start + j);
> >
> > if (!netmem || WARN_ON_ONCE(!netmem_is_net_iov(netmem)))
> > continue;
> >
> > netmems[netmem_num++] = netmem;
> > if (netmem_num == ARRAY_SIZE(netmems)) {
> > - xa_unlock_bh(&sk->sk_user_frags);
> > + xa_unlock_bh(&sk->sk_devmem_info.frags);
> > for (k = 0; k < netmem_num; k++)
> > WARN_ON_ONCE(!napi_pp_put_page(netmems[k]));
> > netmem_num = 0;
> > - xa_lock_bh(&sk->sk_user_frags);
> > + xa_lock_bh(&sk->sk_devmem_info.frags);
> > }
> > ret++;
> > }
> > }
> >
> > frag_limit_reached:
> > - xa_unlock_bh(&sk->sk_user_frags);
> > + xa_unlock_bh(&sk->sk_devmem_info.frags);
> > for (k = 0; k < netmem_num; k++)
> > WARN_ON_ONCE(!napi_pp_put_page(netmems[k]));
> >
> > @@ -1145,7 +1186,11 @@ sock_devmem_dontneed(struct sock *sk, sockptr_t optval, unsigned int optlen)
> > return -EFAULT;
> > }
> >
> > - ret = sock_devmem_dontneed_autorelease(sk, tokens, num_tokens);
> > + if (net_devmem_autorelease_enabled())
> > + ret = sock_devmem_dontneed_autorelease(sk, tokens, num_tokens);
> > + else
> > + ret = sock_devmem_dontneed_manual_release(sk, tokens,
> > + num_tokens);
> >
> > kvfree(tokens);
> > return ret;
> > diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> > index d5319ebe2452..a8a4af552909 100644
> > --- a/net/ipv4/tcp.c
> > +++ b/net/ipv4/tcp.c
> > @@ -260,6 +260,7 @@
> > #include <linux/memblock.h>
> > #include <linux/highmem.h>
> > #include <linux/cache.h>
> > +#include <linux/dma-buf.h>
> > #include <linux/err.h>
> > #include <linux/time.h>
> > #include <linux/slab.h>
> > @@ -492,7 +493,8 @@ void tcp_init_sock(struct sock *sk)
> >
> > set_bit(SOCK_SUPPORT_ZC, &sk->sk_socket->flags);
> > sk_sockets_allocated_inc(sk);
> > - xa_init_flags(&sk->sk_user_frags, XA_FLAGS_ALLOC1);
> > + xa_init_flags(&sk->sk_devmem_info.frags, XA_FLAGS_ALLOC1);
> > + sk->sk_devmem_info.binding = NULL;
> > }
> > EXPORT_IPV6_MOD(tcp_init_sock);
> >
> > @@ -2424,11 +2426,12 @@ static void tcp_xa_pool_commit_locked(struct sock *sk, struct tcp_xa_pool *p)
> >
> > /* Commit part that has been copied to user space. */
> > for (i = 0; i < p->idx; i++)
> > - __xa_cmpxchg(&sk->sk_user_frags, p->tokens[i], XA_ZERO_ENTRY,
> > - (__force void *)p->netmems[i], GFP_KERNEL);
> > + __xa_cmpxchg(&sk->sk_devmem_info.frags, p->tokens[i],
> > + XA_ZERO_ENTRY, (__force void *)p->netmems[i],
> > + GFP_KERNEL);
> > /* Rollback what has been pre-allocated and is no longer needed. */
> > for (; i < p->max; i++)
> > - __xa_erase(&sk->sk_user_frags, p->tokens[i]);
> > + __xa_erase(&sk->sk_devmem_info.frags, p->tokens[i]);
> >
> > p->max = 0;
> > p->idx = 0;
> > @@ -2436,14 +2439,18 @@ static void tcp_xa_pool_commit_locked(struct sock *sk, struct tcp_xa_pool *p)
> >
> > static void tcp_xa_pool_commit(struct sock *sk, struct tcp_xa_pool *p)
> > {
> > + /* Skip xarray operations if autorelease is disabled (manual mode) */
> > + if (!net_devmem_autorelease_enabled())
> > + return;
> > +
> > if (!p->max)
> > return;
> >
> > - xa_lock_bh(&sk->sk_user_frags);
> > + xa_lock_bh(&sk->sk_devmem_info.frags);
> >
> > tcp_xa_pool_commit_locked(sk, p);
> >
> > - xa_unlock_bh(&sk->sk_user_frags);
> > + xa_unlock_bh(&sk->sk_devmem_info.frags);
> > }
> >
> > static int tcp_xa_pool_refill(struct sock *sk, struct tcp_xa_pool *p,
> > @@ -2454,24 +2461,41 @@ static int tcp_xa_pool_refill(struct sock *sk, struct tcp_xa_pool *p,
> > if (p->idx < p->max)
> > return 0;
> >
> > - xa_lock_bh(&sk->sk_user_frags);
> > + xa_lock_bh(&sk->sk_devmem_info.frags);
> >
> > tcp_xa_pool_commit_locked(sk, p);
> >
> > for (k = 0; k < max_frags; k++) {
> > - err = __xa_alloc(&sk->sk_user_frags, &p->tokens[k],
> > + err = __xa_alloc(&sk->sk_devmem_info.frags, &p->tokens[k],
> > XA_ZERO_ENTRY, xa_limit_31b, GFP_KERNEL);
> > if (err)
> > break;
> > }
> >
> > - xa_unlock_bh(&sk->sk_user_frags);
> > + xa_unlock_bh(&sk->sk_devmem_info.frags);
> >
> > p->max = k;
> > p->idx = 0;
> > return k ? 0 : err;
> > }
> >
> > +static void tcp_xa_pool_inc_pp_ref_count(struct tcp_xa_pool *tcp_xa_pool,
> > + skb_frag_t *frag)
> > +{
> > + struct net_iov *niov;
> > +
> > + niov = skb_frag_net_iov(frag);
> > +
> > + if (net_devmem_autorelease_enabled()) {
> > + atomic_long_inc(&niov->desc.pp_ref_count);
> > + tcp_xa_pool->netmems[tcp_xa_pool->idx++] =
> > + skb_frag_netmem(frag);
> > + } else {
> > + if (atomic_inc_return(&niov->uref) == 1)
> > + atomic_long_inc(&niov->desc.pp_ref_count);
> > + }
> > +}
> > +
> > /* On error, returns the -errno. On success, returns number of bytes sent to the
> > * user. May not consume all of @remaining_len.
> > */
> > @@ -2533,6 +2557,7 @@ static int tcp_recvmsg_dmabuf(struct sock *sk, const struct sk_buff *skb,
> > * sequence of cmsg
> > */
> > for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
> > + struct net_devmem_dmabuf_binding *binding = NULL;
> > skb_frag_t *frag = &skb_shinfo(skb)->frags[i];
> > struct net_iov *niov;
> > u64 frag_offset;
> > @@ -2568,13 +2593,35 @@ static int tcp_recvmsg_dmabuf(struct sock *sk, const struct sk_buff *skb,
> > start;
> > dmabuf_cmsg.frag_offset = frag_offset;
> > dmabuf_cmsg.frag_size = copy;
> > - err = tcp_xa_pool_refill(sk, &tcp_xa_pool,
> > - skb_shinfo(skb)->nr_frags - i);
> > - if (err)
> > +
> > + binding = net_devmem_iov_binding(niov);
> > +
> > + if (!sk->sk_devmem_info.binding) {
> > + net_devmem_dmabuf_binding_get(binding);
> > + sk->sk_devmem_info.binding = binding;
> > + }
> > +
> > + if (sk->sk_devmem_info.binding != binding) {
> > + err = -EFAULT;
> > goto out;
> > + }
> > +
> > + if (net_devmem_autorelease_enabled()) {
> > + err = tcp_xa_pool_refill(sk,
> > + &tcp_xa_pool,
> > + skb_shinfo(skb)->nr_frags - i);
> > + if (err)
> > + goto out;
> > +
> > + dmabuf_cmsg.frag_token =
> > + tcp_xa_pool.tokens[tcp_xa_pool.idx];
> > + } else {
> > + dmabuf_cmsg.frag_token =
> > + net_iov_virtual_addr(niov) >> PAGE_SHIFT;
> > + }
> > +
> >
> > /* Will perform the exchange later */
> > - dmabuf_cmsg.frag_token = tcp_xa_pool.tokens[tcp_xa_pool.idx];
> > dmabuf_cmsg.dmabuf_id = net_devmem_iov_binding_id(niov);
> >
> > offset += copy;
> > @@ -2587,8 +2634,7 @@ static int tcp_recvmsg_dmabuf(struct sock *sk, const struct sk_buff *skb,
> > if (err)
> > goto out;
> >
> > - atomic_long_inc(&niov->desc.pp_ref_count);
> > - tcp_xa_pool.netmems[tcp_xa_pool.idx++] = skb_frag_netmem(frag);
> > + tcp_xa_pool_inc_pp_ref_count(&tcp_xa_pool, frag);
> >
> > sent += copy;
> >
> > diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> > index f8a9596e8f4d..7b1b5a17002f 100644
> > --- a/net/ipv4/tcp_ipv4.c
> > +++ b/net/ipv4/tcp_ipv4.c
> > @@ -89,6 +89,9 @@
> >
> > #include <crypto/md5.h>
> >
> > +#include <linux/dma-buf.h>
> > +#include "../core/devmem.h"
> > +
> > #include <trace/events/tcp.h>
> >
> > #ifdef CONFIG_TCP_MD5SIG
> > @@ -2492,7 +2495,7 @@ static void tcp_release_user_frags(struct sock *sk)
> > unsigned long index;
> > void *netmem;
> >
> > - xa_for_each(&sk->sk_user_frags, index, netmem)
> > + xa_for_each(&sk->sk_devmem_info.frags, index, netmem)
> > WARN_ON_ONCE(!napi_pp_put_page((__force netmem_ref)netmem));
> > #endif
> > }
> > @@ -2503,7 +2506,11 @@ void tcp_v4_destroy_sock(struct sock *sk)
> >
> > tcp_release_user_frags(sk);
> >
> > - xa_destroy(&sk->sk_user_frags);
> > + xa_destroy(&sk->sk_devmem_info.frags);
> > + if (sk->sk_devmem_info.binding) {
> > + net_devmem_dmabuf_binding_put(sk->sk_devmem_info.binding);
>
> I don't understand the refcounting relationship between the binding
> and the socket when autorelease=off. It seems you're grabbing a
> reference on the binding on every recvmsg(), but only dropping the
> reference on the binding once here, so the binding is never freed.
We grab a ref only once, the first time sk_devmem_info.binding is set,
and release it right before setting sk_devmem_info.binding back to NULL.
This is only done to avoid a use-after-free on this pointer in the
following situation:
1. the socket releases all of its tokens
2. the dmabuf is unbound (the ref count reaches zero, so it is freed)
3. the socket sends some bad token to dontneed, which dereferences the
now-freed sk_devmem_info.binding
The other option was for bindings to keep a list of the sockets using
them, or something like that... but I think this is a good use case for
the ref count.
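
To spell out the pairing (both hunks are in this patch):

	/* tcp_recvmsg_dmabuf(): take a single ref the first time this
	 * socket sees the binding
	 */
	if (!sk->sk_devmem_info.binding) {
		net_devmem_dmabuf_binding_get(binding);
		sk->sk_devmem_info.binding = binding;
	}

	/* tcp_v4_destroy_sock(): drop that single ref */
	if (sk->sk_devmem_info.binding) {
		net_devmem_dmabuf_binding_put(sk->sk_devmem_info.binding);
		sk->sk_devmem_info.binding = NULL;
	}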
>
> > + sk->sk_devmem_info.binding = NULL;
> > + }
> >
> > trace_tcp_destroy_sock(sk);
> >
> > diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
> > index bd5462154f97..2aec977f5c12 100644
> > --- a/net/ipv4/tcp_minisocks.c
> > +++ b/net/ipv4/tcp_minisocks.c
> > @@ -662,7 +662,8 @@ struct sock *tcp_create_openreq_child(const struct sock *sk,
> >
> > __TCP_INC_STATS(sock_net(sk), TCP_MIB_PASSIVEOPENS);
> >
> > - xa_init_flags(&newsk->sk_user_frags, XA_FLAGS_ALLOC1);
> > + xa_init_flags(&newsk->sk_devmem_info.frags, XA_FLAGS_ALLOC1);
> > + newsk->sk_devmem_info.binding = NULL;
> >
> > return newsk;
> > }
> > diff --git a/tools/include/uapi/linux/netdev.h b/tools/include/uapi/linux/netdev.h
> > index e0b579a1df4f..1e5c209cb998 100644
> > --- a/tools/include/uapi/linux/netdev.h
> > +++ b/tools/include/uapi/linux/netdev.h
> > @@ -207,6 +207,7 @@ enum {
> > NETDEV_A_DMABUF_QUEUES,
> > NETDEV_A_DMABUF_FD,
> > NETDEV_A_DMABUF_ID,
> > + NETDEV_A_DMABUF_AUTORELEASE,
> >
> > __NETDEV_A_DMABUF_MAX,
> > NETDEV_A_DMABUF_MAX = (__NETDEV_A_DMABUF_MAX - 1)
> >
> > --
> > 2.47.3
> >
>
>
> --
> Thanks,
> Mina
Thanks for the review, Mina!
Best,
Bobby