Message-ID: <CAHS8izOnir-YHPH+R0cQzu1i0HD2Z0csW6qUT8FFXh1PkmHabQ@mail.gmail.com>
Date: Tue, 28 Oct 2025 19:04:15 -0700
From: Mina Almasry <almasrymina@...gle.com>
To: Bobby Eshleman <bobbyeshleman@...il.com>
Cc: "David S. Miller" <davem@...emloft.net>, Eric Dumazet <edumazet@...gle.com>,
Jakub Kicinski <kuba@...nel.org>, Paolo Abeni <pabeni@...hat.com>, Simon Horman <horms@...nel.org>,
Kuniyuki Iwashima <kuniyu@...gle.com>, Willem de Bruijn <willemb@...gle.com>,
Neal Cardwell <ncardwell@...gle.com>, David Ahern <dsahern@...nel.org>,
Stanislav Fomichev <sdf@...ichev.me>, netdev@...r.kernel.org, linux-kernel@...r.kernel.org,
Bobby Eshleman <bobbyeshleman@...a.com>
Subject: Re: [PATCH net-next v5 3/4] net: devmem: use niov array for token management
On Tue, Oct 28, 2025 at 1:49 PM Bobby Eshleman <bobbyeshleman@...il.com> wrote:
...
> > > @@ -307,6 +331,7 @@ net_devmem_bind_dmabuf(struct net_device *dev,
> > > goto err_free_chunks;
> > >
> > > list_add(&binding->list, &priv->bindings);
> > > + binding->autorelease = true;
> > >
> >
> > So autorelease is indeed a property of the binding. Not sure why a
> > copy exists in sk_devmem_info. Perf optimization to reduce pointer
> > chasing?
> >
>
> Just stale code from prior design... Originally, I was going to try to
> allow the autorelease == true case to be free of the
> one-binding-per-socket restriction, in which case sk_devmem_info.binding
> would be NULL (or otherwise meaningless). sk_devmem_info.autorelease
> allowed sock_devmem_dontneed to choose the right path even when
> sk_devmem_info.binding == NULL.
>
> ...but then I realized we still needed some restriction to prevent sockets
> from steering into different dmabufs with different autorelease configs,
> so I kept the one-binding restriction for both modes. I abandoned the
> effort but forgot to revert this change.
>
> Now I'm realizing that we could relax the restriction more though... We
> could allow sockets to steer into other bindings if they all have the
> same autorelease value? Then we could still use
> sk_devmem_info.binding->autorelease in the sock_devmem_dontneed path and
> relax the restriction to "steering must only be to bindings of the same
> autorelease mode"?
>
Hmpf. I indeed forgot to think through the case where, for some
god-forsaken reason, we have bindings on the system with different
autorelease values.
But now that I think about it more, I don't fully grasp why that would
be a problem. I think we can make it all work by making autorelease a
property of the socket, not the binding:
So if sk->devmem_info.autorelease is on, in recvmsg we store the
token in the xarray and dontneed frees from the xarray (both can check
sk->devmem_info.autorelease).
If sk->devmem_info.autorelease is off, then in recvmsg we grab the
binding from sk->devmem_info.binding and do a uref inc and a netmem
get ref, then in dontneed we dec the uref and napi_pp_put_page if needed.
The side effect of that is that, for the same binding, we may
simultaneously have refs in the socket xarrays and urefs on the
binding's niovs, because data landing on the binding may belong to
sockets with sk->devmem_info.autorelease on and to sockets with it off,
but I don't immediately see why that would be a problem. The xarray
refs would be removed on socket close, and the urefs would be dropped
at unbind.
Doesn't it all work? Or am I insane?
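Roughly, as an untested sketch (assuming a socket-level
sk->sk_devmem_info.autorelease flag, which this series doesn't have;
the helper names below are made up, and the netmem-ref helper is
whatever we end up using):

/* recvmsg side: hand the token for this niov to userspace */
static int sk_devmem_store_token(struct sock *sk, struct net_iov *niov,
				 unsigned int token)
{
	if (sk->sk_devmem_info.autorelease) {
		/* old behavior: per-socket tracking in the xarray,
		 * released automatically on close(2)
		 */
		return xa_err(xa_store_bh(&sk->sk_devmem_info.frags, token,
					  (__force void *)net_iov_to_netmem(niov),
					  GFP_ATOMIC));
	}

	/* new behavior: uref inc plus a netmem ref; lingering refs get
	 * dropped at unbind
	 */
	atomic_inc(&niov->uref);
	get_netmem(net_iov_to_netmem(niov));
	return 0;
}

/* SO_DEVMEM_DONTNEED side, per token */
static void sk_devmem_release_token(struct sock *sk,
				    struct net_devmem_dmabuf_binding *binding,
				    unsigned int token)
{
	if (sk->sk_devmem_info.autorelease) {
		netmem_ref netmem = (__force netmem_ref)
			xa_erase_bh(&sk->sk_devmem_info.frags, token);

		if (netmem)
			WARN_ON_ONCE(!napi_pp_put_page(netmem));
		return;
	}

	if (atomic_dec_and_test(&binding->vec[token]->uref))
		WARN_ON_ONCE(!napi_pp_put_page(
				net_iov_to_netmem(binding->vec[token])));
}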
> > I thought autorelease would be 'false' by default. I.e. the user gets
> > better perf by default. Maybe this is flipped in the following patch I
> > haven't opened yet.
> >
>
> That's right, it is in the next patch. I didn't want to change the
> default without also adding the ability to switch back.
>
> > > return binding;
> > >
> > > diff --git a/net/core/devmem.h b/net/core/devmem.h
> > > index 2ada54fb63d7..7662e9e42c35 100644
> > > --- a/net/core/devmem.h
> > > +++ b/net/core/devmem.h
> > > @@ -61,11 +61,20 @@ struct net_devmem_dmabuf_binding {
> > >
> > > /* Array of net_iov pointers for this binding, sorted by virtual
> > > * address. This array is convenient to map the virtual addresses to
> > > - * net_iovs in the TX path.
> > > + * net_iovs.
> > > */
> > > struct net_iov **vec;
> > >
> > > struct work_struct unbind_w;
> > > +
> > > + /* If true, outstanding tokens will be automatically released upon each
> > > + * socket's close(2).
> > > + *
> > > + * If false, then sockets are responsible for releasing tokens before
> > > + * close(2). The kernel will only release lingering tokens when the
> > > + * dmabuf is unbound.
> > > + */
> >
> > Needs devmem.rst doc update.
> >
>
> Will do.
>
> > > + bool autorelease;
> > > };
> > >
> > > #if defined(CONFIG_NET_DEVMEM)
> > > diff --git a/net/core/sock.c b/net/core/sock.c
> > > index e7b378753763..595b5a858d03 100644
> > > --- a/net/core/sock.c
> > > +++ b/net/core/sock.c
> > > @@ -87,6 +87,7 @@
> > >
> > > #include <linux/unaligned.h>
> > > #include <linux/capability.h>
> > > +#include <linux/dma-buf.h>
> > > #include <linux/errno.h>
> > > #include <linux/errqueue.h>
> > > #include <linux/types.h>
> > > @@ -151,6 +152,7 @@
> > > #include <uapi/linux/pidfd.h>
> > >
> > > #include "dev.h"
> > > +#include "devmem.h"
> > >
> > > static DEFINE_MUTEX(proto_list_mutex);
> > > static LIST_HEAD(proto_list);
> > > @@ -1081,6 +1083,57 @@ static int sock_reserve_memory(struct sock *sk, int bytes)
> > > #define MAX_DONTNEED_TOKENS 128
> > > #define MAX_DONTNEED_FRAGS 1024
> > >
> > > +static noinline_for_stack int
> > > +sock_devmem_dontneed_manual_release(struct sock *sk, struct dmabuf_token *tokens,
> > > + unsigned int num_tokens)
> > > +{
> > > + unsigned int netmem_num = 0;
> > > + int ret = 0, num_frags = 0;
> > > + netmem_ref netmems[16];
> > > + struct net_iov *niov;
> > > + unsigned int i, j, k;
> > > +
> > > + for (i = 0; i < num_tokens; i++) {
> > > + for (j = 0; j < tokens[i].token_count; j++) {
> > > + struct net_iov *niov;
> > > + unsigned int token;
> > > + netmem_ref netmem;
> > > +
> > > + token = tokens[i].token_start + j;
> > > + if (token >= sk->sk_devmem_info.binding->dmabuf->size / PAGE_SIZE)
> > > + break;
> > > +
> >
> > This requires some thought. The correct thing to do here, I think, is
> > to return -EINVAL without modifying the urefs at all. You may need an
> > input-verification loop. Breaking out and returning success here is
> > not great.
> >
>
> Should this also be changed for the other path? Right now if
> __xa_erase returns NULL (e.g., user passed in a bad token), then we hit
> "continue" and process the next token... eventually just returning the
> number of tokens that were successfully processed and omitting the wrong
> ones.
>
Ugh. I did not notice that :(
I guess the existing dontneed doesn't handle that well anyway. Let's
not fix too much in this series. It's fine to carry that behavior over
into the new implementation and, if anything, improve it in a separate
patch (fine by me, at least). In theory it would be a bit weird for
userspace to be sending us bad tokens anyway.
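If someone does want to tighten this up later, I'd imagine a pre-pass
over the tokens before any state is touched, e.g. something like this
(untested, hypothetical helper, range check only) for the
manual-release path:

/* Reject the whole request with -EINVAL before any uref state is
 * modified, instead of silently skipping out-of-range tokens.
 */
static int sock_devmem_validate_tokens(struct sock *sk,
				       struct dmabuf_token *tokens,
				       unsigned int num_tokens)
{
	u64 max_token = sk->sk_devmem_info.binding->dmabuf->size / PAGE_SIZE;
	unsigned int i;

	for (i = 0; i < num_tokens; i++) {
		if ((u64)tokens[i].token_start + tokens[i].token_count >
		    max_token)
			return -EINVAL;
	}

	return 0;
}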
> > > + if (++num_frags > MAX_DONTNEED_FRAGS)
> > > + goto frag_limit_reached;
> > > + niov = sk->sk_devmem_info.binding->vec[token];
> > > + netmem = net_iov_to_netmem(niov);
> > > +
> > > + if (!netmem || WARN_ON_ONCE(!netmem_is_net_iov(netmem)))
> > > + continue;
> > > +
> >
> > This check is extremely defensive. There is no way netmem is not a
> > netiov (you just converted it). It's also very hard for it to be NULL.
> > Remove maybe.
> >
>
> Removing sounds good to me.
>
> > > + netmems[netmem_num++] = netmem;
> > > + if (netmem_num == ARRAY_SIZE(netmems)) {
> > > + for (k = 0; k < netmem_num; k++) {
> > > + niov = netmem_to_net_iov(netmems[k]);
> >
> > No need to upcast to netmem only to downcast here back to net_iov.
> > Just keep it net_iov?
> >
>
> Sounds good.
>
> > > + if (atomic_dec_and_test(&niov->uref))
> > > + WARN_ON_ONCE(!napi_pp_put_page(netmems[k]));
> >
> > I see. So you're only napi_pp_put_page'ing on the last uref dec, but
> > why? How about you
>
> Just to minimize cache bus noise from the extra atomic.
>
> > > + }
> > > + netmem_num = 0;
> > > + }
> >
> > From a quick look, I don't think you need the netmems[] array or this
> > inner loop.
> >
> > dontneed_autorelease is written in a convoluted way because it acquires
> > a lock, and we wanted to minimize lock acquisitions. You are acquiring no
> > lock here. AFAICT you can just loop over the tokens and free all of
> > them in one (nested) loop.
>
> Oh good point. There is no need to batch at all here.
>
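Right, roughly what I was picturing, untested:

static noinline_for_stack int
sock_devmem_dontneed_manual_release(struct sock *sk,
				    struct dmabuf_token *tokens,
				    unsigned int num_tokens)
{
	struct net_devmem_dmabuf_binding *binding = sk->sk_devmem_info.binding;
	unsigned long max_token = binding->dmabuf->size / PAGE_SIZE;
	int ret = 0, num_frags = 0;
	unsigned int i, j;

	for (i = 0; i < num_tokens; i++) {
		for (j = 0; j < tokens[i].token_count; j++) {
			unsigned int token = tokens[i].token_start + j;
			struct net_iov *niov;

			if (token >= max_token)
				break;
			if (++num_frags > MAX_DONTNEED_FRAGS)
				return ret;

			niov = binding->vec[token];
			if (atomic_dec_and_test(&niov->uref))
				WARN_ON_ONCE(!napi_pp_put_page(
						net_iov_to_netmem(niov)));
			ret++;
		}
	}

	return ret;
}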
> >
> > > + ret++;
> > > + }
> > > + }
> > > +
> > > +frag_limit_reached:
> > > + for (k = 0; k < netmem_num; k++) {
> > > + niov = netmem_to_net_iov(netmems[k]);
> > > + if (atomic_dec_and_test(&niov->uref))
> > > + WARN_ON_ONCE(!napi_pp_put_page(netmems[k]));
> > > + }
> > > +
> > > + return ret;
> > > +}
> > > +
> > > static noinline_for_stack int
> > > sock_devmem_dontneed_autorelease(struct sock *sk, struct dmabuf_token *tokens,
> > > unsigned int num_tokens)
> > > @@ -1089,32 +1142,32 @@ sock_devmem_dontneed_autorelease(struct sock *sk, struct dmabuf_token *tokens,
> > > int ret = 0, num_frags = 0;
> > > netmem_ref netmems[16];
> > >
> > > - xa_lock_bh(&sk->sk_user_frags);
> > > + xa_lock_bh(&sk->sk_devmem_info.frags);
> > > for (i = 0; i < num_tokens; i++) {
> > > for (j = 0; j < tokens[i].token_count; j++) {
> > > if (++num_frags > MAX_DONTNEED_FRAGS)
> > > goto frag_limit_reached;
> > >
> > > netmem_ref netmem = (__force netmem_ref)__xa_erase(
> > > - &sk->sk_user_frags, tokens[i].token_start + j);
> > > + &sk->sk_devmem_info.frags, tokens[i].token_start + j);
> > >
> > > if (!netmem || WARN_ON_ONCE(!netmem_is_net_iov(netmem)))
> > > continue;
> > >
> > > netmems[netmem_num++] = netmem;
> > > if (netmem_num == ARRAY_SIZE(netmems)) {
> > > - xa_unlock_bh(&sk->sk_user_frags);
> > > + xa_unlock_bh(&sk->sk_devmem_info.frags);
> > > for (k = 0; k < netmem_num; k++)
> > > WARN_ON_ONCE(!napi_pp_put_page(netmems[k]));
> > > netmem_num = 0;
> > > - xa_lock_bh(&sk->sk_user_frags);
> > > + xa_lock_bh(&sk->sk_devmem_info.frags);
> > > }
> > > ret++;
> > > }
> > > }
> > >
> > > frag_limit_reached:
> > > - xa_unlock_bh(&sk->sk_user_frags);
> > > + xa_unlock_bh(&sk->sk_devmem_info.frags);
> > > for (k = 0; k < netmem_num; k++)
> > > WARN_ON_ONCE(!napi_pp_put_page(netmems[k]));
> > >
> > > @@ -1135,6 +1188,12 @@ sock_devmem_dontneed(struct sock *sk, sockptr_t optval, unsigned int optlen)
> > > optlen > sizeof(*tokens) * MAX_DONTNEED_TOKENS)
> > > return -EINVAL;
> > >
> > > + /* recvmsg() has never returned a token for this socket, which needs to
> > > + * happen before we know if the dmabuf has autorelease set or not.
> > > + */
> > > + if (!sk->sk_devmem_info.binding)
> > > + return -EINVAL;
> > > +
> >
> > Hmm. At first glance I don't think enforcing this condition if
> > binding->autorelease is necessary, no?
> >
> > If autorelease is on, then we track the tokens the old way, and we
> > don't need a binding, no? If it's off, then we need an associated
> > binding, to look up the urefs array.
> >
>
> We at least need the binding to know if binding->autorelease is on,
> since without that we don't know whether the tokens are in the xarray or
> binding->vec... but I guess we could also check if the xarray is
> non-empty and infer that autorelease == true from that?
>
As above, I think that if autorelease is (only) a property of the
socket, then the xarray path works without introducing the
socket-to-binding mapping restriction, yes?
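i.e. the dispatch in sock_devmem_dontneed could look something like
this (sketch, again assuming the hypothetical socket-level flag):

	/* The xarray path needs no binding; only manual release does. */
	if (sk->sk_devmem_info.autorelease)
		ret = sock_devmem_dontneed_autorelease(sk, tokens, num_tokens);
	else if (sk->sk_devmem_info.binding)
		ret = sock_devmem_dontneed_manual_release(sk, tokens,
							  num_tokens);
	else
		ret = -EINVAL;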
--
Thanks,
Mina