Message-ID: <CAHS8izNjdDwtf-Zb+wbmWW4k6+9=fnpY4XO_G=xMu4M-TaMw5Q@mail.gmail.com>
Date: Mon, 24 Mar 2025 12:46:23 -0700
From: Mina Almasry <almasrymina@...gle.com>
To: Stanislav Fomichev <stfomichev@...il.com>
Cc: netdev@...r.kernel.org, linux-kernel@...r.kernel.org, 
	linux-doc@...r.kernel.org, kvm@...r.kernel.org, 
	virtualization@...ts.linux.dev, linux-kselftest@...r.kernel.org, 
	Donald Hunter <donald.hunter@...il.com>, Jakub Kicinski <kuba@...nel.org>, 
	"David S. Miller" <davem@...emloft.net>, Eric Dumazet <edumazet@...gle.com>, 
	Paolo Abeni <pabeni@...hat.com>, Simon Horman <horms@...nel.org>, Jonathan Corbet <corbet@....net>, 
	Andrew Lunn <andrew+netdev@...n.ch>, Jeroen de Borst <jeroendb@...gle.com>, 
	Harshitha Ramamurthy <hramamurthy@...gle.com>, Kuniyuki Iwashima <kuniyu@...zon.com>, 
	Willem de Bruijn <willemb@...gle.com>, David Ahern <dsahern@...nel.org>, 
	Neal Cardwell <ncardwell@...gle.com>, Stefan Hajnoczi <stefanha@...hat.com>, 
	Stefano Garzarella <sgarzare@...hat.com>, "Michael S. Tsirkin" <mst@...hat.com>, Jason Wang <jasowang@...hat.com>, 
	Xuan Zhuo <xuanzhuo@...ux.alibaba.com>, Eugenio Pérez <eperezma@...hat.com>, 
	Shuah Khan <shuah@...nel.org>, sdf@...ichev.me, asml.silence@...il.com, dw@...idwei.uk, 
	Jamal Hadi Salim <jhs@...atatu.com>, Victor Nogueira <victor@...atatu.com>, 
	Pedro Tammela <pctammela@...atatu.com>, Samiullah Khawaja <skhawaja@...gle.com>, 
	Kaiyuan Zhang <kaiyuanz@...gle.com>
Subject: Re: [PATCH net-next v7 4/9] net: devmem: Implement TX path

On Mon, Mar 24, 2025 at 9:25 AM Stanislav Fomichev <stfomichev@...il.com> wrote:
>
> On 03/08, Mina Almasry wrote:
> > Augment dmabuf binding to be able to handle TX. In addition to all the RX
> > binding state, we also create the tx_vec needed for the TX path.
> >
> > Provide an API for sendmsg to be able to send from dmabufs bound to this device:
> >
> > - Provide a new dmabuf_tx_cmsg which includes the dmabuf to send from.
> > - MSG_ZEROCOPY with SCM_DEVMEM_DMABUF cmsg indicates send from dma-buf.
> >
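
[Not part of the patch: a hedged userspace-side sketch of what the cmsg
interface above implies. It assumes the dmabuf id comes from the separate
netlink TX-bind step, that the kernel uapi headers in use define
SCM_DEVMEM_DMABUF and MSG_ZEROCOPY, and that SO_ZEROCOPY has already been
enabled on the socket. Per the patch, iov_base carries a byte offset into
the bound dma-buf rather than a pointer, and the cmsg payload is just the
__u32 dmabuf id.]

#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <linux/types.h>

static ssize_t send_from_dmabuf(int fd, __u32 dmabuf_id,
				size_t dmabuf_off, size_t len)
{
	/* iov_base is not a pointer here: the TX path interprets it as a
	 * byte offset into the bound dma-buf to send from.
	 */
	struct iovec iov = {
		.iov_base = (void *)dmabuf_off,
		.iov_len = len,
	};
	char ctrl[CMSG_SPACE(sizeof(__u32))] = {0};
	struct msghdr msg = {
		.msg_iov = &iov,
		.msg_iovlen = 1,
		.msg_control = ctrl,
		.msg_controllen = sizeof(ctrl),
	};
	struct cmsghdr *cm = CMSG_FIRSTHDR(&msg);

	/* The cmsg payload is the __u32 dmabuf id obtained when binding
	 * the dma-buf for TX.
	 */
	cm->cmsg_level = SOL_SOCKET;
	cm->cmsg_type = SCM_DEVMEM_DMABUF;
	cm->cmsg_len = CMSG_LEN(sizeof(__u32));
	memcpy(CMSG_DATA(cm), &dmabuf_id, sizeof(dmabuf_id));

	/* MSG_ZEROCOPY is required; completion notifications arrive on the
	 * socket error queue as with regular MSG_ZEROCOPY sends.
	 */
	return sendmsg(fd, &msg, MSG_ZEROCOPY);
}
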
> > Devmem is uncopyable, so piggyback off the existing MSG_ZEROCOPY
> > implementation, while disabling instances where MSG_ZEROCOPY falls back
> > to copying.
> >
> > We additionally pipe the binding down to the new
> > zerocopy_fill_skb_from_devmem which fills a TX skb with net_iov netmems
> > instead of the traditional page netmems.
> >
> > We also special-case skb_frag_dma_map to return the DMA address of these
> > dmabuf net_iovs instead of attempting to map pages.
> >
> > The TX path may release the dmabuf in a context where we cannot wait.
> > This happens when the user unbinds a TX dmabuf while there are still
> > references to its netmems in the TX path. In that case, the netmems will
> > be put_netmem'd from a context where we can't unmap the dmabuf. Resolve
> > this by making __net_devmem_dmabuf_binding_free schedule_work'd.
> >
> > Based on work by Stanislav Fomichev <sdf@...ichev.me>. A lot of the meat
> > of the implementation came from devmem TCP RFC v1[1], which included the
> > TX path, but Stan did all the rebasing on top of netmem/net_iov.
> >
> > Cc: Stanislav Fomichev <sdf@...ichev.me>
> > Signed-off-by: Kaiyuan Zhang <kaiyuanz@...gle.com>
> > Signed-off-by: Mina Almasry <almasrymina@...gle.com>
> > Acked-by: Stanislav Fomichev <sdf@...ichev.me>
> >
> > ---
> >
> > v6:
> > - Retain behavior that MSG_FASTOPEN succeeds even if cmsg is invalid
> >   (Paolo).
> > - Rework the freeing of tx_vec slightly to improve readability. Now it
> >   has its own err label (Paolo).
> > - Squash making unbinding scheduled work (Paolo).
> > - Add comment to clarify that net_iovs stuck in the transmit path hold
> >   a ref on the underlying dmabuf binding (David).
> > - Fix the comment on how binding refcounting works on RX (the comment
> >   was not matching the existing code behavior).
> >
> > v5:
> > - Return -EFAULT from zerocopy_fill_skb_from_devmem (Stan)
> > - don't null check before kvfree (stan).
> >
> > v4:
> > - Remove dmabuf_tx_cmsg definition and just use __u32 for the dma-buf id
> >   (Willem).
> > - Check that iov_iter_type() is ITER_IOVEC in
> >   zerocopy_fill_skb_from_iter() (Pavel).
> > - Fix binding->tx_vec not being freed on error paths (Paolo).
> > - Make devmem patch mutually exclusive with msg->ubuf_info path (Pavel).
> > - Check that MSG_ZEROCOPY and SOCK_ZEROCOPY are provided when
> >   sockc.dmabuf_id is provided.
> > - Don't mm_account_pinned_pages() on devmem TX (Pavel).
> >
> > v3:
> > - Use kvmalloc_array instead of kcalloc (Stan).
> > - Fix unreachable code warning (Simon).
> >
> > v2:
> > - Remove dmabuf_offset from the dmabuf cmsg.
> > - Update zerocopy_fill_skb_from_devmem to interpret the
> >   iov_base/iter_iov_addr as the offset into the dmabuf to send from
> >   (Stan).
> > - Remove the confusing binding->tx_iter which is not needed if we
> >   interpret the iov_base/iter_iov_addr as offset into the dmabuf (Stan).
> > - Remove check for binding->sgt and binding->sgt->nents in dmabuf
> >   binding.
> > - Simplify the calculation of binding->tx_vec.
> > - Check in net_devmem_get_binding that the binding we're returning
> >   has ifindex matching the sending socket (Willem).
> > ---
> >  include/linux/skbuff.h                  |  17 +++-
> >  include/net/sock.h                      |   1 +
> >  net/core/datagram.c                     |  48 ++++++++++-
> >  net/core/devmem.c                       | 105 ++++++++++++++++++++++--
> >  net/core/devmem.h                       |  61 +++++++++++---
> >  net/core/netdev-genl.c                  |  64 ++++++++++++++-
> >  net/core/skbuff.c                       |  18 ++--
> >  net/core/sock.c                         |   6 ++
> >  net/ipv4/ip_output.c                    |   3 +-
> >  net/ipv4/tcp.c                          |  50 ++++++++---
> >  net/ipv6/ip6_output.c                   |   3 +-
> >  net/vmw_vsock/virtio_transport_common.c |   5 +-
> >  12 files changed, 330 insertions(+), 51 deletions(-)
> >
> > diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> > index 14517e95a46c..67a7e069a9bf 100644
> > --- a/include/linux/skbuff.h
> > +++ b/include/linux/skbuff.h
> > @@ -1707,13 +1707,16 @@ static inline void skb_set_end_offset(struct sk_buff *skb, unsigned int offset)
> >  extern const struct ubuf_info_ops msg_zerocopy_ubuf_ops;
> >
> >  struct ubuf_info *msg_zerocopy_realloc(struct sock *sk, size_t size,
> > -                                    struct ubuf_info *uarg);
> > +                                    struct ubuf_info *uarg, bool devmem);
> >
> >  void msg_zerocopy_put_abort(struct ubuf_info *uarg, bool have_uref);
> >
> > +struct net_devmem_dmabuf_binding;
> > +
> >  int __zerocopy_sg_from_iter(struct msghdr *msg, struct sock *sk,
> >                           struct sk_buff *skb, struct iov_iter *from,
> > -                         size_t length);
> > +                         size_t length,
> > +                         struct net_devmem_dmabuf_binding *binding);
> >
> >  int zerocopy_fill_skb_from_iter(struct sk_buff *skb,
> >                               struct iov_iter *from, size_t length);
> > @@ -1721,12 +1724,14 @@ int zerocopy_fill_skb_from_iter(struct sk_buff *skb,
> >  static inline int skb_zerocopy_iter_dgram(struct sk_buff *skb,
> >                                         struct msghdr *msg, int len)
> >  {
> > -     return __zerocopy_sg_from_iter(msg, skb->sk, skb, &msg->msg_iter, len);
> > +     return __zerocopy_sg_from_iter(msg, skb->sk, skb, &msg->msg_iter, len,
> > +                                    NULL);
> >  }
> >
> >  int skb_zerocopy_iter_stream(struct sock *sk, struct sk_buff *skb,
> >                            struct msghdr *msg, int len,
> > -                          struct ubuf_info *uarg);
> > +                          struct ubuf_info *uarg,
> > +                          struct net_devmem_dmabuf_binding *binding);
> >
> >  /* Internal */
> >  #define skb_shinfo(SKB)      ((struct skb_shared_info *)(skb_end_pointer(SKB)))
> > @@ -3697,6 +3702,10 @@ static inline dma_addr_t __skb_frag_dma_map(struct device *dev,
> >                                           size_t offset, size_t size,
> >                                           enum dma_data_direction dir)
> >  {
> > +     if (skb_frag_is_net_iov(frag)) {
> > +             return netmem_to_net_iov(frag->netmem)->dma_addr + offset +
> > +                    frag->offset;
> > +     }
> >       return dma_map_page(dev, skb_frag_page(frag),
> >                           skb_frag_off(frag) + offset, size, dir);
> >  }
> > diff --git a/include/net/sock.h b/include/net/sock.h
> > index 8daf1b3b12c6..59875bed75e7 100644
> > --- a/include/net/sock.h
> > +++ b/include/net/sock.h
> > @@ -1816,6 +1816,7 @@ struct sockcm_cookie {
> >       u32 tsflags;
> >       u32 ts_opt_id;
> >       u32 priority;
> > +     u32 dmabuf_id;
> >  };
> >
> >  static inline void sockcm_init(struct sockcm_cookie *sockc,
> > diff --git a/net/core/datagram.c b/net/core/datagram.c
> > index f0693707aece..09c74a1d836b 100644
> > --- a/net/core/datagram.c
> > +++ b/net/core/datagram.c
> > @@ -63,6 +63,8 @@
> >  #include <net/busy_poll.h>
> >  #include <crypto/hash.h>
> >
> > +#include "devmem.h"
> > +
> >  /*
> >   *   Is a socket 'connection oriented' ?
> >   */
> > @@ -692,9 +694,49 @@ int zerocopy_fill_skb_from_iter(struct sk_buff *skb,
> >       return 0;
> >  }
> >
> > +static int
> > +zerocopy_fill_skb_from_devmem(struct sk_buff *skb, struct iov_iter *from,
> > +                           int length,
> > +                           struct net_devmem_dmabuf_binding *binding)
> > +{
> > +     int i = skb_shinfo(skb)->nr_frags;
> > +     size_t virt_addr, size, off;
> > +     struct net_iov *niov;
> > +
> > +     /* Devmem filling works by taking an IOVEC from the user where the
> > +      * iov_addrs are interpreted as an offset in bytes into the dma-buf to
> > +      * send from. We do not support other iter types.
> > +      */
> > +     if (iov_iter_type(from) != ITER_IOVEC)
> > +             return -EFAULT;
> > +
> > +     while (length && iov_iter_count(from)) {
> > +             if (i == MAX_SKB_FRAGS)
> > +                     return -EMSGSIZE;
> > +
> > +             virt_addr = (size_t)iter_iov_addr(from);
> > +             niov = net_devmem_get_niov_at(binding, virt_addr, &off, &size);
> > +             if (!niov)
> > +                     return -EFAULT;
> > +
> > +             size = min_t(size_t, size, length);
> > +             size = min_t(size_t, size, iter_iov_len(from));
> > +
> > +             get_netmem(net_iov_to_netmem(niov));
> > +             skb_add_rx_frag_netmem(skb, i, net_iov_to_netmem(niov), off,
> > +                                    size, PAGE_SIZE);
> > +             iov_iter_advance(from, size);
> > +             length -= size;
> > +             i++;
> > +     }
> > +
> > +     return 0;
> > +}
> > +
> >  int __zerocopy_sg_from_iter(struct msghdr *msg, struct sock *sk,
> >                           struct sk_buff *skb, struct iov_iter *from,
> > -                         size_t length)
> > +                         size_t length,
> > +                         struct net_devmem_dmabuf_binding *binding)
> >  {
> >       unsigned long orig_size = skb->truesize;
> >       unsigned long truesize;
> > @@ -702,6 +744,8 @@ int __zerocopy_sg_from_iter(struct msghdr *msg, struct sock *sk,
> >
> >       if (msg && msg->msg_ubuf && msg->sg_from_iter)
> >               ret = msg->sg_from_iter(skb, from, length);
> > +     else if (unlikely(binding))
> > +             ret = zerocopy_fill_skb_from_devmem(skb, from, length, binding);
> >       else
> >               ret = zerocopy_fill_skb_from_iter(skb, from, length);
> >
> > @@ -735,7 +779,7 @@ int zerocopy_sg_from_iter(struct sk_buff *skb, struct iov_iter *from)
> >       if (skb_copy_datagram_from_iter(skb, 0, from, copy))
> >               return -EFAULT;
> >
> > -     return __zerocopy_sg_from_iter(NULL, NULL, skb, from, ~0U);
> > +     return __zerocopy_sg_from_iter(NULL, NULL, skb, from, ~0U, NULL);
> >  }
> >  EXPORT_SYMBOL(zerocopy_sg_from_iter);
> >
> > diff --git a/net/core/devmem.c b/net/core/devmem.c
> > index 0cf3d189f06c..393e30d72dc8 100644
> > --- a/net/core/devmem.c
> > +++ b/net/core/devmem.c
> > @@ -17,6 +17,7 @@
> >  #include <net/netdev_rx_queue.h>
> >  #include <net/page_pool/helpers.h>
> >  #include <net/page_pool/memory_provider.h>
> > +#include <net/sock.h>
> >  #include <trace/events/page_pool.h>
> >
> >  #include "devmem.h"
> > @@ -54,8 +55,10 @@ static dma_addr_t net_devmem_get_dma_addr(const struct net_iov *niov)
> >              ((dma_addr_t)net_iov_idx(niov) << PAGE_SHIFT);
> >  }
> >
> > -void __net_devmem_dmabuf_binding_free(struct net_devmem_dmabuf_binding *binding)
> > +void __net_devmem_dmabuf_binding_free(struct work_struct *wq)
> >  {
> > +     struct net_devmem_dmabuf_binding *binding = container_of(wq, typeof(*binding), unbind_w);
> > +
> >       size_t size, avail;
> >
> >       gen_pool_for_each_chunk(binding->chunk_pool,
> > @@ -73,8 +76,10 @@ void __net_devmem_dmabuf_binding_free(struct net_devmem_dmabuf_binding *binding)
> >       dma_buf_detach(binding->dmabuf, binding->attachment);
> >       dma_buf_put(binding->dmabuf);
> >       xa_destroy(&binding->bound_rxqs);
> > +     kvfree(binding->tx_vec);
> >       kfree(binding);
> >  }
> > +EXPORT_SYMBOL(__net_devmem_dmabuf_binding_free);
> >
> >  struct net_iov *
> >  net_devmem_alloc_dmabuf(struct net_devmem_dmabuf_binding *binding)
> > @@ -119,6 +124,13 @@ void net_devmem_unbind_dmabuf(struct net_devmem_dmabuf_binding *binding)
> >       unsigned long xa_idx;
> >       unsigned int rxq_idx;
> >
> > +     xa_erase(&net_devmem_dmabuf_bindings, binding->id);
> > +
> > +     /* Ensure no tx net_devmem_lookup_dmabuf() are in flight after the
> > +      * erase.
> > +      */
> > +     synchronize_net();
> > +
> >       if (binding->list.next)
> >               list_del(&binding->list);
> >
>
> One thing I forgot to mention: we should probably do the same for the
> allocation path? Move the binding->id allocation to the end of the
> routine to make sure we 'post' fully initialized bindings? Otherwise,
> net_devmem_bind_dmabuf might race with sendmsg?

Ah, good point. That said, sane userspace will wait for the bind to
finish to get the id, and only then pass the id to sendmsg. Only
userspace looking for trouble will be able to trigger a race here, but
yes, we should handle that a bit better.
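
A rough sketch of the suggested reordering (hypothetical, not the actual
code: it assumes binding->id is currently allocated from the
net_devmem_dmabuf_bindings xarray early in net_devmem_bind_dmabuf, and the
error label is made up):

	/* ... dma_buf_attach()/map, chunk_pool and tx_vec setup ... */

	/* Publish the binding only once it is fully initialized, so a
	 * concurrent sendmsg() doing net_devmem_lookup_dmabuf() can never
	 * observe a half-built binding.
	 */
	err = xa_alloc(&net_devmem_dmabuf_bindings, &binding->id, binding,
		       xa_limit_32b, GFP_KERNEL);
	if (err < 0)
		goto err_free_tx_vec;	/* hypothetical error label */

	return binding;
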

--
Thanks,
Mina
