Message-ID: <CAHS8izNzbEi_Dn+hDohF9Go=et7kts-BnmEpq=Znpot7o7B5wA@mail.gmail.com>
Date: Mon, 27 Jan 2025 16:06:21 -0800
From: Mina Almasry <almasrymina@...gle.com>
To: Willem de Bruijn <willemdebruijn.kernel@...il.com>
Cc: netdev@...r.kernel.org, linux-kernel@...r.kernel.org,
linux-doc@...r.kernel.org, virtualization@...ts.linux.dev,
kvm@...r.kernel.org, linux-kselftest@...r.kernel.org,
"David S. Miller" <davem@...emloft.net>, Eric Dumazet <edumazet@...gle.com>,
Jakub Kicinski <kuba@...nel.org>, Paolo Abeni <pabeni@...hat.com>, Simon Horman <horms@...nel.org>,
Donald Hunter <donald.hunter@...il.com>, Jonathan Corbet <corbet@....net>,
Andrew Lunn <andrew+netdev@...n.ch>, David Ahern <dsahern@...nel.org>,
"Michael S. Tsirkin" <mst@...hat.com>, Jason Wang <jasowang@...hat.com>,
Xuan Zhuo <xuanzhuo@...ux.alibaba.com>, Eugenio Pérez <eperezma@...hat.com>,
Stefan Hajnoczi <stefanha@...hat.com>, Stefano Garzarella <sgarzare@...hat.com>, Shuah Khan <shuah@...nel.org>,
Kaiyuan Zhang <kaiyuanz@...gle.com>, Pavel Begunkov <asml.silence@...il.com>,
Willem de Bruijn <willemb@...gle.com>, Samiullah Khawaja <skhawaja@...gle.com>,
Stanislav Fomichev <sdf@...ichev.me>, Joe Damato <jdamato@...tly.com>, dw@...idwei.uk
Subject: Re: [PATCH RFC net-next v1 5/5] net: devmem: Implement TX path
Hi Willem, sorry for the late reply; holiday vacation and other
priorities pulled me away from this for a bit. I'm getting ready to
post RFC v2, so I'm answering some questions ahead of that.
On Thu, Dec 26, 2024 at 1:52 PM Willem de Bruijn
<willemdebruijn.kernel@...il.com> wrote:
...
> > +static int zerocopy_fill_skb_from_devmem(struct sk_buff *skb,
> > + struct msghdr *msg,
> > + struct iov_iter *from, int length)
> > +{
> > + int i = skb_shinfo(skb)->nr_frags;
> > + int orig_length = length;
> > + netmem_ref netmem;
> > + size_t size;
> > +
> > + while (length && iov_iter_count(from)) {
> > + if (i == MAX_SKB_FRAGS)
> > + return -EMSGSIZE;
> > +
> > + size = min_t(size_t, iter_iov_len(from), length);
> > + if (!size)
> > + return -EFAULT;
>
> On error, should caller skb_zerocopy_iter_stream rewind from, rather
> than (or as well as) msg->msg_iter?
Ah, this was confusing because there were two iterators to keep track
of: (a) binding->tx_iter, which is `from`, and (b) msg->msg_iter.

Stan suggested removing binding->tx_iter entirely, so that we're back
to using only one iterator, msg->msg_iter. That simplifies the code
greatly, and I think it addresses this comment as well, because there
is no longer a need to keep `from` advanced/reverted in lockstep with
msg->msg_iter.
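Untested sketch of roughly what I have in mind for v2, with
msg->msg_iter as the only iterator. net_devmem_get_niov_at() is just a
placeholder name here for a helper that resolves a dma-buf offset to
its backing net_iov; exact names may differ in the posted version:

static int zerocopy_fill_skb_from_devmem(struct sk_buff *skb,
					 struct iov_iter *from, int length,
					 struct net_devmem_dmabuf_binding *binding)
{
	int i = skb_shinfo(skb)->nr_frags;
	struct net_iov *niov;
	size_t off, size;

	while (length && iov_iter_count(from)) {
		if (i == MAX_SKB_FRAGS)
			return -EMSGSIZE;

		/* iov_base carries an offset into the dma-buf; resolve
		 * it to the backing net_iov (placeholder helper).
		 */
		niov = net_devmem_get_niov_at(binding,
					      (size_t)iter_iov_addr(from),
					      &off, &size);
		if (!niov)
			return -EFAULT;

		size = min_t(size_t, size, length);
		size = min_t(size_t, size, iter_iov_len(from));

		get_netmem(net_iov_to_netmem(niov));
		skb_add_rx_frag_netmem(skb, i, net_iov_to_netmem(niov),
				       off, size, PAGE_SIZE);

		/* Only msg->msg_iter to advance; no second iterator to
		 * keep in sync or rewind on error.
		 */
		iov_iter_advance(from, size);
		length -= size;
		i++;
	}

	return 0;
}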
> > +
> > + netmem = net_iov_to_netmem(iter_iov(from)->iov_base);
> > + get_netmem(netmem);
> > + skb_add_rx_frag_netmem(skb, i, netmem, from->iov_offset, size,
> > + PAGE_SIZE);
> > +
> > + iov_iter_advance(from, size);
> > + length -= size;
> > + i++;
> > + }
> > +
> > + iov_iter_advance(&msg->msg_iter, orig_length);
>
> What does this do if sendmsg is called with NULL as buffer?
Even if iov_base == NULL, the iterator is still created; it simply
covers addresses 0 through iov_len.

In the next revision I've applied Stan's suggestion to use iov_base
as the offset into the dma-buf to send from. I think it ends up being
a much cleaner UAPI, but let me know what you think.
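From userspace it would look roughly like this (untested sketch;
assumes TX reuses the SCM_DEVMEM_DMABUF cmsg type from the RX series
to carry the dmabuf id, so treat the names as provisional):

#include <linux/types.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* fd is a connected TCP socket; dmabuf_id is the id returned by the
 * netlink bind; dmabuf_offset/send_len describe the region to send.
 */
ssize_t send_devmem(int fd, __u32 dmabuf_id, size_t dmabuf_offset,
		    size_t send_len)
{
	/* iov_base carries an offset into the bound dma-buf, not a
	 * virtual address.
	 */
	struct iovec iov = {
		.iov_base = (void *)dmabuf_offset,
		.iov_len = send_len,
	};
	char ctrl[CMSG_SPACE(sizeof(__u32))] = {};
	struct msghdr msg = {
		.msg_iov = &iov,
		.msg_iovlen = 1,
		.msg_control = ctrl,
		.msg_controllen = sizeof(ctrl),
	};
	struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);

	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_DEVMEM_DMABUF;	/* provisional for TX */
	cmsg->cmsg_len = CMSG_LEN(sizeof(__u32));
	memcpy(CMSG_DATA(cmsg), &dmabuf_id, sizeof(dmabuf_id));

	return sendmsg(fd, &msg, MSG_ZEROCOPY);
}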
...
> > struct net_iov *
> > net_devmem_alloc_dmabuf(struct net_devmem_dmabuf_binding *binding)
> > @@ -109,6 +112,13 @@ void net_devmem_unbind_dmabuf(struct net_devmem_dmabuf_binding *binding)
> > unsigned long xa_idx;
> > unsigned int rxq_idx;
> >
> > + xa_erase(&net_devmem_dmabuf_bindings, binding->id);
> > +
> > + /* Ensure no tx net_devmem_lookup_dmabuf() are in flight after the
> > + * erase.
> > + */
> > + synchronize_net();
> > +
>
> What precisely does this protect?
>
> synchronize_net() ensures no packet is in flight inside an rcu
> readside section. But a packet can still be in flight, such as posted
> to the device or queued in a qdisc.
>
The TX data path does a net_devmem_lookup_dmabuf() to look up the
dmabuf_id provided by the user.

But the user may unbind that dmabuf_id via net_devmem_unbind_dmabuf()
at any time, so some synchronization is needed to make sure we don't
send from a dmabuf that is being freed in another thread.

The synchronization in this patch is such that the lookup path obtains
a reference under the RCU read lock, and the unbind control path waits
a full RCU grace period before dropping its reference via
net_devmem_dmabuf_binding_put(), which triggers freeing the binding
once the refcount hits zero.
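For illustration, the two sides pair up roughly like this (untested
sketch; field and helper names are approximate, not verbatim from the
patch):

/* TX lookup path: take a reference under the RCU read lock. */
struct net_devmem_dmabuf_binding *net_devmem_lookup_dmabuf(u32 id)
{
	struct net_devmem_dmabuf_binding *binding;

	rcu_read_lock();
	binding = xa_load(&net_devmem_dmabuf_bindings, id);
	/* The binding may be concurrently unbound; fail the lookup if
	 * its refcount has already dropped to zero.
	 */
	if (binding && !refcount_inc_not_zero(&binding->ref))
		binding = NULL;
	rcu_read_unlock();

	return binding;
}

/* Unbind control path, in net_devmem_unbind_dmabuf(): */
xa_erase(&net_devmem_dmabuf_bindings, binding->id);
synchronize_net();			/* flush in-flight lookups */
net_devmem_dmabuf_binding_put(binding);	/* frees on last reference */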
...
> > +struct net_devmem_dmabuf_binding *
> > +net_devmem_get_sockc_binding(struct sock *sk, struct sockcm_cookie *sockc)
> > +{
> > + struct net_devmem_dmabuf_binding *binding;
> > + int err = 0;
> > +
> > + binding = net_devmem_lookup_dmabuf(sockc->dmabuf_id);
>
> This lookup is from global xarray net_devmem_dmabuf_bindings.
>
> Is there a check that the socket is sending out through the device
> to which this dmabuf was bound with netlink? Should there be?
> (e.g., SO_BINDTODEVICE).
>
Yes, I think it may be an issue if the user triggers a send from a
different netdevice, because when we bind a dmabuf we do bind it to a
specific netdevice.

One option, as you say, is to require TX sockets to be bound and to
check that the socket is bound to the correct netdev. I also wonder
if I can make this work without SO_BINDTODEVICE, by querying the
netdev the socket is currently sending out of and doing the check in
tcp_sendmsg(). I'm not sure if this is possible, but I'll give it a
look.
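Hypothetical check, assuming the binding keeps a pointer to the netdev
it was bound to (untested, and likely too naive for routes that can
change under the socket):

/* Reject sends where the route's output device does not match the
 * netdev the dma-buf was bound to.
 */
static int net_devmem_check_binding_dev(struct sock *sk,
					struct net_devmem_dmabuf_binding *binding)
{
	struct dst_entry *dst = sk_dst_get(sk);
	int err = 0;

	if (!dst || dst->dev != binding->dev)
		err = -ENODEV;

	dst_release(dst);	/* dst_release() tolerates NULL */
	return err;
}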
--
Thanks,
Mina