Message-ID: <CAHS8izNzbEi_Dn+hDohF9Go=et7kts-BnmEpq=Znpot7o7B5wA@mail.gmail.com>
Date: Mon, 27 Jan 2025 16:06:21 -0800
From: Mina Almasry <almasrymina@...gle.com>
To: Willem de Bruijn <willemdebruijn.kernel@...il.com>
Cc: netdev@...r.kernel.org, linux-kernel@...r.kernel.org,
linux-doc@...r.kernel.org, virtualization@...ts.linux.dev,
kvm@...r.kernel.org, linux-kselftest@...r.kernel.org,
"David S. Miller" <davem@...emloft.net>, Eric Dumazet <edumazet@...gle.com>,
Jakub Kicinski <kuba@...nel.org>, Paolo Abeni <pabeni@...hat.com>, Simon Horman <horms@...nel.org>,
Donald Hunter <donald.hunter@...il.com>, Jonathan Corbet <corbet@....net>,
Andrew Lunn <andrew+netdev@...n.ch>, David Ahern <dsahern@...nel.org>,
"Michael S. Tsirkin" <mst@...hat.com>, Jason Wang <jasowang@...hat.com>,
Xuan Zhuo <xuanzhuo@...ux.alibaba.com>, Eugenio Pérez <eperezma@...hat.com>,
Stefan Hajnoczi <stefanha@...hat.com>, Stefano Garzarella <sgarzare@...hat.com>, Shuah Khan <shuah@...nel.org>,
Kaiyuan Zhang <kaiyuanz@...gle.com>, Pavel Begunkov <asml.silence@...il.com>,
Willem de Bruijn <willemb@...gle.com>, Samiullah Khawaja <skhawaja@...gle.com>,
Stanislav Fomichev <sdf@...ichev.me>, Joe Damato <jdamato@...tly.com>, dw@...idwei.uk
Subject: Re: [PATCH RFC net-next v1 5/5] net: devmem: Implement TX path
Hi Willem, sorry for the late reply; holiday vacation and other
priorities pulled me away from this for a bit. I'm getting ready to
post RFC v2, so I'm answering some questions ahead of that.
On Thu, Dec 26, 2024 at 1:52 PM Willem de Bruijn
<willemdebruijn.kernel@...il.com> wrote:
...
> > +static int zerocopy_fill_skb_from_devmem(struct sk_buff *skb,
> > + struct msghdr *msg,
> > + struct iov_iter *from, int length)
> > +{
> > + int i = skb_shinfo(skb)->nr_frags;
> > + int orig_length = length;
> > + netmem_ref netmem;
> > + size_t size;
> > +
> > + while (length && iov_iter_count(from)) {
> > + if (i == MAX_SKB_FRAGS)
> > + return -EMSGSIZE;
> > +
> > + size = min_t(size_t, iter_iov_len(from), length);
> > + if (!size)
> > + return -EFAULT;
>
> On error, should caller skb_zerocopy_iter_stream rewind from, rather
> than (or as well as) msg->msg_iter?
Ah, this was confusing because there were two iterators to keep track
of: (a) binding->tx_iter, which is `from`, and (b) msg->msg_iter.

Stan suggested removing binding->tx_iter entirely, so that we're back
to using only one iterator, msg->msg_iter. That simplifies the code
greatly, and I think it addresses this comment as well, because there
is no longer a need to keep `from` advanced/reverted in lockstep with
msg->msg_iter.
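Untested sketch of roughly what I have in mind for v2, with
msg->msg_iter as the only iterator. net_devmem_get_niov_at() is just a
placeholder name here for a helper that resolves a dma-buf offset to
its backing net_iov; exact names may differ in the posted version:

static int zerocopy_fill_skb_from_devmem(struct sk_buff *skb,
					 struct iov_iter *from, int length,
					 struct net_devmem_dmabuf_binding *binding)
{
	int i = skb_shinfo(skb)->nr_frags;
	struct net_iov *niov;
	size_t off, size;

	while (length && iov_iter_count(from)) {
		if (i == MAX_SKB_FRAGS)
			return -EMSGSIZE;

		/* iov_base carries an offset into the dma-buf; resolve
		 * it to the backing net_iov (placeholder helper).
		 */
		niov = net_devmem_get_niov_at(binding,
					      (size_t)iter_iov_addr(from),
					      &off, &size);
		if (!niov)
			return -EFAULT;

		size = min_t(size_t, size, length);
		size = min_t(size_t, size, iter_iov_len(from));

		get_netmem(net_iov_to_netmem(niov));
		skb_add_rx_frag_netmem(skb, i, net_iov_to_netmem(niov),
				       off, size, PAGE_SIZE);

		/* Only msg->msg_iter to advance; no second iterator to
		 * keep in sync or rewind on error.
		 */
		iov_iter_advance(from, size);
		length -= size;
		i++;
	}

	return 0;
}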
> > +
> > + netmem = net_iov_to_netmem(iter_iov(from)->iov_base);
> > + get_netmem(netmem);
> > + skb_add_rx_frag_netmem(skb, i, netmem, from->iov_offset, size,
> > + PAGE_SIZE);
> > +
> > + iov_iter_advance(from, size);
> > + length -= size;
> > + i++;
> > + }
> > +
> > + iov_iter_advance(&msg->msg_iter, orig_length);
>
> What does this do if sendmsg is called with NULL as buffer?
Even if iov_base == NULL, the iterator is still created; it simply
covers addresses 0 through iov_len.

In the next revision I've applied Stan's suggestion to use iov_base
as the offset into the dma-buf to send from. I think it ends up being
a much cleaner UAPI, but let me know what you think.
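From userspace it would look roughly like this (untested sketch;
assumes TX reuses the SCM_DEVMEM_DMABUF cmsg type from the RX series
to carry the dmabuf id, so treat the names as provisional):

#include <linux/types.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* fd is a connected TCP socket; dmabuf_id is the id returned by the
 * netlink bind; dmabuf_offset/send_len describe the region to send.
 */
ssize_t send_devmem(int fd, __u32 dmabuf_id, size_t dmabuf_offset,
		    size_t send_len)
{
	/* iov_base carries an offset into the bound dma-buf, not a
	 * virtual address.
	 */
	struct iovec iov = {
		.iov_base = (void *)dmabuf_offset,
		.iov_len = send_len,
	};
	char ctrl[CMSG_SPACE(sizeof(__u32))] = {};
	struct msghdr msg = {
		.msg_iov = &iov,
		.msg_iovlen = 1,
		.msg_control = ctrl,
		.msg_controllen = sizeof(ctrl),
	};
	struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);

	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_DEVMEM_DMABUF;	/* provisional for TX */
	cmsg->cmsg_len = CMSG_LEN(sizeof(__u32));
	memcpy(CMSG_DATA(cmsg), &dmabuf_id, sizeof(dmabuf_id));

	return sendmsg(fd, &msg, MSG_ZEROCOPY);
}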
...
> > struct net_iov *
> > net_devmem_alloc_dmabuf(struct net_devmem_dmabuf_binding *binding)
> > @@ -109,6 +112,13 @@ void net_devmem_unbind_dmabuf(struct net_devmem_dmabuf_binding *binding)
> > unsigned long xa_idx;
> > unsigned int rxq_idx;
> >
> > + xa_erase(&net_devmem_dmabuf_bindings, binding->id);
> > +
> > + /* Ensure no tx net_devmem_lookup_dmabuf() are in flight after the
> > + * erase.
> > + */
> > + synchronize_net();
> > +
>
> What precisely does this protect?
>
> synchronize_net() ensures no packet is in flight inside an rcu
> readside section. But a packet can still be in flight, such as posted
> to the device or queued in a qdisc.
>
The TX data path does a net_devmem_lookup_dmabuf() to look up the
dmabuf_id provided by the user.

But the user may unbind that dmabuf_id via net_devmem_unbind_dmabuf()
at any time, so some synchronization is needed to make sure we don't
send from a dmabuf that is being freed in another thread.

The synchronization in this patch is such that the lookup path obtains
a reference under the RCU read lock, and the unbind control path waits
a full RCU grace period before dropping its reference via
net_devmem_dmabuf_binding_put(), which triggers freeing the binding
once the refcount hits zero.
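For illustration, the two sides pair up roughly like this (untested
sketch; field and helper names are approximate, not verbatim from the
patch):

/* TX lookup path: take a reference under the RCU read lock. */
struct net_devmem_dmabuf_binding *net_devmem_lookup_dmabuf(u32 id)
{
	struct net_devmem_dmabuf_binding *binding;

	rcu_read_lock();
	binding = xa_load(&net_devmem_dmabuf_bindings, id);
	/* The binding may be concurrently unbound; fail the lookup if
	 * its refcount has already dropped to zero.
	 */
	if (binding && !refcount_inc_not_zero(&binding->ref))
		binding = NULL;
	rcu_read_unlock();

	return binding;
}

/* Unbind control path, in net_devmem_unbind_dmabuf(): */
xa_erase(&net_devmem_dmabuf_bindings, binding->id);
synchronize_net();			/* flush in-flight lookups */
net_devmem_dmabuf_binding_put(binding);	/* frees on last reference */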
...
> > +struct net_devmem_dmabuf_binding *
> > +net_devmem_get_sockc_binding(struct sock *sk, struct sockcm_cookie *sockc)
> > +{
> > + struct net_devmem_dmabuf_binding *binding;
> > + int err = 0;
> > +
> > + binding = net_devmem_lookup_dmabuf(sockc->dmabuf_id);
>
> This lookup is from global xarray net_devmem_dmabuf_bindings.
>
> Is there a check that the socket is sending out through the device
> to which this dmabuf was bound with netlink? Should there be?
> (e.g., SO_BINDTODEVICE).
>
Yes, I think it may be an issue if the user triggers a send from a
different netdevice, because when we bind a dmabuf we do bind it to a
specific netdevice.

One option, as you say, is to require TX sockets to be bound and to
check that the socket is bound to the correct netdev. I also wonder
if I can make this work without SO_BINDTODEVICE, by querying the
netdev the socket is currently sending out of and doing the check in
tcp_sendmsg(). I'm not sure if this is possible, but I'll give it a
look.
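Hypothetical check, assuming the binding keeps a pointer to the netdev
it was bound to (untested, and likely too naive for routes that can
change under the socket):

/* Reject sends where the route's output device does not match the
 * netdev the dma-buf was bound to.
 */
static int net_devmem_check_binding_dev(struct sock *sk,
					struct net_devmem_dmabuf_binding *binding)
{
	struct dst_entry *dst = sk_dst_get(sk);
	int err = 0;

	if (!dst || dst->dev != binding->dev)
		err = -ENODEV;

	dst_release(dst);	/* dst_release() tolerates NULL */
	return err;
}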
--
Thanks,
Mina