Message-ID: <CAHS8izOE-JzMszieHEXtYBs7_6D-ngVx2kJyMwp8eCWLK-c0cQ@mail.gmail.com>
Date: Wed, 12 Feb 2025 11:18:50 -0800
From: Mina Almasry <almasrymina@...gle.com>
To: Pavel Begunkov <asml.silence@...il.com>
Cc: netdev@...r.kernel.org, linux-kernel@...r.kernel.org,
linux-doc@...r.kernel.org, kvm@...r.kernel.org,
virtualization@...ts.linux.dev, linux-kselftest@...r.kernel.org,
Donald Hunter <donald.hunter@...il.com>, Jakub Kicinski <kuba@...nel.org>,
"David S. Miller" <davem@...emloft.net>, Eric Dumazet <edumazet@...gle.com>,
Paolo Abeni <pabeni@...hat.com>, Simon Horman <horms@...nel.org>, Jonathan Corbet <corbet@....net>,
Andrew Lunn <andrew+netdev@...n.ch>, Neal Cardwell <ncardwell@...gle.com>,
David Ahern <dsahern@...nel.org>, "Michael S. Tsirkin" <mst@...hat.com>, Jason Wang <jasowang@...hat.com>,
Xuan Zhuo <xuanzhuo@...ux.alibaba.com>, Eugenio Pérez <eperezma@...hat.com>,
Stefan Hajnoczi <stefanha@...hat.com>, Stefano Garzarella <sgarzare@...hat.com>, Shuah Khan <shuah@...nel.org>,
sdf@...ichev.me, dw@...idwei.uk, Jamal Hadi Salim <jhs@...atatu.com>,
Victor Nogueira <victor@...atatu.com>, Pedro Tammela <pctammela@...atatu.com>,
Samiullah Khawaja <skhawaja@...gle.com>, Kaiyuan Zhang <kaiyuanz@...gle.com>
Subject: Re: [PATCH net-next v3 5/6] net: devmem: Implement TX path
On Wed, Feb 12, 2025 at 7:52 AM Pavel Begunkov <asml.silence@...il.com> wrote:
>
> On 2/10/25 21:09, Mina Almasry wrote:
> > On Wed, Feb 5, 2025 at 4:20 AM Pavel Begunkov <asml.silence@...il.com> wrote:
> >>
> >> On 2/3/25 22:39, Mina Almasry wrote:
> >> ...
> >>> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> >>> index bb2b751d274a..3ff8f568c382 100644
> >>> --- a/include/linux/skbuff.h
> >>> +++ b/include/linux/skbuff.h
> >>> @@ -1711,9 +1711,12 @@ struct ubuf_info *msg_zerocopy_realloc(struct sock *sk, size_t size,
> >> ...
> >>> int zerocopy_fill_skb_from_iter(struct sk_buff *skb,
> >>> struct iov_iter *from, size_t length);
> >>> @@ -1721,12 +1724,14 @@ int zerocopy_fill_skb_from_iter(struct sk_buff *skb,
> >>> static inline int skb_zerocopy_iter_dgram(struct sk_buff *skb,
> >>> struct msghdr *msg, int len)
> >>> {
> >>> - return __zerocopy_sg_from_iter(msg, skb->sk, skb, &msg->msg_iter, len);
> >>> + return __zerocopy_sg_from_iter(msg, skb->sk, skb, &msg->msg_iter, len,
> >>> + NULL);
> >>
> >> Instead of propagating it all the way down and carving a new path, why
> >> not reuse the existing infra? You already hook into where ubuf is
> >> allocated, you can stash the binding in there. And
> >
> > It looks like it's not possible to increase the size of ubuf_info at
> > all, otherwise the BUILD_BUG_ON in msg_zerocopy_alloc() fires.
> >
> > It's asserting that sizeof(ubuf_info_msgzc) <= sizeof(skb->cb), and
> > I'm guessing increasing skb->cb size is not really the way to go.
> >
> > What I may be able to do here is stash the binding somewhere in
> > ubuf_info_msgzc via union with fields we don't need for devmem, and/or
>
> It doesn't need to account the memory against the user, and you
> actually don't want that because dmabuf should take care of that.
> So, it should be fine to reuse ->mmp.
>
> It's also not a real sk_buff, so maybe maintainers wouldn't mind
> reusing some more space out of it, if that would even be needed.
>
netmem skbs are real sk_buffs, with the one modification that the frags
are not readable, and that only in the case where the netmem is
unreadable. I would not approve of treating netmem/devmem skbs as "not
real skbs", messing with the semantics of skb fields for devmem skbs,
and then having to add skb_is_devmem() checks to all the code in the
skb handlers that touches the fields being overwritten in the devmem
case. No, I don't think we can re-use random fields in the skb for
devmem.
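
That said, stashing the binding in ubuf_info_msgzc itself via a union
with fields devmem doesn't need (as I mentioned above) does look
doable. A rough, untested sketch of what I have in mind -- the union
layout here is just my guess at reusing the ->mmp space, since devmem
doesn't need the memory accounting when the dmabuf layer already does
it:

        struct ubuf_info_msgzc {
                struct ubuf_info ubuf;
                ...
                union {
                        /* memory accounting, not needed for devmem */
                        struct mmpin {
                                struct user_struct *user;
                                unsigned int num_pg;
                        } mmp;
                        /* devmem TX: the dmabuf binding for this send */
                        struct net_devmem_dmabuf_binding *binding;
                };
        };

That keeps sizeof(ubuf_info_msgzc) unchanged, so the BUILD_BUG_ON
against sizeof(skb->cb) in msg_zerocopy_alloc() still holds.
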
> > stashing the binding in ubuf_info_ops (very hacky). Neither approach
> > seems ideal, but the former may work and may be cleaner.
> >
> > I'll take a deeper look here. I had looked before and concluded that
> > we're piggybacking devmem TX on MSG_ZEROCOPY path, because we need
> > almost all of the functionality there (no copying, send complete
> > notifications, etc), with one minor change in the skb filling. I had
> > concluded that if MSG_ZEROCOPY was never updated to use the existing
> > infra, then it's appropriate for devmem TX piggybacking on top of it
>
> MSG_ZEROCOPY does use the common infra, i.e. passing ubuf_info,
> but it doesn't need ->sg_from_iter, since zerocopy_fill_skb_from_iter()
> is the default path and is what was there first.
>
But MSG_ZEROCOPY doesn't set msg->msg_ubuf, and without msg->msg_ubuf
set, msg->sg_from_iter is never invoked at all. Also, sg_from_iter
currently isn't set up to take a ubuf_info, which we'd need if we stash
the binding in the ubuf_info.
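
(For context, the dispatch in __zerocopy_sg_from_iter() is roughly the
below -- quoting from memory, so the details may be slightly off:

        if (msg && msg->msg_ubuf && msg->sg_from_iter)
                ret = msg->sg_from_iter(...);
        else
                ret = zerocopy_fill_skb_from_iter(skb, from, length);

so MSG_ZEROCOPY, which never sets msg_ubuf, always takes the
zerocopy_fill_skb_from_iter() branch today.)
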
All in all, I think I want to prototype an msg->sg_from_iter approach
and make a judgement call on whether it's cleaner than just passing the
binding through a couple of helpers as I'm doing here. My feeling is
that the implementation in this patch may be cleaner than refactoring
the entire msg_ubuf/sg_from_iter flow so that MSG_ZEROCOPY with devmem
can sort of use it, when today it doesn't use it at all.
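
For reference, the rough shape of the prototype I have in mind is
below. Just a sketch, not tested; it assumes the binding got stashed in
ubuf_info_msgzc as in the union sketch above, devmem_sg_from_iter is a
made-up name, and I'm not committing to the exact ->sg_from_iter
arguments here:

        static int devmem_sg_from_iter(struct sk_buff *skb,
                                       struct iov_iter *from, size_t length)
        {
                /* recover the binding stashed in the ubuf_info */
                struct ubuf_info_msgzc *uarg = uarg_to_msgzc(skb_zcopy(skb));

                return zerocopy_fill_skb_from_devmem(skb, from, length,
                                                     uarg->binding);
        }

and the devmem sendmsg path would set msg->msg_ubuf plus
msg->sg_from_iter = devmem_sg_from_iter, instead of passing the binding
down through the helpers as in this patch.
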
> > to follow that. I would not want to get into a refactor of
> > MSG_ZEROCOPY for no real reason.
> >
> > But I'll take a deeper look here and see if I can make something
> > slightly cleaner work.
> >
> >> zerocopy_fill_skb_from_devmem can implement ->sg_from_iter,
> >> see __zerocopy_sg_from_iter().
> >>
> >> ...
> >>> diff --git a/net/core/datagram.c b/net/core/datagram.c
> >>> index f0693707aece..c989606ff58d 100644
> >>> --- a/net/core/datagram.c
> >>> +++ b/net/core/datagram.c
> >>> @@ -63,6 +63,8 @@
> >>> +static int
> >>> +zerocopy_fill_skb_from_devmem(struct sk_buff *skb, struct iov_iter *from,
> >>> + int length,
> >>> + struct net_devmem_dmabuf_binding *binding)
> >>> +{
> >>> + int i = skb_shinfo(skb)->nr_frags;
> >>> + size_t virt_addr, size, off;
> >>> + struct net_iov *niov;
> >>> +
> >>> + while (length && iov_iter_count(from)) {
> >>> + if (i == MAX_SKB_FRAGS)
> >>> + return -EMSGSIZE;
> >>> +
> >>> + virt_addr = (size_t)iter_iov_addr(from);
> >>
> >> Unless I missed it somewhere it needs to check that the iter
> >> is iovec based.
> >>
> >
> > How do we end up here with an iterator that is not iovec based? Is the
> > user able to trigger that somehow and I missed it?
>
> Hopefully not, but for example io_uring passes bvecs for a number of
> requests that can end up in tcp_sendmsg_locked(). Those probably
> would work with the current patch, but change the order of some of the
> checks and it will break. And once io_uring starts passing bvecs for
> normal send[msg] requests, it'd definitely be possible. And there
> are other in-kernel users apart from the send(2) path, so who knows.
>
> The API allows it and therefore it should be checked; it's better to
> avoid quite possible latent bugs.
>
Sounds good.
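I'll add something like the below at the top of
zerocopy_fill_skb_from_devmem() in the next version (sketch; exact
error code and the set of allowed iter types TBD):

        if (!iter_is_iovec(from))
                return -EFAULT;
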
--
Thanks,
Mina