[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAHS8izMefrkHf9WXerrOY4Wo8U2KmxSVkgY+4JB+6iDuoCZ3WQ@mail.gmail.com>
Date: Fri, 2 May 2025 12:20:59 -0700
From: Mina Almasry <almasrymina@...gle.com>
To: Paolo Abeni <pabeni@...hat.com>
Cc: netdev@...r.kernel.org, linux-kernel@...r.kernel.org,
linux-doc@...r.kernel.org, io-uring@...r.kernel.org,
virtualization@...ts.linux.dev, kvm@...r.kernel.org,
linux-kselftest@...r.kernel.org, "David S. Miller" <davem@...emloft.net>,
Eric Dumazet <edumazet@...gle.com>, Jakub Kicinski <kuba@...nel.org>, Simon Horman <horms@...nel.org>,
Donald Hunter <donald.hunter@...il.com>, Jonathan Corbet <corbet@....net>,
Andrew Lunn <andrew+netdev@...n.ch>, Jeroen de Borst <jeroendb@...gle.com>,
Harshitha Ramamurthy <hramamurthy@...gle.com>, Kuniyuki Iwashima <kuniyu@...zon.com>,
Willem de Bruijn <willemb@...gle.com>, Jens Axboe <axboe@...nel.dk>,
Pavel Begunkov <asml.silence@...il.com>, David Ahern <dsahern@...nel.org>,
Neal Cardwell <ncardwell@...gle.com>, "Michael S. Tsirkin" <mst@...hat.com>, Jason Wang <jasowang@...hat.com>,
Xuan Zhuo <xuanzhuo@...ux.alibaba.com>, Eugenio Pérez <eperezma@...hat.com>,
Stefan Hajnoczi <stefanha@...hat.com>, Stefano Garzarella <sgarzare@...hat.com>, Shuah Khan <shuah@...nel.org>,
sdf@...ichev.me, dw@...idwei.uk, Jamal Hadi Salim <jhs@...atatu.com>,
Victor Nogueira <victor@...atatu.com>, Pedro Tammela <pctammela@...atatu.com>,
Samiullah Khawaja <skhawaja@...gle.com>, Kaiyuan Zhang <kaiyuanz@...gle.com>
Subject: Re: [PATCH net-next v13 4/9] net: devmem: Implement TX path
On Fri, May 2, 2025 at 4:47 AM Paolo Abeni <pabeni@...hat.com> wrote:
>
> Hi,
>
> On 4/29/25 5:26 AM, Mina Almasry wrote:
> > Augment dmabuf binding to be able to handle TX. Additional to all the RX
> > binding, we also create tx_vec needed for the TX path.
> >
> > Provide API for sendmsg to be able to send dmabufs bound to this device:
> >
> > - Provide a new dmabuf_tx_cmsg which includes the dmabuf to send from.
> > - MSG_ZEROCOPY with SCM_DEVMEM_DMABUF cmsg indicates send from dma-buf.
> >
> > Devmem is uncopyable, so piggyback off the existing MSG_ZEROCOPY
> > implementation, while disabling instances where MSG_ZEROCOPY falls back
> > to copying.
> >
> > We additionally pipe the binding down to the new
> > zerocopy_fill_skb_from_devmem which fills a TX skb with net_iov netmems
> > instead of the traditional page netmems.
> >
> > We also special case skb_frag_dma_map to return the dma-address of these
> > dmabuf net_iovs instead of attempting to map pages.
> >
> > The TX path may release the dmabuf in a context where we cannot wait.
> > This happens when the user unbinds a TX dmabuf while there are still
> > references to its netmems in the TX path. In that case, the netmems will
> > be put_netmem'd from a context where we can't unmap the dmabuf, Resolve
> > this by making __net_devmem_dmabuf_binding_free schedule_work'd.
> >
> > Based on work by Stanislav Fomichev <sdf@...ichev.me>. A lot of the meat
> > of the implementation came from devmem TCP RFC v1[1], which included the
> > TX path, but Stan did all the rebasing on top of netmem/net_iov.
> >
> > Cc: Stanislav Fomichev <sdf@...ichev.me>
> > Signed-off-by: Kaiyuan Zhang <kaiyuanz@...gle.com>
> > Signed-off-by: Mina Almasry <almasrymina@...gle.com>
> > Acked-by: Stanislav Fomichev <sdf@...ichev.me>
>
> I'm sorry for the late feedback. A bunch of things I did not notice
> before...
>
> > @@ -701,6 +743,8 @@ int __zerocopy_sg_from_iter(struct msghdr *msg, struct sock *sk,
> >
> > if (msg && msg->msg_ubuf && msg->sg_from_iter)
> > ret = msg->sg_from_iter(skb, from, length);
> > + else if (unlikely(binding))
>
> I'm unsure if the unlikely() here (and in similar tests below) it's
> worthy: depending on the actual workload this condition could be very
> likely.
>
Right, for now I'm guessing the MSG_ZEROCOPY use case in the else is
more common workload, and putting the devmem use case in the unlikely
path so I don't regress other use cases. We could revisit this in the
future. In my tests, the devmem workload doesn't seem affected by this
unlikely.
> [...]
> > @@ -1066,11 +1067,24 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
> > int flags, err, copied = 0;
> > int mss_now = 0, size_goal, copied_syn = 0;
> > int process_backlog = 0;
> > + bool sockc_valid = true;
> > int zc = 0;
> > long timeo;
> >
> > flags = msg->msg_flags;
> >
> > + sockc = (struct sockcm_cookie){ .tsflags = READ_ONCE(sk->sk_tsflags),
> > + .dmabuf_id = 0 };
>
> the '.dmabuf_id = 0' part is not needed, and possibly the code is
> clearer without it.
>
> > + if (msg->msg_controllen) {
> > + err = sock_cmsg_send(sk, msg, &sockc);
> > + if (unlikely(err))
> > + /* Don't return error until MSG_FASTOPEN has been
> > + * processed; that may succeed even if the cmsg is
> > + * invalid.
> > + */
> > + sockc_valid = false;
> > + }
> > +
> > if ((flags & MSG_ZEROCOPY) && size) {
> > if (msg->msg_ubuf) {
> > uarg = msg->msg_ubuf;
> > @@ -1078,7 +1092,8 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
> > zc = MSG_ZEROCOPY;
> > } else if (sock_flag(sk, SOCK_ZEROCOPY)) {
> > skb = tcp_write_queue_tail(sk);
> > - uarg = msg_zerocopy_realloc(sk, size, skb_zcopy(skb));
> > + uarg = msg_zerocopy_realloc(sk, size, skb_zcopy(skb),
> > + sockc_valid && !!sockc.dmabuf_id);
>
> If sock_cmsg_send() failed and the user did not provide a dmabuf_id,
> memory accounting will be incorrect.
>
Forgive me but I don't see it. sockc_valid will be false, so
msg_zerocopy_realloc will do the normal MSG_ZEROCOPY accounting. Then
below whech check sockc_valid in place of where we did the
sock_cmsg_send before, and goto err. I assume the goto err should undo
the memory accounting in the new code as in the old code. Can you
elaborate on the bug you see?
> > if (!uarg) {
> > err = -ENOBUFS;
> > goto out_err;
> > @@ -1087,12 +1102,27 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
> > zc = MSG_ZEROCOPY;
> > else
> > uarg_to_msgzc(uarg)->zerocopy = 0;
> > +
> > + if (sockc_valid && sockc.dmabuf_id) {
> > + binding = net_devmem_get_binding(sk, sockc.dmabuf_id);
> > + if (IS_ERR(binding)) {
> > + err = PTR_ERR(binding);
> > + binding = NULL;
> > + goto out_err;
> > + }
> > + }
> > }
> > } else if (unlikely(msg->msg_flags & MSG_SPLICE_PAGES) && size) {
> > if (sk->sk_route_caps & NETIF_F_SG)
> > zc = MSG_SPLICE_PAGES;
> > }
> >
> > + if (sockc_valid && sockc.dmabuf_id &&
> > + (!(flags & MSG_ZEROCOPY) || !sock_flag(sk, SOCK_ZEROCOPY))) {
> > + err = -EINVAL;
> > + goto out_err;
> > + }
> > +
> > if (unlikely(flags & MSG_FASTOPEN ||
> > inet_test_bit(DEFER_CONNECT, sk)) &&
> > !tp->repair) {
> > @@ -1131,14 +1161,8 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
> > /* 'common' sending to sendq */
> > }
> >
> > - sockc = (struct sockcm_cookie) { .tsflags = READ_ONCE(sk->sk_tsflags)};
> > - if (msg->msg_controllen) {
> > - err = sock_cmsg_send(sk, msg, &sockc);
> > - if (unlikely(err)) {
> > - err = -EINVAL;
> > - goto out_err;
> > - }
> > - }
> > + if (!sockc_valid)
> > + goto out_err;
>
> Here 'err' could have been zeroed by tcp_sendmsg_fastopen(), and out_err
> could emit a wrong return value.
>
Good point indeed.
> Possibly it's better to keep the 'dmabuf_id' initialization out of
> sock_cmsg_send() in a separate helper could simplify the handling here?
>
This should be possible as well.
--
Thanks,
Mina
Powered by blists - more mailing lists