Message-ID: <53433089-7beb-46cf-ae8a-6c58cd909e31@redhat.com>
Date: Fri, 2 May 2025 13:47:04 +0200
From: Paolo Abeni <pabeni@...hat.com>
To: Mina Almasry <almasrymina@...gle.com>, netdev@...r.kernel.org,
 linux-kernel@...r.kernel.org, linux-doc@...r.kernel.org,
 io-uring@...r.kernel.org, virtualization@...ts.linux.dev,
 kvm@...r.kernel.org, linux-kselftest@...r.kernel.org
Cc: "David S. Miller" <davem@...emloft.net>,
 Eric Dumazet <edumazet@...gle.com>, Jakub Kicinski <kuba@...nel.org>,
 Simon Horman <horms@...nel.org>, Donald Hunter <donald.hunter@...il.com>,
 Jonathan Corbet <corbet@....net>, Andrew Lunn <andrew+netdev@...n.ch>,
 Jeroen de Borst <jeroendb@...gle.com>,
 Harshitha Ramamurthy <hramamurthy@...gle.com>,
 Kuniyuki Iwashima <kuniyu@...zon.com>, Willem de Bruijn
 <willemb@...gle.com>, Jens Axboe <axboe@...nel.dk>,
 Pavel Begunkov <asml.silence@...il.com>, David Ahern <dsahern@...nel.org>,
 Neal Cardwell <ncardwell@...gle.com>, "Michael S. Tsirkin" <mst@...hat.com>,
 Jason Wang <jasowang@...hat.com>, Xuan Zhuo <xuanzhuo@...ux.alibaba.com>,
 Eugenio Pérez <eperezma@...hat.com>,
 Stefan Hajnoczi <stefanha@...hat.com>,
 Stefano Garzarella <sgarzare@...hat.com>, Shuah Khan <shuah@...nel.org>,
 sdf@...ichev.me, dw@...idwei.uk, Jamal Hadi Salim <jhs@...atatu.com>,
 Victor Nogueira <victor@...atatu.com>, Pedro Tammela
 <pctammela@...atatu.com>, Samiullah Khawaja <skhawaja@...gle.com>,
 Kaiyuan Zhang <kaiyuanz@...gle.com>
Subject: Re: [PATCH net-next v13 4/9] net: devmem: Implement TX path

Hi,

On 4/29/25 5:26 AM, Mina Almasry wrote:
> Augment dmabuf binding to be able to handle TX. In addition to all the RX
> binding, we also create the tx_vec needed for the TX path.
> 
> Provide API for sendmsg to be able to send dmabufs bound to this device:
> 
> - Provide a new dmabuf_tx_cmsg which includes the dmabuf to send from.
> - MSG_ZEROCOPY with SCM_DEVMEM_DMABUF cmsg indicates send from dma-buf.
> 
> Devmem is uncopyable, so piggyback off the existing MSG_ZEROCOPY
> implementation, while disabling instances where MSG_ZEROCOPY falls back
> to copying.
> 
> We additionally pipe the binding down to the new
> zerocopy_fill_skb_from_devmem which fills a TX skb with net_iov netmems
> instead of the traditional page netmems.
> 
> We also special case skb_frag_dma_map to return the dma-address of these
> dmabuf net_iovs instead of attempting to map pages.
> 
> The TX path may release the dmabuf in a context where we cannot wait.
> This happens when the user unbinds a TX dmabuf while there are still
> references to its netmems in the TX path. In that case, the netmems will
> be put_netmem'd from a context where we can't unmap the dmabuf. Resolve
> this by making __net_devmem_dmabuf_binding_free schedule_work'd.
> 
> Based on work by Stanislav Fomichev <sdf@...ichev.me>. A lot of the meat
> of the implementation came from devmem TCP RFC v1[1], which included the
> TX path, but Stan did all the rebasing on top of netmem/net_iov.
> 
> Cc: Stanislav Fomichev <sdf@...ichev.me>
> Signed-off-by: Kaiyuan Zhang <kaiyuanz@...gle.com>
> Signed-off-by: Mina Almasry <almasrymina@...gle.com>
> Acked-by: Stanislav Fomichev <sdf@...ichev.me>

I'm sorry for the late feedback. I noticed a few things I had missed
before.

> @@ -701,6 +743,8 @@ int __zerocopy_sg_from_iter(struct msghdr *msg, struct sock *sk,
>  
>  	if (msg && msg->msg_ubuf && msg->sg_from_iter)
>  		ret = msg->sg_from_iter(skb, from, length);
> +	else if (unlikely(binding))

I'm unsure whether the unlikely() here (and in similar tests below) is
worthwhile: depending on the actual workload, this condition could be
very likely.
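
To illustrate the point, here is a minimal userspace sketch. The
unlikely() macro below is a stand-in built on the GCC/Clang
__builtin_expect() builtin, and struct binding_sketch, enum fill_path and
both pick_path_*() helpers are hypothetical names, not code from the
patch. The hint only affects how the compiler may lay out the branches;
the result is the same either way, so dropping it costs nothing in
correctness on workloads where devmem TX is the common case.

```c
#include <stddef.h>

/* Userspace stand-in for the kernel's hint macro (GCC/Clang builtin). */
#define unlikely(x) __builtin_expect(!!(x), 0)

/* Hypothetical stand-in for the dmabuf binding; not the kernel struct. */
struct binding_sketch {
	unsigned int id;
};

enum fill_path {
	FILL_FROM_PAGES,
	FILL_FROM_DEVMEM,
};

/* With the hint: the compiler may treat the devmem branch as cold. */
static enum fill_path pick_path_hinted(const struct binding_sketch *binding)
{
	if (unlikely(binding))
		return FILL_FROM_DEVMEM;
	return FILL_FROM_PAGES;
}

/* Without the hint: same result, branch layout left to the compiler. */
static enum fill_path pick_path_plain(const struct binding_sketch *binding)
{
	if (binding)
		return FILL_FROM_DEVMEM;
	return FILL_FROM_PAGES;
}
```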

[...]
> @@ -1066,11 +1067,24 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
>  	int flags, err, copied = 0;
>  	int mss_now = 0, size_goal, copied_syn = 0;
>  	int process_backlog = 0;
> +	bool sockc_valid = true;
>  	int zc = 0;
>  	long timeo;
>  
>  	flags = msg->msg_flags;
>  
> +	sockc = (struct sockcm_cookie){ .tsflags = READ_ONCE(sk->sk_tsflags),
> +					.dmabuf_id = 0 };

The '.dmabuf_id = 0' part is not needed, since members omitted from the
initializer are zeroed anyway, and the code is possibly clearer without
it.
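
As a quick userspace check of the C rule involved: members not named in
a compound literal's initializer list are zero-initialized. The struct
cookie_sketch and make_cookie() below are hypothetical, cut-down
stand-ins, not the real struct sockcm_cookie.

```c
/* Hypothetical cut-down cookie; the real struct sockcm_cookie differs. */
struct cookie_sketch {
	unsigned int tsflags;
	unsigned int dmabuf_id;
	unsigned int mark;
};

static struct cookie_sketch make_cookie(unsigned int tsflags)
{
	/* Only .tsflags is named; the remaining members are zero-initialized. */
	return (struct cookie_sketch){ .tsflags = tsflags };
}
```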

> +	if (msg->msg_controllen) {
> +		err = sock_cmsg_send(sk, msg, &sockc);
> +		if (unlikely(err))
> +			/* Don't return error until MSG_FASTOPEN has been
> +			 * processed; that may succeed even if the cmsg is
> +			 * invalid.
> +			 */
> +			sockc_valid = false;
> +	}
> +
>  	if ((flags & MSG_ZEROCOPY) && size) {
>  		if (msg->msg_ubuf) {
>  			uarg = msg->msg_ubuf;
> @@ -1078,7 +1092,8 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
>  				zc = MSG_ZEROCOPY;
>  		} else if (sock_flag(sk, SOCK_ZEROCOPY)) {
>  			skb = tcp_write_queue_tail(sk);
> -			uarg = msg_zerocopy_realloc(sk, size, skb_zcopy(skb));
> +			uarg = msg_zerocopy_realloc(sk, size, skb_zcopy(skb),
> +						    sockc_valid && !!sockc.dmabuf_id);

If sock_cmsg_send() failed and the user did not provide a dmabuf_id,
memory accounting will be incorrect.

>  			if (!uarg) {
>  				err = -ENOBUFS;
>  				goto out_err;
> @@ -1087,12 +1102,27 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
>  				zc = MSG_ZEROCOPY;
>  			else
>  				uarg_to_msgzc(uarg)->zerocopy = 0;
> +
> +			if (sockc_valid && sockc.dmabuf_id) {
> +				binding = net_devmem_get_binding(sk, sockc.dmabuf_id);
> +				if (IS_ERR(binding)) {
> +					err = PTR_ERR(binding);
> +					binding = NULL;
> +					goto out_err;
> +				}
> +			}
>  		}
>  	} else if (unlikely(msg->msg_flags & MSG_SPLICE_PAGES) && size) {
>  		if (sk->sk_route_caps & NETIF_F_SG)
>  			zc = MSG_SPLICE_PAGES;
>  	}
>  
> +	if (sockc_valid && sockc.dmabuf_id &&
> +	    (!(flags & MSG_ZEROCOPY) || !sock_flag(sk, SOCK_ZEROCOPY))) {
> +		err = -EINVAL;
> +		goto out_err;
> +	}
> +
>  	if (unlikely(flags & MSG_FASTOPEN ||
>  		     inet_test_bit(DEFER_CONNECT, sk)) &&
>  	    !tp->repair) {
> @@ -1131,14 +1161,8 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
>  		/* 'common' sending to sendq */
>  	}
>  
> -	sockc = (struct sockcm_cookie) { .tsflags = READ_ONCE(sk->sk_tsflags)};
> -	if (msg->msg_controllen) {
> -		err = sock_cmsg_send(sk, msg, &sockc);
> -		if (unlikely(err)) {
> -			err = -EINVAL;
> -			goto out_err;
> -		}
> -	}
> +	if (!sockc_valid)
> +		goto out_err;

Here 'err' could have been zeroed by tcp_sendmsg_fastopen(), and out_err
could emit a wrong return value.

Possibly keeping the 'dmabuf_id' initialization out of sock_cmsg_send(),
in a separate helper, could simplify the handling here?
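
The 'err' problem can be modelled in userspace as follows. This is only
a sketch under simplifying assumptions: send_flag_only() and
send_saved_err() are hypothetical names, the fastopen step is reduced to
"overwrite err with 0 on success", and the real out_err path does more
than return err. The second function shows one possible way to avoid the
clobbering, by keeping the cmsg error in its own variable; it is not
meant as the fix for the patch.

```c
#include <errno.h>
#include <stdbool.h>

/* Model of the patch's flow: a cmsg failure is remembered only as a flag. */
static int send_flag_only(int cmsg_err, bool fastopen_ok)
{
	bool sockc_valid = true;
	int err = 0;

	if (cmsg_err) {
		err = cmsg_err;
		sockc_valid = false;
	}

	/* Stand-in for a fastopen step that succeeds and overwrites err. */
	if (fastopen_ok)
		err = 0;

	if (!sockc_valid)
		goto out_err;

	return 1;	/* pretend one byte was queued */

out_err:
	return err;	/* may be 0 although the cmsg was invalid */
}

/* One possible alternative: keep the cmsg error in its own variable. */
static int send_saved_err(int cmsg_err, bool fastopen_ok)
{
	int sockc_err = cmsg_err;
	int err = 0;

	/* The same fastopen stand-in can no longer hide the cmsg error. */
	if (fastopen_ok)
		err = 0;

	if (sockc_err) {
		err = sockc_err;
		goto out_err;
	}

	return 1;	/* pretend one byte was queued */

out_err:
	return err;
}
```

With an invalid cmsg and a successful fastopen step, the first variant
returns 0 while the second still reports the original error.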

/P

