lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <df3045c3ec7a4b3c417699ff4950d3d977a0a944.camel@redhat.com>
Date: Tue, 16 Jan 2024 11:49:12 +0100
From: Paolo Abeni <pabeni@...hat.com>
To: jmaloy@...hat.com, netdev@...r.kernel.org, davem@...emloft.net
Cc: kuba@...nel.org, passt-dev@...st.top, sbrivio@...hat.com,
 lvivier@...hat.com,  dgibson@...hat.com
Subject: Re: [RFC net-next] tcp: add support for read with offset when using
 MSG_PEEK

On Thu, 2024-01-11 at 18:00 -0500, jmaloy@...hat.com wrote:
> From: Jon Maloy <jmaloy@...hat.com>
> 
> When reading received messages from a socket with MSG_PEEK, we may want
> to read the contents with an offset, like we can do with pread/preadv()
> when reading files. Currently, it is not possible to do that.
> 
> In this commit, we allow the user to set iovec.iov_base in the first
> vector entry to NULL. This tells the socket to skip the first entry,
> hence letting the iov_len field of that entry indicate the offset value.
> This way, there is no need to add any new arguments or flags.
> 
> In the iperf3 log examples shown below, we can observe a throughput
> improvement of ~15 % in the direction host->namespace when using the
> protocol splicer 'pasta' (https://passt.top).
> This is a consistent result.
> 
> pasta(1) and passt(1) implement user-mode networking for network
> namespaces (containers) and virtual machines by means of a translation
> layer between Layer-2 network interface and native Layer-4 sockets
> (TCP, UDP, ICMP/ICMPv6 echo).
> 
> Received, pending TCP data to the container/guest is kept in kernel
> buffers until acknowledged, so the tool routinely needs to fetch new
> data from socket, skipping data that was already sent.
> 
> At the moment this is implemented using a dummy buffer passed to
> recvmsg(). With this change, we don't need a dummy buffer and the
> related buffer copy (copy_to_user()) anymore.
> 
> passt and pasta are supported in KubeVirt and libvirt/qemu.
> 
> jmaloy@...yr:~/passt$ perf record -g ./pasta --config-net -f
> MSG_PEEK with offset not supported by kernel.
> 
> jmaloy@...yr:~/passt# iperf3 -s
> -----------------------------------------------------------
> Server listening on 5201 (test #1)
> -----------------------------------------------------------
> Accepted connection from 192.168.122.1, port 44822
> [  5] local 192.168.122.180 port 5201 connected to 192.168.122.1 port 44832
> [ ID] Interval           Transfer     Bitrate
> [  5]   0.00-1.00   sec  1.02 GBytes  8.78 Gbits/sec
> [  5]   1.00-2.00   sec  1.06 GBytes  9.08 Gbits/sec
> [  5]   2.00-3.00   sec  1.07 GBytes  9.15 Gbits/sec
> [  5]   3.00-4.00   sec  1.10 GBytes  9.46 Gbits/sec
> [  5]   4.00-5.00   sec  1.03 GBytes  8.85 Gbits/sec
> [  5]   5.00-6.00   sec  1.10 GBytes  9.44 Gbits/sec
> [  5]   6.00-7.00   sec  1.11 GBytes  9.56 Gbits/sec
> [  5]   7.00-8.00   sec  1.07 GBytes  9.20 Gbits/sec
> [  5]   8.00-9.00   sec   667 MBytes  5.59 Gbits/sec
> [  5]   9.00-10.00  sec  1.03 GBytes  8.83 Gbits/sec
> [  5]  10.00-10.04  sec  30.1 MBytes  6.36 Gbits/sec
> - - - - - - - - - - - - - - - - - - - - - - - - -
> [ ID] Interval           Transfer     Bitrate
> [  5]   0.00-10.04  sec  10.3 GBytes  8.78 Gbits/sec   receiver
> -----------------------------------------------------------
> Server listening on 5201 (test #2)
> -----------------------------------------------------------
> ^Ciperf3: interrupt - the server has terminated
> jmaloy@...yr:~/passt#
> logout
> [ perf record: Woken up 23 times to write data ]
> [ perf record: Captured and wrote 5.696 MB perf.data (35580 samples) ]
> jmaloy@...yr:~/passt$
> 
> jmaloy@...yr:~/passt$ perf record -g ./pasta --config-net -f
> MSG_PEEK with offset supported by kernel.
> 
> jmaloy@...yr:~/passt# iperf3 -s
> -----------------------------------------------------------
> Server listening on 5201 (test #1)
> -----------------------------------------------------------
> Accepted connection from 192.168.122.1, port 40854
> [  5] local 192.168.122.180 port 5201 connected to 192.168.122.1 port 40862
> [ ID] Interval           Transfer     Bitrate
> [  5]   0.00-1.00   sec  1.22 GBytes  10.5 Gbits/sec
> [  5]   1.00-2.00   sec  1.19 GBytes  10.2 Gbits/sec
> [  5]   2.00-3.00   sec  1.22 GBytes  10.5 Gbits/sec
> [  5]   3.00-4.00   sec  1.11 GBytes  9.56 Gbits/sec
> [  5]   4.00-5.00   sec  1.20 GBytes  10.3 Gbits/sec
> [  5]   5.00-6.00   sec  1.14 GBytes  9.80 Gbits/sec
> [  5]   6.00-7.00   sec  1.17 GBytes  10.0 Gbits/sec
> [  5]   7.00-8.00   sec  1.12 GBytes  9.61 Gbits/sec
> [  5]   8.00-9.00   sec  1.13 GBytes  9.74 Gbits/sec
> [  5]   9.00-10.00  sec  1.26 GBytes  10.8 Gbits/sec
> - - - - - - - - - - - - - - - - - - - - - - - - -
> [ ID] Interval           Transfer     Bitrate
> [  5]   0.00-10.04  sec  11.8 GBytes  10.1 Gbits/sec   receiver
> -----------------------------------------------------------
> Server listening on 5201 (test #2)
> -----------------------------------------------------------
> ^Ciperf3: interrupt - the server has terminated
> logout
> [ perf record: Woken up 20 times to write data ]
> [ perf record: Captured and wrote 5.040 MB perf.data (33411 samples) ]
> jmaloy@...yr:~/passt$
> 
> The perf record confirms this result. Below, we can observe that the
> CPU spends significantly less time in the function ____sys_recvmsg()
> when we have offset support.
> 
> Without offset support:
> ----------------------
> jmaloy@...yr:~/passt$ perf report -q --symbol-filter=do_syscall_64 -p ____sys_recvmsg -x --stdio -i  perf.data | head -1
>     46.32%     0.00%  passt.avx2  [kernel.vmlinux]  [k] do_syscall_64  ____sys_recvmsg
> 
> With offset support:
> ----------------------
> jmaloy@...yr:~/passt$ perf report -q --symbol-filter=do_syscall_64 -p ____sys_recvmsg -x --stdio -i  perf.data | head -1
>    27.24%     0.00%  passt.avx2  [kernel.vmlinux]  [k] do_syscall_64  ____sys_recvmsg
> 
> Reviewed-by: Stefano Brivio <sbrivio@...hat.com>
> Signed-off-by: Jon Maloy <jmaloy@...hat.com>
> ---
>  net/ipv4/tcp.c | 14 ++++++++++++++
>  1 file changed, 14 insertions(+)
> 
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index 1baa484d2190..82e1da3f0f98 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -2351,6 +2351,20 @@ static int tcp_recvmsg_locked(struct sock *sk, struct msghdr *msg, size_t len,
>  	if (flags & MSG_PEEK) {
>  		peek_seq = tp->copied_seq;
>  		seq = &peek_seq;
> +		if (!msg->msg_iter.__iov[0].iov_base) {
> +			size_t peek_offset;
> +
> +			if (msg->msg_iter.nr_segs < 2) {
> +				err = -EINVAL;
> +				goto out;
> +			}
> +			peek_offset = msg->msg_iter.__iov[0].iov_len;
> +			msg->msg_iter.__iov = &msg->msg_iter.__iov[1];
> +			msg->msg_iter.nr_segs -= 1;
> +			msg->msg_iter.count -= peek_offset;
> +			len -= peek_offset;
> +			*seq += peek_offset;
> +		}

IMHO this does not look like the correct interface to expose such
functionality. Doing the same with a different protocol should cause a
SIGSEG or the like, right?

What about using/implementing SO_PEEK_OFF support instead? 

Cheers,

Paolo


Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ