Message-ID: <CANn89i+nmkrXRzmHjO=2ioK-PsKMuhKGLbbV9QWSXw=hJ1EY6w@mail.gmail.com>
Date: Mon, 8 Apr 2024 11:46:34 +0200
From: Eric Dumazet <edumazet@...gle.com>
To: jmaloy@...hat.com
Cc: netdev@...r.kernel.org, davem@...emloft.net, kuba@...nel.org,
passt-dev@...st.top, sbrivio@...hat.com, lvivier@...hat.com,
dgibson@...hat.com, eric.dumazet@...il.com
Subject: Re: [net-next 1/2] tcp: add support for SO_PEEK_OFF socket option

On Sat, Apr 6, 2024 at 8:21 PM <jmaloy@...hat.com> wrote:
>
> From: Jon Maloy <jmaloy@...hat.com>
>
> When reading received messages from a socket with MSG_PEEK, we may want
> to read the contents with an offset, like we can do with pread/preadv()
> when reading files. Currently, it is not possible to do that.
>
> In this commit, we add support for the SO_PEEK_OFF socket option for TCP,
> similar to how it is done for Unix Domain sockets.
>
> In the iperf3 log examples shown below, we can observe a throughput
> improvement of 15-20 % in the direction host->namespace when using the
> protocol splicer 'pasta' (https://passt.top).
> This is a consistent result.
>
> pasta(1) and passt(1) implement user-mode networking for network
> namespaces (containers) and virtual machines by means of a translation
> layer between a Layer-2 network interface and native Layer-4 sockets
> (TCP, UDP, ICMP/ICMPv6 echo).
>
> Received, pending TCP data to the container/guest is kept in kernel
> buffers until acknowledged, so the tool routinely needs to fetch new
> data from the socket, skipping data that was already sent.
>
> At the moment this is implemented using a dummy buffer passed to
> recvmsg(). With this change, we don't need a dummy buffer and the
> related buffer copy (copy_to_user()) anymore.
>
> passt and pasta are supported in KubeVirt and libvirt/qemu.
>
> j
> -----------------------------------------------------------
> Server listening on 5201 (test #1)
> -----------------------------------------------------------
> Accepted connection from 192.168.122.1, port 52084
> [ 5] local 192.168.122.180 port 5201 connected to 192.168.122.1 port 52098
> [ ID] Interval Transfer Bitrate
> [ 5] 0.00-1.00 sec 1.32 GBytes 11.3 Gbits/sec
> [ 5] 1.00-2.00 sec 1.19 GBytes 10.2 Gbits/sec
> [ 5] 2.00-3.00 sec 1.26 GBytes 10.8 Gbits/sec
> [ 5] 3.00-4.00 sec 1.36 GBytes 11.7 Gbits/sec
> [ 5] 4.00-5.00 sec 1.33 GBytes 11.4 Gbits/sec
> [ 5] 5.00-6.00 sec 1.21 GBytes 10.4 Gbits/sec
> [ 5] 6.00-7.00 sec 1.31 GBytes 11.2 Gbits/sec
> [ 5] 7.00-8.00 sec 1.25 GBytes 10.7 Gbits/sec
> [ 5] 8.00-9.00 sec 1.33 GBytes 11.5 Gbits/sec
> [ 5] 9.00-10.00 sec 1.24 GBytes 10.7 Gbits/sec
> [ 5] 10.00-10.04 sec 56.0 MBytes 12.1 Gbits/sec
> - - - - - - - - - - - - - - - - - - - - - - - - -
> [ ID] Interval Transfer Bitrate
> [ 5] 0.00-10.04 sec 12.9 GBytes 11.0 Gbits/sec receiver
> -----------------------------------------------------------
> Server listening on 5201 (test #2)
> -----------------------------------------------------------
> ^Ciperf3: interrupt - the server has terminated
> logout
> [ perf record: Woken up 20 times to write data ]
> [ perf record: Captured and wrote 5.040 MB perf.data (33411 samples) ]
> jmaloy@...yr:~/passt$
>
> The perf record confirms this result. Below, we can observe that the
> CPU spends significantly less time in the function ____sys_recvmsg()
> when we have offset support.
>
> Without offset support:
> ----------------------
> jmaloy@...yr:~/passt$ perf report -q --symbol-filter=do_syscall_64 \
> -p ____sys_recvmsg -x --stdio -i perf.data | head -1
> 46.32% 0.00% passt.avx2 [kernel.vmlinux] [k] do_syscall_64 ____sys_recvmsg
>
> With offset support:
> ----------------------
> jmaloy@...yr:~/passt$ perf report -q --symbol-filter=do_syscall_64 \
> -p ____sys_recvmsg -x --stdio -i perf.data | head -1
> 28.12% 0.00% passt.avx2 [kernel.vmlinux] [k] do_syscall_64 ____sys_recvmsg
>
> Suggested-by: Paolo Abeni <pabeni@...hat.com>
> Reviewed-by: Stefano Brivio <sbrivio@...hat.com>
> Signed-off-by: Jon Maloy <jmaloy@...hat.com>
>
> ---
> v3: - Applied changes suggested by Stefano Brivio and Paolo Abeni
> v4: - Same as v3. Posting was delayed because I first had to debug
> an issue that turned out to not be directly related to this
> change. See next commit in this series.

This other issue is orthogonal, and might take more time.

SO_RCVLOWAT had a similar issue, please take a look at what we did there.

If you need SO_PEEK_OFF support, I would suggest you submit this patch
as a standalone one.

Reviewed-by: Eric Dumazet <edumazet@...gle.com>

Thanks.