lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date: Fri, 3 Nov 2023 07:56:45 +0100
From: Jiri Slaby <jirislaby@...nel.org>
To: Eric Dumazet <edumazet@...gle.com>, "David S . Miller"
 <davem@...emloft.net>, Jakub Kicinski <kuba@...nel.org>,
 Paolo Abeni <pabeni@...hat.com>
Cc: netdev@...r.kernel.org, Soheil Hassas Yeganeh <soheil@...gle.com>,
 Neal Cardwell <ncardwell@...gle.com>, Yuchung Cheng <ycheng@...gle.com>,
 eric.dumazet@...il.com
Subject: Re: [PATCH net-next] tcp: get rid of sysctl_tcp_adv_win_scale

On 03. 11. 23, 7:10, Jiri Slaby wrote:
> On 17. 07. 23, 17:29, Eric Dumazet wrote:
>> With modern NIC drivers shifting to full page allocations per
>> received frame, we face the following issue:
>>
>> TCP has one per-netns sysctl used to tweak how to translate
>> a memory use into an expected payload (RWIN), in RX path.
>>
>> tcp_win_from_space() implementation is limited to few cases.
>>
>> For hosts dealing with various MSS, we either under estimate
>> or over estimate the RWIN we send to the remote peers.
>>
>> For instance with the default sysctl_tcp_adv_win_scale value,
>> we expect to store 50% of payload per allocated chunk of memory.
>>
>> For the typical use of MTU=1500 traffic, and order-0 pages allocations
>> by NIC drivers, we are sending too big RWIN, leading to potential
>> tcp collapse operations, which are extremely expensive and source
>> of latency spikes.
>>
>> This patch makes sysctl_tcp_adv_win_scale obsolete, and instead
>> uses a per socket scaling factor, so that we can precisely
>> adjust the RWIN based on effective skb->len/skb->truesize ratio.
>>
>> This patch alone can double TCP receive performance when receivers
>> are too slow to drain their receive queue, or by allowing
>> a bigger RWIN when MSS is close to PAGE_SIZE.
> 
> Hi,
> 
> I bisected a python-eventlet test failure:
>  > =================================== FAILURES 
> ===================================
>  > _______________________ TestGreenSocket.test_full_duplex 
> _______________________
>  >
>  > self = <tests.greenio_test.TestGreenSocket testMethod=test_full_duplex>
>  >
>  >     def test_full_duplex(self):
>  > ...
>  > >       large_evt.wait()
>  >
>  > tests/greenio_test.py:424:
>  > _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
> _ _ _ _ _
>  > eventlet/greenthread.py:181: in wait
>  >     return self._exit_event.wait()
>  > eventlet/event.py:125: in wait
>  >     result = hub.switch()
> ...
>  > E       tests.TestIsTakingTooLong: 1
>  >
>  > eventlet/hubs/hub.py:313: TestIsTakingTooLong
> 
> to this commit. With the commit, the test takes > 1.5 s. Without the 
> commit it takes only < 300 ms. And they set timeout to 1 s.
> 
> The reduced self-stadning test case:
> #!/usr/bin/python3
> import eventlet
> from eventlet.green import select, socket, time, ssl
> 
> def bufsized(sock, size=1):
>      """ Resize both send and receive buffers on a socket.
>      Useful for testing trampoline.  Returns the socket.
> 
>      >>> import socket
>      >>> sock = bufsized(socket.socket(socket.AF_INET, socket.SOCK_STREAM))
>      """
>      sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, size)
>      sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, size)
>      return sock
> 
> def min_buf_size():
>      """Return the minimum buffer size that the platform supports."""
>      test_sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
>      test_sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, 1)
>      return test_sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF)
> 
> def test_full_duplex():
>      large_data = b'*' * 10 * min_buf_size()
>      listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
>      listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
>      listener.bind(('127.0.0.1', 0))
>      listener.listen(50)
>      bufsized(listener)
> 
>      def send_large(sock):
>          sock.sendall(large_data)
> 
>      def read_large(sock):
>          result = sock.recv(len(large_data))
>          while len(result) < len(large_data):
>              result += sock.recv(len(large_data))
>          assert result == large_data
> 
>      def server():
>          (sock, addr) = listener.accept()
>          sock = bufsized(sock)
>          send_large_coro = eventlet.spawn(send_large, sock)
>          eventlet.sleep(0)
>          result = sock.recv(10)
>          expected = b'hello world'
>          while len(result) < len(expected):
>              result += sock.recv(10)
>          assert result == expected
>          send_large_coro.wait()
> 
>      server_evt = eventlet.spawn(server)
>      client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
>      client.connect(('127.0.0.1', listener.getsockname()[1]))
>      bufsized(client)
>      large_evt = eventlet.spawn(read_large, client)
>      eventlet.sleep(0)
>      client.sendall(b'hello world')
>      server_evt.wait()
>      large_evt.wait()
>      client.close()
> 
> test_full_duplex()
> 
> =====================================
> 
> I speak neither python nor networking, so any ideas :)? Is the test 
> simply wrong?

strace -rT -e trace=network:

GOOD:
 > 0.000000 socket(AF_INET, SOCK_STREAM|SOCK_CLOEXEC, IPPROTO_IP) = 3 
<0.000063>
 > 0.000406 setsockopt(3, SOL_SOCKET, SO_SNDBUF, [1], 4) = 0 <0.000042>
 > 0.000097 getsockopt(3, SOL_SOCKET, SO_SNDBUF, [4608], [4]) = 0 
<0.000012>
 > 0.000101 socket(AF_INET, SOCK_STREAM|SOCK_CLOEXEC, IPPROTO_IP) = 3 
<0.000015>
 > 0.000058 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0 <0.000009>
 > 0.000035 bind(3, {sa_family=AF_INET, sin_port=htons(0), 
sin_addr=inet_addr("127.0.0.1")}, 16) = 0 <0.000027>
 > 0.000058 listen(3, 50)       = 0 <0.000014>
 > 0.000029 setsockopt(3, SOL_SOCKET, SO_SNDBUF, [1], 4) = 0 <0.000009>
 > 0.000023 setsockopt(3, SOL_SOCKET, SO_RCVBUF, [1], 4) = 0 <0.000008>
 > 0.000052 socket(AF_INET, SOCK_STREAM|SOCK_CLOEXEC, IPPROTO_IP) = 5 
<0.000014>
 > 0.000050 getsockname(3, {sa_family=AF_INET, sin_port=htons(44313), 
sin_addr=inet_addr("127.0.0.1")}, [16]) = 0 <0.000011>
 > 0.000037 connect(5, {sa_family=AF_INET, sin_port=htons(44313), 
sin_addr=inet_addr("127.0.0.1")}, 16) = -1 EINPROGRESS (Operation now in 
progress) <0.000070>
 > 0.000210 accept4(3, {sa_family=AF_INET, sin_port=htons(56062), 
sin_addr=inet_addr("127.0.0.1")}, [16], SOCK_CLOEXEC) = 6 <0.000012>
 > 0.000040 getsockname(6, {sa_family=AF_INET, sin_port=htons(44313), 
sin_addr=inet_addr("127.0.0.1")}, [128 => 16]) = 0 <0.000007>
 > 0.000062 setsockopt(6, SOL_SOCKET, SO_SNDBUF, [1], 4) = 0 <0.000007>
 > 0.000020 setsockopt(6, SOL_SOCKET, SO_RCVBUF, [1], 4) = 0 <0.000007>
 > 0.000082 getsockopt(5, SOL_SOCKET, SO_ERROR, [0], [4]) = 0 <0.000007>
 > 0.000023 connect(5, {sa_family=AF_INET, sin_port=htons(44313), 
sin_addr=inet_addr("127.0.0.1")}, 16) = 0 <0.000008>
 > 0.000022 setsockopt(5, SOL_SOCKET, SO_SNDBUF, [1], 4) = 0 <0.000020>
 > 0.000036 setsockopt(5, SOL_SOCKET, SO_RCVBUF, [1], 4) = 0 <0.000007>
 > 0.000061 sendto(6, "********************************"..., 46080, 0, 
NULL, 0) = 32768 <0.000049>
 > 0.000135 sendto(6, "********************************"..., 13312, 0, 
NULL, 0) = 13312 <0.000017>
 > 0.000087 recvfrom(6, 0x7f78e58af890, 10, 0, NULL, NULL) = -1 EAGAIN 
(Resource temporarily unavailable) <0.000010>
 > 0.000125 recvfrom(5, "********************************"..., 46080, 0, 
NULL, NULL) = 32768 <0.000032>
 > 0.000066 recvfrom(5, 0x55fdb41bb880, 46080, 0, NULL, NULL) = -1 
EAGAIN (Resource temporarily unavailable) <0.000011>
 > 0.000075 sendto(5, "hello world", 11, 0, NULL, 0) = 11 <0.000023>
 > 0.000117 recvfrom(6, "hello worl", 10, 0, NULL, NULL) = 10 <0.000015>
 > 0.000050 recvfrom(6, "d", 10, 0, NULL, NULL) = 1 <0.000011>
 > 0.000212 recvfrom(5, "********************************"..., 46080, 0, 
NULL, NULL) = 13312 <0.000019>
 > 0.050676 +++ exited with 0 +++


BAD:
 > 0.000000 socket(AF_INET, SOCK_STREAM|SOCK_CLOEXEC, IPPROTO_IP) = 3 
<0.000045>
 > 0.000244 setsockopt(3, SOL_SOCKET, SO_SNDBUF, [1], 4) = 0 <0.000015>
 > 0.000057 getsockopt(3, SOL_SOCKET, SO_SNDBUF, [4608], [4]) = 0 <0.000013>
 > 0.000104 socket(AF_INET, SOCK_STREAM|SOCK_CLOEXEC, IPPROTO_IP) = 3 
<0.000016>
 > 0.000065 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0 <0.000010>
 > 0.000038 bind(3, {sa_family=AF_INET, sin_port=htons(0), 
sin_addr=inet_addr("127.0.0.1")}, 16) = 0 <0.000031>
 > 0.000068 listen(3, 50)       = 0 <0.000014>
 > 0.000032 setsockopt(3, SOL_SOCKET, SO_SNDBUF, [1], 4) = 0 <0.000010>
 > 0.000030 setsockopt(3, SOL_SOCKET, SO_RCVBUF, [1], 4) = 0 <0.000018>
 > 0.000060 socket(AF_INET, SOCK_STREAM|SOCK_CLOEXEC, IPPROTO_IP) = 5 
<0.000023>
 > 0.000071 getsockname(3, {sa_family=AF_INET, sin_port=htons(45901), 
sin_addr=inet_addr("127.0.0.1")}, [16]) = 0 <0.000019>
 > 0.000068 connect(5, {sa_family=AF_INET, sin_port=htons(45901), 
sin_addr=inet_addr("127.0.0.1")}, 16) = -1 EINPROGRESS (Operation now in 
progress) <0.000074>
 > 0.000259 accept4(3, {sa_family=AF_INET, sin_port=htons(35002), 
sin_addr=inet_addr("127.0.0.1")}, [16], SOCK_CLOEXEC) = 6 <0.000014>
 > 0.000051 getsockname(6, {sa_family=AF_INET, sin_port=htons(45901), 
sin_addr=inet_addr("127.0.0.1")}, [128 => 16]) = 0 <0.000010>
 > 0.000082 setsockopt(6, SOL_SOCKET, SO_SNDBUF, [1], 4) = 0 <0.000009>
 > 0.000040 setsockopt(6, SOL_SOCKET, SO_RCVBUF, [1], 4) = 0 <0.000009>
 > 0.000104 getsockopt(5, SOL_SOCKET, SO_ERROR, [0], [4]) = 0 <0.000009>
 > 0.000028 connect(5, {sa_family=AF_INET, sin_port=htons(45901), 
sin_addr=inet_addr("127.0.0.1")}, 16) = 0 <0.000009>
 > 0.000026 setsockopt(5, SOL_SOCKET, SO_SNDBUF, [1], 4) = 0 <0.000009>
 > 0.000024 setsockopt(5, SOL_SOCKET, SO_RCVBUF, [1], 4) = 0 <0.000008>
 > 0.000071 sendto(6, "********************************"..., 46080, 0, 
NULL, 0) = 16640 <0.000026>
 > 0.000117 sendto(6, "********************************"..., 29440, 0, 
NULL, 0) = 16640 <0.000017>
 > 0.000041 sendto(6, "********************************"..., 12800, 0, 
NULL, 0) = -1 EAGAIN (Resource temporarily unavailable) <0.000009>
 > 0.000075 recvfrom(6, 0x7f4db88a38c0, 10, 0, NULL, NULL) = -1 EAGAIN 
(Resource temporarily unavailable) <0.000010>
 > 0.000086 recvfrom(5, "********************************"..., 46080, 0, 
NULL, NULL) = 16640 <0.000018>
 > 0.000044 recvfrom(5, 0x55a64d59b2a0, 46080, 0, NULL, NULL) = -1 
EAGAIN (Resource temporarily unavailable) <0.000009>
 > 0.000059 sendto(5, "hello world", 11, 0, NULL, 0) = 11 <0.000018>
 > 0.000093 recvfrom(6, "hello worl", 10, 0, NULL, NULL) = 10 <0.000009>
 > 0.000029 recvfrom(6, "d", 10, 0, NULL, NULL) = 1 <0.000009>
 > 0.206685 recvfrom(5, "********************************"..., 46080, 0, 
NULL, NULL) = 16640 <0.000116>
 > 0.000306 recvfrom(5, 0x55a64d5a7600, 46080, 0, NULL, NULL) = -1 
EAGAIN (Resource temporarily unavailable) <0.000013>
 > 0.000208 sendto(6, "********************************"..., 12800, 0, 
NULL, 0) = 12800 <0.000025>
 > 0.206317 recvfrom(5, "********************************"..., 46080, 0, 
NULL, NULL) = 2304 <0.000171>
 > 0.000304 recvfrom(5, 0x55a64d597170, 46080, 0, NULL, NULL) = -1 
EAGAIN (Resource temporarily unavailable) <0.000029>
 > 0.206161 recvfrom(5, "********************************"..., 46080, 0, 
NULL, NULL) = 2304 <0.000082>
 > 0.000212 recvfrom(5, 0x55a64d5a0ed0, 46080, 0, NULL, NULL) = -1 
EAGAIN (Resource temporarily unavailable) <0.000034>
 > 0.206572 recvfrom(5, "********************************"..., 46080, 0, 
NULL, NULL) = 2304 <0.000146>
 > 0.000274 recvfrom(5, 0x55a64d597170, 46080, 0, NULL, NULL) = -1 
EAGAIN (Resource temporarily unavailable) <0.000029>
 > 0.206604 recvfrom(5, "********************************"..., 46080, 0, 
NULL, NULL) = 2304 <0.000162>
 > 0.000270 recvfrom(5, 0x55a64d5a20d0, 46080, 0, NULL, NULL) = -1 
EAGAIN (Resource temporarily unavailable) <0.000016>
 > 0.206164 recvfrom(5, "********************************"..., 46080, 0, 
NULL, NULL) = 2304 <0.000116>
 > 0.000291 recvfrom(5, "********************************"..., 46080, 0, 
NULL, NULL) = 1280 <0.000038>
 > 0.052224 +++ exited with 0 +++

I.e. recvfrom() returns -EAGAIN and takes 200 ms.


-- 
js
suse labs


Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ