lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CANn89iJOwQUwAVcofW+X_8srFcPnaWKyqOoM005L6Zgh8=OvpA@mail.gmail.com>
Date: Fri, 3 Nov 2023 09:17:27 +0100
From: Eric Dumazet <edumazet@...gle.com>
To: Jiri Slaby <jirislaby@...nel.org>
Cc: "David S . Miller" <davem@...emloft.net>, Jakub Kicinski <kuba@...nel.org>, 
	Paolo Abeni <pabeni@...hat.com>, netdev@...r.kernel.org, 
	Soheil Hassas Yeganeh <soheil@...gle.com>, Neal Cardwell <ncardwell@...gle.com>, 
	Yuchung Cheng <ycheng@...gle.com>, eric.dumazet@...il.com
Subject: Re: [PATCH net-next] tcp: get rid of sysctl_tcp_adv_win_scale

On Fri, Nov 3, 2023 at 8:07 AM Jiri Slaby <jirislaby@...nel.org> wrote:
>
> On 03. 11. 23, 7:56, Jiri Slaby wrote:
> > On 03. 11. 23, 7:10, Jiri Slaby wrote:
> >> On 17. 07. 23, 17:29, Eric Dumazet wrote:
> >>> With modern NIC drivers shifting to full page allocations per
> >>> received frame, we face the following issue:
> >>>
> >>> TCP has one per-netns sysctl used to tweak how to translate
> >>> a memory use into an expected payload (RWIN), in RX path.
> >>>
> >>> tcp_win_from_space() implementation is limited to few cases.
> >>>
> >>> For hosts dealing with various MSS, we either under estimate
> >>> or over estimate the RWIN we send to the remote peers.
> >>>
> >>> For instance with the default sysctl_tcp_adv_win_scale value,
> >>> we expect to store 50% of payload per allocated chunk of memory.
> >>>
> >>> For the typical use of MTU=1500 traffic, and order-0 pages allocations
> >>> by NIC drivers, we are sending too big RWIN, leading to potential
> >>> tcp collapse operations, which are extremely expensive and source
> >>> of latency spikes.
> >>>
> >>> This patch makes sysctl_tcp_adv_win_scale obsolete, and instead
> >>> uses a per socket scaling factor, so that we can precisely
> >>> adjust the RWIN based on effective skb->len/skb->truesize ratio.
> >>>
> >>> This patch alone can double TCP receive performance when receivers
> >>> are too slow to drain their receive queue, or by allowing
> >>> a bigger RWIN when MSS is close to PAGE_SIZE.
> >>
> >> Hi,
> >>
> >> I bisected a python-eventlet test failure:
> >>  > =================================== FAILURES
> >> ===================================
> >>  > _______________________ TestGreenSocket.test_full_duplex
> >> _______________________
> >>  >
> >>  > self = <tests.greenio_test.TestGreenSocket
> >> testMethod=test_full_duplex>
> >>  >
> >>  >     def test_full_duplex(self):
> >>  > ...
> >>  > >       large_evt.wait()
> >>  >
> >>  > tests/greenio_test.py:424:
> >>  > _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
> >> _ _ _ _ _ _
> >>  > eventlet/greenthread.py:181: in wait
> >>  >     return self._exit_event.wait()
> >>  > eventlet/event.py:125: in wait
> >>  >     result = hub.switch()
> >> ...
> >>  > E       tests.TestIsTakingTooLong: 1
> >>  >
> >>  > eventlet/hubs/hub.py:313: TestIsTakingTooLong
> >>
> >> to this commit. With the commit, the test takes > 1.5 s. Without the
> >> commit it takes only < 300 ms. And they set timeout to 1 s.
> >>
> >> The reduced self-stadning test case:
> >> #!/usr/bin/python3
> >> import eventlet
> >> from eventlet.green import select, socket, time, ssl
> >>
> >> def bufsized(sock, size=1):
> >>      """ Resize both send and receive buffers on a socket.
> >>      Useful for testing trampoline.  Returns the socket.
> >>
> >>      >>> import socket
> >>      >>> sock = bufsized(socket.socket(socket.AF_INET,
> >> socket.SOCK_STREAM))
> >>      """
> >>      sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, size)
> >>      sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, size)
> >>      return sock
> >>
> >> def min_buf_size():
> >>      """Return the minimum buffer size that the platform supports."""
> >>      test_sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
> >>      test_sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, 1)
> >>      return test_sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF)
> >>
> >> def test_full_duplex():
> >>      large_data = b'*' * 10 * min_buf_size()
> >>      listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
> >>      listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
> >>      listener.bind(('127.0.0.1', 0))
> >>      listener.listen(50)
> >>      bufsized(listener)
> >>
> >>      def send_large(sock):
> >>          sock.sendall(large_data)
> >>
> >>      def read_large(sock):
> >>          result = sock.recv(len(large_data))
> >>          while len(result) < len(large_data):
> >>              result += sock.recv(len(large_data))
> >>          assert result == large_data
> >>
> >>      def server():
> >>          (sock, addr) = listener.accept()
> >>          sock = bufsized(sock)
> >>          send_large_coro = eventlet.spawn(send_large, sock)
> >>          eventlet.sleep(0)
> >>          result = sock.recv(10)
> >>          expected = b'hello world'
> >>          while len(result) < len(expected):
> >>              result += sock.recv(10)
> >>          assert result == expected
> >>          send_large_coro.wait()
> >>
> >>      server_evt = eventlet.spawn(server)
> >>      client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
> >>      client.connect(('127.0.0.1', listener.getsockname()[1]))
> >>      bufsized(client)
> >>      large_evt = eventlet.spawn(read_large, client)
> >>      eventlet.sleep(0)
> >>      client.sendall(b'hello world')
> >>      server_evt.wait()
> >>      large_evt.wait()
> >>      client.close()
> >>
> >> test_full_duplex()
> >>
> >> =====================================
> >>
> >> I speak neither python nor networking, so any ideas :)? Is the test
> >> simply wrong?
> >
> > strace -rT -e trace=network:
> >
> > GOOD:
> >  > 0.000000 socket(AF_INET, SOCK_STREAM|SOCK_CLOEXEC, IPPROTO_IP) = 3
> > <0.000063>
> >  > 0.000406 setsockopt(3, SOL_SOCKET, SO_SNDBUF, [1], 4) = 0 <0.000042>
> >  > 0.000097 getsockopt(3, SOL_SOCKET, SO_SNDBUF, [4608], [4]) = 0
> > <0.000012>
> >  > 0.000101 socket(AF_INET, SOCK_STREAM|SOCK_CLOEXEC, IPPROTO_IP) = 3
> > <0.000015>
> >  > 0.000058 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0 <0.000009>
> >  > 0.000035 bind(3, {sa_family=AF_INET, sin_port=htons(0),
> > sin_addr=inet_addr("127.0.0.1")}, 16) = 0 <0.000027>
> >  > 0.000058 listen(3, 50)       = 0 <0.000014>
> >  > 0.000029 setsockopt(3, SOL_SOCKET, SO_SNDBUF, [1], 4) = 0 <0.000009>
> >  > 0.000023 setsockopt(3, SOL_SOCKET, SO_RCVBUF, [1], 4) = 0 <0.000008>
> >  > 0.000052 socket(AF_INET, SOCK_STREAM|SOCK_CLOEXEC, IPPROTO_IP) = 5
> > <0.000014>
> >  > 0.000050 getsockname(3, {sa_family=AF_INET, sin_port=htons(44313),
> > sin_addr=inet_addr("127.0.0.1")}, [16]) = 0 <0.000011>
> >  > 0.000037 connect(5, {sa_family=AF_INET, sin_port=htons(44313),
> > sin_addr=inet_addr("127.0.0.1")}, 16) = -1 EINPROGRESS (Operation now in
> > progress) <0.000070>
> >  > 0.000210 accept4(3, {sa_family=AF_INET, sin_port=htons(56062),
> > sin_addr=inet_addr("127.0.0.1")}, [16], SOCK_CLOEXEC) = 6 <0.000012>
> >  > 0.000040 getsockname(6, {sa_family=AF_INET, sin_port=htons(44313),
> > sin_addr=inet_addr("127.0.0.1")}, [128 => 16]) = 0 <0.000007>
> >  > 0.000062 setsockopt(6, SOL_SOCKET, SO_SNDBUF, [1], 4) = 0 <0.000007>
> >  > 0.000020 setsockopt(6, SOL_SOCKET, SO_RCVBUF, [1], 4) = 0 <0.000007>
> >  > 0.000082 getsockopt(5, SOL_SOCKET, SO_ERROR, [0], [4]) = 0 <0.000007>
> >  > 0.000023 connect(5, {sa_family=AF_INET, sin_port=htons(44313),
> > sin_addr=inet_addr("127.0.0.1")}, 16) = 0 <0.000008>
> >  > 0.000022 setsockopt(5, SOL_SOCKET, SO_SNDBUF, [1], 4) = 0 <0.000020>
> >  > 0.000036 setsockopt(5, SOL_SOCKET, SO_RCVBUF, [1], 4) = 0 <0.000007>
> >  > 0.000061 sendto(6, "********************************"..., 46080, 0,
> > NULL, 0) = 32768 <0.000049>
> >  > 0.000135 sendto(6, "********************************"..., 13312, 0,
> > NULL, 0) = 13312 <0.000017>
> >  > 0.000087 recvfrom(6, 0x7f78e58af890, 10, 0, NULL, NULL) = -1 EAGAIN
> > (Resource temporarily unavailable) <0.000010>
> >  > 0.000125 recvfrom(5, "********************************"..., 46080, 0,
> > NULL, NULL) = 32768 <0.000032>
> >  > 0.000066 recvfrom(5, 0x55fdb41bb880, 46080, 0, NULL, NULL) = -1
> > EAGAIN (Resource temporarily unavailable) <0.000011>
> >  > 0.000075 sendto(5, "hello world", 11, 0, NULL, 0) = 11 <0.000023>
> >  > 0.000117 recvfrom(6, "hello worl", 10, 0, NULL, NULL) = 10 <0.000015>
> >  > 0.000050 recvfrom(6, "d", 10, 0, NULL, NULL) = 1 <0.000011>
> >  > 0.000212 recvfrom(5, "********************************"..., 46080, 0,
> > NULL, NULL) = 13312 <0.000019>
> >  > 0.050676 +++ exited with 0 +++
> >
> >
> > BAD:
> >  > 0.000000 socket(AF_INET, SOCK_STREAM|SOCK_CLOEXEC, IPPROTO_IP) = 3
> > <0.000045>
> >  > 0.000244 setsockopt(3, SOL_SOCKET, SO_SNDBUF, [1], 4) = 0 <0.000015>
> >  > 0.000057 getsockopt(3, SOL_SOCKET, SO_SNDBUF, [4608], [4]) = 0
> > <0.000013>
> >  > 0.000104 socket(AF_INET, SOCK_STREAM|SOCK_CLOEXEC, IPPROTO_IP) = 3
> > <0.000016>
> >  > 0.000065 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0 <0.000010>
> >  > 0.000038 bind(3, {sa_family=AF_INET, sin_port=htons(0),
> > sin_addr=inet_addr("127.0.0.1")}, 16) = 0 <0.000031>
> >  > 0.000068 listen(3, 50)       = 0 <0.000014>
> >  > 0.000032 setsockopt(3, SOL_SOCKET, SO_SNDBUF, [1], 4) = 0 <0.000010>
> >  > 0.000030 setsockopt(3, SOL_SOCKET, SO_RCVBUF, [1], 4) = 0 <0.000018>
> >  > 0.000060 socket(AF_INET, SOCK_STREAM|SOCK_CLOEXEC, IPPROTO_IP) = 5
> > <0.000023>
> >  > 0.000071 getsockname(3, {sa_family=AF_INET, sin_port=htons(45901),
> > sin_addr=inet_addr("127.0.0.1")}, [16]) = 0 <0.000019>
> >  > 0.000068 connect(5, {sa_family=AF_INET, sin_port=htons(45901),
> > sin_addr=inet_addr("127.0.0.1")}, 16) = -1 EINPROGRESS (Operation now in
> > progress) <0.000074>
> >  > 0.000259 accept4(3, {sa_family=AF_INET, sin_port=htons(35002),
> > sin_addr=inet_addr("127.0.0.1")}, [16], SOCK_CLOEXEC) = 6 <0.000014>
> >  > 0.000051 getsockname(6, {sa_family=AF_INET, sin_port=htons(45901),
> > sin_addr=inet_addr("127.0.0.1")}, [128 => 16]) = 0 <0.000010>
> >  > 0.000082 setsockopt(6, SOL_SOCKET, SO_SNDBUF, [1], 4) = 0 <0.000009>
> >  > 0.000040 setsockopt(6, SOL_SOCKET, SO_RCVBUF, [1], 4) = 0 <0.000009>
> >  > 0.000104 getsockopt(5, SOL_SOCKET, SO_ERROR, [0], [4]) = 0 <0.000009>
> >  > 0.000028 connect(5, {sa_family=AF_INET, sin_port=htons(45901),
> > sin_addr=inet_addr("127.0.0.1")}, 16) = 0 <0.000009>
> >  > 0.000026 setsockopt(5, SOL_SOCKET, SO_SNDBUF, [1], 4) = 0 <0.000009>
> >  > 0.000024 setsockopt(5, SOL_SOCKET, SO_RCVBUF, [1], 4) = 0 <0.000008>
> >  > 0.000071 sendto(6, "********************************"..., 46080, 0,
> > NULL, 0) = 16640 <0.000026>
> >  > 0.000117 sendto(6, "********************************"..., 29440, 0,
> > NULL, 0) = 16640 <0.000017>
> >  > 0.000041 sendto(6, "********************************"..., 12800, 0,
> > NULL, 0) = -1 EAGAIN (Resource temporarily unavailable) <0.000009>
> >  > 0.000075 recvfrom(6, 0x7f4db88a38c0, 10, 0, NULL, NULL) = -1 EAGAIN
> > (Resource temporarily unavailable) <0.000010>
> >  > 0.000086 recvfrom(5, "********************************"..., 46080, 0,
> > NULL, NULL) = 16640 <0.000018>
> >  > 0.000044 recvfrom(5, 0x55a64d59b2a0, 46080, 0, NULL, NULL) = -1
> > EAGAIN (Resource temporarily unavailable) <0.000009>
> >  > 0.000059 sendto(5, "hello world", 11, 0, NULL, 0) = 11 <0.000018>
> >  > 0.000093 recvfrom(6, "hello worl", 10, 0, NULL, NULL) = 10 <0.000009>
> >  > 0.000029 recvfrom(6, "d", 10, 0, NULL, NULL) = 1 <0.000009>
> >  > 0.206685 recvfrom(5, "********************************"..., 46080, 0,
> > NULL, NULL) = 16640 <0.000116>
> >  > 0.000306 recvfrom(5, 0x55a64d5a7600, 46080, 0, NULL, NULL) = -1
> > EAGAIN (Resource temporarily unavailable) <0.000013>
> >  > 0.000208 sendto(6, "********************************"..., 12800, 0,
> > NULL, 0) = 12800 <0.000025>
> >  > 0.206317 recvfrom(5, "********************************"..., 46080, 0,
> > NULL, NULL) = 2304 <0.000171>
> >  > 0.000304 recvfrom(5, 0x55a64d597170, 46080, 0, NULL, NULL) = -1
> > EAGAIN (Resource temporarily unavailable) <0.000029>
> >  > 0.206161 recvfrom(5, "********************************"..., 46080, 0,
> > NULL, NULL) = 2304 <0.000082>
> >  > 0.000212 recvfrom(5, 0x55a64d5a0ed0, 46080, 0, NULL, NULL) = -1
> > EAGAIN (Resource temporarily unavailable) <0.000034>
> >  > 0.206572 recvfrom(5, "********************************"..., 46080, 0,
> > NULL, NULL) = 2304 <0.000146>
> >  > 0.000274 recvfrom(5, 0x55a64d597170, 46080, 0, NULL, NULL) = -1
> > EAGAIN (Resource temporarily unavailable) <0.000029>
> >  > 0.206604 recvfrom(5, "********************************"..., 46080, 0,
> > NULL, NULL) = 2304 <0.000162>
> >  > 0.000270 recvfrom(5, 0x55a64d5a20d0, 46080, 0, NULL, NULL) = -1
> > EAGAIN (Resource temporarily unavailable) <0.000016>
> >  > 0.206164 recvfrom(5, "********************************"..., 46080, 0,
> > NULL, NULL) = 2304 <0.000116>
> >  > 0.000291 recvfrom(5, "********************************"..., 46080, 0,
> > NULL, NULL) = 1280 <0.000038>
> >  > 0.052224 +++ exited with 0 +++
> >
> > I.e. recvfrom() returns -EAGAIN and takes 200 ms.
>
> Ah, no, those 200 ms are not spent in recvfrom() but in poll:
>  > 0.000029 recvfrom(6, "d", 10, 0, NULL, NULL) = 1 <0.000009>
>  > 0.000040 epoll_wait(4, [{events=EPOLLIN, data={u32=5,
> u64=139942919405573}}], 1023, 60000) = 1 <0.204038>
>  > 0.204258 epoll_ctl(4, EPOLL_CTL_DEL, 5, 0x7ffd49b4049c) = 0 <0.000045>
>  > 0.000104 recvfrom(5, "********************************"..., 46080, 0,
> NULL, NULL) = 16640 <0.000078>
>  > 0.000264 recvfrom(5, 0x5603425f3550, 46080, 0, NULL, NULL) = -1
> EAGAIN (Resource temporarily unavailable) <0.000025>
>  > 0.000127 epoll_ctl(4, EPOLL_CTL_ADD, 5,
> {events=EPOLLIN|EPOLLPRI|EPOLLERR|EPOLLHUP, data={u32=5,
> u64=94570884890629}}) = 0 <0.000031>
>  > 0.000112 epoll_wait(4, [{events=EPOLLOUT, data={u32=6,
> u64=139942919405574}}], 1023, 60000) = 1 <0.000026>
>  > 0.000083 epoll_ctl(4, EPOLL_CTL_DEL, 6, 0x7ffd49b404fc) = 0 <0.000025>
>  > 0.000063 sendto(6, "********************************"..., 12800, 0,
> NULL, 0) = 12800 <0.000028>
>  > 0.000226 epoll_wait(4, [], 1023, 0) = 0 <0.000007>
>  > 0.000029 epoll_wait(4, [{events=EPOLLIN, data={u32=5,
> u64=94570884890629}}], 1023, 60000) = 1 <0.205476>
>  > 0.205728 epoll_ctl(4, EPOLL_CTL_DEL, 5, 0x7ffd49b404fc) = 0 <0.000077>
>  > 0.000157 recvfrom(5, "********************************"..., 46080, 0,
> NULL, NULL) = 2304 <0.000066>
>  > 0.000180 recvfrom(5, 0x5603425e30c0, 46080, 0, NULL, NULL) = -1
> EAGAIN (Resource temporarily unavailable) <0.000026>
>  > 0.000139 epoll_ctl(4, EPOLL_CTL_ADD, 5,
> {events=EPOLLIN|EPOLLPRI|EPOLLERR|EPOLLHUP, data={u32=5, u64=5}}) = 0
> <0.000030>
>  > 0.000104 epoll_wait(4, [{events=EPOLLIN, data={u32=5, u64=5}}], 1023,
> 60000) = 1 <0.205881>
>  > 0.206222 epoll_ctl(4, EPOLL_CTL_DEL, 5, 0x7ffd49b404fc) = 0 <0.000086>
>  > 0.000189 recvfrom(5, "********************************"..., 46080, 0,
> NULL, NULL) = 2304 <0.000027>
>  > 0.000153 recvfrom(5, 0x5603425ece20, 46080, 0, NULL, NULL) = -1
> EAGAIN (Resource temporarily unavailable) <0.000017>
>  > 0.000074 epoll_ctl(4, EPOLL_CTL_ADD, 5,
> {events=EPOLLIN|EPOLLPRI|EPOLLERR|EPOLLHUP, data={u32=5, u64=5}}) = 0
> <0.000018>
>  > 0.000067 epoll_wait(4, [{events=EPOLLIN, data={u32=5, u64=5}}], 1023,
> 60000) = 1 <0.205749>
>  > 0.206088 epoll_ctl(4, EPOLL_CTL_DEL, 5, 0x7ffd49b404fc) = 0 <0.000197>
>  > 0.000391 recvfrom(5, "********************************"..., 46080, 0,
> NULL, NULL) = 2304 <0.000080>
>  > 0.000210 recvfrom(5, 0x5603425e30c0, 46080, 0, NULL, NULL) = -1
> EAGAIN (Resource temporarily unavailable) <0.000027>
>  > 0.000174 epoll_ctl(4, EPOLL_CTL_ADD, 5,
> {events=EPOLLIN|EPOLLPRI|EPOLLERR|EPOLLHUP, data={u32=5, u64=5}}) = 0
> <0.000031>
>  > 0.000127 epoll_wait(4, [{events=EPOLLIN, data={u32=5, u64=5}}], 1023,
> 60000) = 1 <0.205471>
>  > 0.205783 epoll_ctl(4, EPOLL_CTL_DEL, 5, 0x7ffd49b404fc) = 0 <0.000205>
>

It seems the test had some expectations.

Setting a small (1 byte) RCVBUF/SNDBUF, and yet expecting to send
46080 bytes fast enough was not reasonable.
It might have relied on the fact that tcp sendmsg() can cook large GSO
packets, even if sk->sk_sndbuf is small.

With tight memory settings, it is possible TCP has to resort on RTO
timers (200ms by default) to recover from dropped packets.

What happens if you double /proc/sys/net/ipv4/tcp_wmem  and/or
/proc/sys/net/ipv4/tcp_rmem first value ?
(4096 -> 8192)

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ