Message-ID: <20210111222411.232916-5-hcaldwel@akamai.com>
Date:   Mon, 11 Jan 2021 17:24:11 -0500
From:   Heath Caldwell <hcaldwel@...mai.com>
To:     <netdev@...r.kernel.org>
CC:     Eric Dumazet <edumazet@...gle.com>,
        Yuchung Cheng <ycheng@...gle.com>,
        Josh Hunt <johunt@...mai.com>, Ji Li <jli@...mai.com>,
        Heath Caldwell <hcaldwel@...mai.com>
Subject: [PATCH net-next 4/4] tcp: remove limit on initial receive window

Remove the 64KB limit imposed on the initial receive window.

The limit was added by commit a337531b942b ("tcp: up initial rmem to 128KB
and SYN rwin to around 64KB").

This change removes that limit so that the initial receive window can be
arbitrarily large (within existing limits and depending on the current
configuration).

The arbitrary, internal limit can interfere with research because it
irremediably restricts the receive window at the beginning of a connection
below what would be expected when explicitly configuring the receive buffer
size.
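
As a rough illustration of the clamp in question, here is a minimal
userspace sketch. It is not the kernel function: the name
model_initial_rcv_wnd and the limit_64k flag are hypothetical, `space`
stands in for the value the kernel has already derived from the receive
buffer, and the tcp_workaround_signed_windows branch is left out.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical userspace model of the last steps of initial window
 * selection; a simplification for illustration, not kernel code. */
static uint32_t model_initial_rcv_wnd(uint32_t space, int limit_64k,
                                      uint32_t init_rcv_wnd, uint32_t mss)
{
    uint32_t rcv_wnd;

    assert(mss > 0);

    /* With the limit: clamp to 65535 (U16_MAX). Without: use space as is. */
    rcv_wnd = (limit_64k && space > UINT16_MAX) ? UINT16_MAX : space;

    /* An explicitly configured initial window (in segments) still caps it. */
    if (init_rcv_wnd && init_rcv_wnd * mss < rcv_wnd)
        rcv_wnd = init_rcv_wnd * mss;

    return rcv_wnd;
}
```

With limit_64k set, a 1MB `space` yields 65535; with it cleared, the same
input passes through unchanged unless init_rcv_wnd caps it.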

-

Here is a scenario to illustrate how the limit might cause undesirable
behavior:

Consider an installation where all parts of a network are either controlled
or sufficiently monitored and there is a desired use case where a 1MB
object is transmitted over a newly created TCP connection in a single
initial burst.

Let MSS be 1460 bytes.

The initial cwnd would need to be at least:

                |-  1048576 bytes  -|
    cwnd_init = |  ---------------  | = 719 packets
                |   1460 bytes/pkt  |
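
The ceiling division above can be sketched in C as follows. The helper
name segments_needed is hypothetical and only serves to check the
arithmetic.

```c
#include <assert.h>
#include <stdint.h>

/* Number of full-sized segments needed to carry object_bytes,
 * i.e. ceil(object_bytes / mss), using integer arithmetic only. */
static uint64_t segments_needed(uint64_t object_bytes, uint64_t mss)
{
    assert(mss > 0);
    return (object_bytes + mss - 1) / mss;
}
```

For the 1MB object and an MSS of 1460 bytes, segments_needed(1048576, 1460)
evaluates to 719.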

Suppose it was determined that the network can handle bursts of 800
full-sized packets at the frequency at which the connections under
consideration are expected to occur, so the sending host is configured
to use an initial cwnd of 800 for these connections.

In order for the receiver to be able to receive a 1MB burst, it needs to
have a sufficiently large receive buffer for the connection.  Considering
overhead, let us say that the receiver is configured to initially use a
receive buffer of 2148K for TCP connections:

    net.ipv4.tcp_rmem = 4096 2199552 6291456

Let rtt be 50 milliseconds.

If the entire object is sent in a single burst, then the theoretically
highest achievable throughput (discounting handshake and request) should
be:

                   bits   1048576 bytes   8 bits
    T_upperbound = ---- = ------------- * ------ =~ 168 Mbit/s
                   rtt       0.05 s       1 byte

But if flow control limits throughput because the receive window starts at
64KB and quadruples every rtt (an approximation, but an optimistic one
judging from observation), we should expect the highest achievable
throughput to be limited to:

    bytes_sent = 65536 * (1 + 4)^(t / rtt)

    When bytes_sent = object size = 1048576:

    1048576 = 65536 * (1 + 4)^(t / rtt)
          t = rtt * log_5(16)

                            1048576 bytes              8 bits
    T_limited = ------------------------------------ * ------
                       /    |- rtt * log_5(16) -| \    1 byte
                rtt * ( 1 + |  ---------------- |  )
                       \    |        rtt        | /

                 1048576 bytes     8 bits
              = ---------------- * ------
                0.05 s * (1 + 2)   1 byte

              =~ 55.9 Mbit/s

In short: for this scenario, the 64KB limit on the initial receive window
increases the minimum achievable time to acknowledged delivery from 1 rtt
to (optimistically) 3 rtts, reducing the achievable throughput from
168 Mbit/s to 55.9 Mbit/s.
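
The figures above can be reproduced with a small C sketch of the same
model. The function names are hypothetical, and the growth model (bytes
multiplied by 1 + 4 every rtt) is the optimistic approximation used in this
scenario, not a description of the kernel's actual window growth.

```c
#include <assert.h>
#include <stdint.h>

/* Round trips until the object is fully acknowledged under the model:
 * cumulative bytes start at init_wnd and are multiplied by (1 + growth)
 * each rtt; the leading 1 matches the "1 +" term in the formula. */
static unsigned rtts_to_deliver(uint64_t object_bytes, uint64_t init_wnd,
                                unsigned growth)
{
    uint64_t sent = init_wnd;
    unsigned n = 0;

    assert(init_wnd > 0 && growth > 0);

    /* n ends up as ceil(log base (1 + growth) of object_bytes / init_wnd) */
    while (sent < object_bytes) {
        sent *= 1 + (uint64_t)growth;
        n++;
    }
    return 1 + n;
}

/* Throughput in Mbit/s when object_bytes are delivered in rtts round trips. */
static double throughput_mbit(uint64_t object_bytes, double rtt_s,
                              unsigned rtts)
{
    return (double)object_bytes * 8.0 / (rtt_s * rtts) / 1e6;
}
```

With a 65536-byte initial window and growth of 4, rtts_to_deliver gives 3
for the 1MB object, and throughput_mbit at 50 ms comes out near 55.9
Mbit/s. With an initial window that already covers the object it gives 1
rtt and roughly 168 Mbit/s.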

Here is an experimental illustration:

A time sequence chart of a packet capture taken on the sender for a
scenario similar to what is described above, where the receiver had the
64KB limit in place:

Symbols:
.:' - Data packets
_-  - Window advertised by receiver

y-axis - Relative sequence number
x-axis - Time from sending of first data packet, in seconds

3212891                                                                   _
3089318                                                                   -
2965745                                                                   -
2842172                                                                   -
2718600                                                           ________-
2595027                                                           -
2471454                                                           -
2347881                                                    --------
2224309                                                    _
2100736                                                    -
1977163                                                   --
1853590                                                   _
1730018                                                   -
1606445                                                   -
1482872                                                   -
1359300                                                   -
1235727                                                   -
1112154                                                   -
 988581                                                  _:
 865009                                   _______--------.:
 741436                                   .      :       '
 617863                                  -:
 494290                                  -:
 370718                                  .:
 247145                  --------.-------:
 123572 _________________:       '
      0 .:               '
      0.000    0.028    0.056    0.084    0.112    0.140    0.168    0.195

Note that the sender was not able to send the object in a single initial
burst and that it took around 4 rtts for the object to be fully
acknowledged.


A time sequence chart of a packet capture taken for the same scenario, but
with the limit removed:

2147035                                                                  __
2064456                                                                 _-
1981878                                                                _-
1899300                                                                -
1816721                                                               --
1734143                                                              _-
1651565                                                             _-
1568987                                                             -
1486408                                                            --
1403830                                                           _-
1321252                                                          _-
1238674                                                          -
1156095 ________________________________________________________--
1073517
 990939           :
 908360          :'
 825782         :'
 743204        .:
 660626        :
 578047       :'
 495469      :'
 412891     .:
 330313    .:
 247734    :
 165156   :'
  82578  :'
      0 .:
      0.000    0.008    0.016    0.025    0.033    0.041    0.049    0.057

Note that the sender was able to send the entire object in a single burst
and that it was fully acknowledged after a little over 1 rtt.

Signed-off-by: Heath Caldwell <hcaldwel@...mai.com>
---
 net/ipv4/tcp_output.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 1d2773cd02c8..d7ab1f5f071e 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -232,7 +232,7 @@ void tcp_select_initial_window(const struct sock *sk, int __space, __u32 mss,
 	if (sock_net(sk)->ipv4.sysctl_tcp_workaround_signed_windows)
 		(*rcv_wnd) = min(space, MAX_TCP_WINDOW);
 	else
-		(*rcv_wnd) = min_t(u32, space, U16_MAX);
+		(*rcv_wnd) = space;
 
 	if (init_rcv_wnd)
 		*rcv_wnd = min(*rcv_wnd, init_rcv_wnd * mss);
-- 
2.28.0
