Date:	Mon, 9 Mar 2009 13:23:15 -0700
From:	John Heffner <johnwheffner@...il.com>
To:	Marian Ďurkovič <md@....sk>
Cc:	netdev@...r.kernel.org
Subject: Re: TCP rx window autotuning harmful at LAN context

On Mon, Mar 9, 2009 at 1:02 PM, Marian Ďurkovič <md@....sk> wrote:
> On Mon, 9 Mar 2009 11:01:52 -0700, John Heffner wrote
>> On Mon, Mar 9, 2009 at 4:25 AM, Marian Ďurkovič <md@....sk> wrote:
>> >   As rx window autotuning is enabled in all recent kernels, and with 1 GB
>> > of RAM the maximum tcp_rmem becomes 4 MB, this problem is spreading rapidly
>> > and we believe it needs urgent attention. As demonstrated above, such a huge
>> > rx window (which is at least 100*BDP in the example above) does not deliver
>> > any performance gain but instead seriously harms other hosts and/or
>> > applications. It should also be noted that a host with autotuning enabled
>> > steals an unfair share of the total available bandwidth, which might look
>> > like a "better" performing TCP stack at first sight - however, such behaviour
>> > is not appropriate (RFC2914, section 3.2).
>>
>> It's well known that "standard" TCP fills all available drop-tail
>> buffers, and that this behavior is not desirable.
>
> Well, in practice that was always limited by the receive window size, which
> was 64 kB by default on most operating systems. So this undesirable behavior
> was limited to hosts where the receive window was manually increased to huge
> values.
>
> Today, the real effect of autotuning is the same as changing the receive
> window size to 4 MB on *all* hosts, since there's no mechanism to prevent it
> from growing the window to the maximum even on low-RTT paths.
>
>> The situation you describe is exactly what congestion control (the
>> topic of RFC2914) should fix.  It is not the role of receive window
>> (flow control).  It is really the sender's job to detect and react to
>> this, not the receiver's.  (We have had this discussion before on
>> netdev.)
>
> It's not terribly important whose job it is in pure theory. What matters is
> that autotuning introduced a serious problem in the LAN context by removing
> any possibility of reacting properly to increasing RTT. Again, it's not
> important whether this functionality was there by design or by coincidence,
> but it kept the system well-balanced for many years.

This is not a theoretical exercise, but one in good system design.
This "well-balanced" system was really broken all along, and
autotuning has exposed this.
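
(For a sense of scale, using illustrative numbers of my own rather than
the figures from the earlier mail: on a gigabit LAN with a 0.3 ms RTT,
the bandwidth-delay product is only about 125 MB/s * 0.0003 s ~= 37 kB,
so a 4 MB autotuned window really is on the order of 100*BDP -- and
everything beyond the BDP can only sit in a queue somewhere.)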

A drop-tail queue size of 1000 packets on a local interface is
questionable, and I think this is the real source of your problem.
This change was introduced a few years ago on most drivers -- the
default generally used to be 100.  The increase was made partly
because TCP slow-start has problems when a drop-tail queue is smaller
than the BDP.  (Limited slow-start is meant to address this problem,
but requires tuning to the right value.)  Again, using AQM is likely
the best solution.
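
To put rough numbers on it (again illustrative, not measurements from
your setup): 1000 full-size frames is about 1000 * 1500 B = 1.5 MB of
potential queue.  Draining that takes ~120 ms at 100 Mbit/s and still
~12 ms at 1 Gbit/s, so the queue by itself can inflate a sub-millisecond
LAN RTT by one to two orders of magnitude; that queue-induced RTT is
part of what lets the autotuned window keep growing.  A 100-packet
queue bounds the same delay to ~12 ms and ~1.2 ms respectively.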


> Now, as autotuning is enabled by default in the stock kernel, this problem is
> spreading into LANs without users even knowing what's going on. Therefore
> I'd like to suggest looking for a decent fix which could be implemented
> in a relatively short time frame. My proposal is this:
>
> - measure RTT during the initial phase of the TCP connection (first X segments)
> - compute the maximal receive window size from the measured RTT, using a
>   configurable constant representing the bandwidth part of the BDP
> - let autotuning do its work up to that limit.
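
For concreteness, here is a minimal sketch of what such a receiver-side
cap might look like, written as plain userspace C with made-up names
(sysctl_tcp_assumed_bw, rwin_cap_from_rtt); it only illustrates the
quoted proposal under its own assumptions and is not actual kernel code:

#include <stdint.h>

/* Administrator-configured "bandwidth part of the BDP", in bytes/sec.
 * Hypothetical tunable: 125000000 corresponds to a 1 Gbit/s LAN. */
static uint64_t sysctl_tcp_assumed_bw = 125000000ULL;

/* Proposed cap on the receive window in bytes, given the RTT measured
 * over the first X segments (rtt_us, in microseconds) and the existing
 * global limit (tcp_rmem[2]). */
uint32_t rwin_cap_from_rtt(uint32_t rtt_us, uint32_t tcp_rmem_max)
{
        uint64_t cap = sysctl_tcp_assumed_bw * rtt_us / 1000000ULL; /* ~BDP */

        if (cap < 65536)        /* keep at least the traditional 64 kB */
                cap = 65536;
        if (cap > tcp_rmem_max) /* never exceed the configured maximum */
                cap = tcp_rmem_max;
        return (uint32_t)cap;
}

With the 1 Gbit/s constant above, a 300 us LAN RTT gives a cap of
37,500 bytes (raised to the 64 kB floor), while a 50 ms WAN path gives
6.25 MB, which would then be clamped by tcp_rmem[2].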

Let's take this proposal, and try it instead at the sender side, as
part of congestion control.  Would this proposal make sense in that
position?  Would you seriously consider it there?

(As a side note, this is in fact what happens if you disable
timestamps: without timestamps TCP cannot get an updated measurement
of the RTT, only a lower bound.  However, I consider this a
limitation, not a feature.)
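
(If anyone wants to reproduce that behavior, the relevant knob on Linux
is the net.ipv4.tcp_timestamps sysctl; with it set to 0, the receiver is
left with only that lower-bound RTT estimate.)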

  -John
