Date:	Tue, 17 Jun 2014 19:00:30 -0700
From:	Jay Vosburgh <jay.vosburgh@...onical.com>
To:	Neal Cardwell <ncardwell@...gle.com>
cc:	Michal Kubecek <mkubecek@...e.cz>,
	Yuchung Cheng <ycheng@...gle.com>,
	"David S. Miller" <davem@...emloft.net>,
	netdev <netdev@...r.kernel.org>,
	Alexey Kuznetsov <kuznet@....inr.ac.ru>,
	James Morris <jmorris@...ei.org>,
	Hideaki YOSHIFUJI <yoshfuji@...ux-ipv6.org>,
	Patrick McHardy <kaber@...sh.net>
Subject: Re: [PATCH net] tcp: avoid multiple ssthresh reductions in one retransmit window

Neal Cardwell <ncardwell@...gle.com> wrote:

>On Tue, Jun 17, 2014 at 8:38 PM, Jay Vosburgh
><jay.vosburgh@...onical.com> wrote:
>> Michal Kubecek <mkubecek@...e.cz> wrote:
>>
>>>On Tue, Jun 17, 2014 at 02:35:23PM -0700, Yuchung Cheng wrote:
>>>> On Tue, Jun 17, 2014 at 5:20 AM, Michal Kubecek <mkubecek@...e.cz> wrote:
>>>> > On Mon, Jun 16, 2014 at 08:44:04PM -0400, Neal Cardwell wrote:
>>>> >> On Mon, Jun 16, 2014 at 8:25 PM, Yuchung Cheng <ycheng@...gle.com> wrote:
>>>> >> > However Linux is inconsistent on the loss of a retransmission. It
>>>> >> > reduces ssthresh (and cwnd) if this happens on a timeout, but not in
>>>> >> > fast recovery (tcp_mark_lost_retrans). We should fix that and that
>>>> >> > should help dealing with traffic policers.
>>>> >>
>>>> >> Yes, great point!
>>>> >
>>>> > Does it mean the patch itself would be acceptable if the reasoning in
>>>> > its commit message was changed? Or would you prefer a different way to
>>>> > unify the two situations?
>>>>
>>>> It's the latter but it seems to belong to a different patch (and it'll
>>>> not solve the problem you are seeing).
>>>
>>>OK, thank you. I guess we will have to persuade them to move to cubic
>>>which handles their problems much better.
>>>
>>>> The idea behind the RFC is that TCP should reduce cwnd and ssthresh
>>>> across round trips of send, but not within an RTT. Suppose cwnd was
>>>> 10 on first timeout, so cwnd becomes 1 and ssthresh is 5. Then after 3
>>>> round trips, we time out again. By the design of Reno this should
>>>> reset cwnd from 8 to 1, and ssthresh from 5 to 2.5.
>>>
>>>Shouldn't that be from 5 to 4? We reduce ssthresh to half of current
>>>cwnd, not current ssthresh.
>>>
>>>BtW, this is exactly the problem our customer is facing: they have
>>>relatively fast line (15 Mb/s) but with big buffers so that the
>>>roundtrip times can rise from unloaded 35 ms up to something like 1.5 s
>>>under full load.
>>>
>>>What happens is this: cwnd initially rises to ~2100, then the first drops
>>>are encountered, cwnd is set to 1 and ssthresh to ~1050. The slow start
>>>lets cwnd reach ssthresh but after that, a slow linear growth follows.
>>>In this state, all in-flight packets are dropped (simulation of what
>>>happens on router switchover) so that cwnd is reset to 1 again and
>>>ssthresh to something like 530-550 (cwnd was a bit higher than ssthresh).
>>>If a packet loss comes shortly after that, cwnd is still very low and
>>>ssthresh is reduced to half of that cwnd (i.e. much lower than to half
>>>of ssthresh). If unlucky, one can even end up with ssthresh reduced to 2
>>>which takes really long to recover from.
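
	Just to make sure I'm reading the arithmetic the same way: the
reduction described above is the Reno-style "half of the *current*
cwnd, floor of 2" (tcp_reno_ssthresh() is max(snd_cwnd >> 1, 2U), if
I'm reading net/ipv4/tcp_cong.c correctly), so a couple of badly timed
loss events while cwnd is still small can walk ssthresh nearly all the
way down.  A throwaway userspace sketch of that sequence, with numbers
that roughly follow the trace above (my own toy model, not kernel
code):

/* Toy model of repeated Reno-style ssthresh reductions; the helper
 * mirrors max(cwnd / 2, 2) and the numbers approximate the scenario
 * described above.  Purely illustrative, not kernel code. */
#include <stdio.h>

static unsigned int reno_ssthresh(unsigned int cwnd)
{
	return cwnd / 2 > 2 ? cwnd / 2 : 2;
}

int main(void)
{
	unsigned int cwnd = 2100;
	unsigned int ssthresh;

	ssthresh = reno_ssthresh(cwnd);	/* first drops: ssthresh ~1050 */
	printf("loss 1: cwnd=%u -> ssthresh=%u\n", cwnd, ssthresh);

	cwnd = ssthresh + 50;		/* cwnd a bit above ssthresh */
	ssthresh = reno_ssthresh(cwnd);	/* ~550 */
	printf("loss 2: cwnd=%u -> ssthresh=%u\n", cwnd, ssthresh);

	cwnd = 4;			/* another loss soon after the reset */
	ssthresh = reno_ssthresh(cwnd);	/* floors at 2 */
	printf("loss 3: cwnd=%u -> ssthresh=%u\n", cwnd, ssthresh);

	return 0;
}
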
>>
>>         I'm also looking into a problem that exhibits very similar TCP
>> characteristics, even down to cwnd and ssthresh values similar to what
>> you cite.  In this case, the situation has to do with high RTT (around
>> 80 ms) connections competing with low RTT (1 ms) connections.  This case
>> is already using cubic.
>>
>>         Essentially, a high RTT connection to the server transfers data
>> in at a reasonable and steady rate until something causes some packets
>> to be lost (in this case, another transfer from a low RTT host to the
>> same server).  Some packets are lost, and cwnd drops from ~2200 to ~300
>> (in stages, first to ~1500, then to ~600, then to ~300).  The ssthresh
>> starts at around 1100, then drops to ~260, which is the lowest cwnd
>> value.
>>
>>         The recovery from the low cwnd situation is very slow; cwnd
>> climbs a bit and then remains essentially flat for around 5 seconds.  It
>> then begins to climb until a few packets are lost again, and the cycle
>> repeats.  If no further losses occur (if the competing traffic has
>> ceased, for example), recovery from a low cwnd (300 - 750 ish) to the
>> full value (~2200) requires on the order of 20 seconds.  The connection
>> exits recovery state fairly quickly, and most of the 20 seconds is spent
>> in open state.
>
>Interesting. I'm a little surprised it takes CUBIC so long to re-grow
>cwnd to the full value. Would you be able to provide your kernel
>version number and post a tcpdump binary packet trace somewhere
>public?

	The kernel I'm using at the moment is an Ubuntu 3.2.0-54 distro
kernel, but I've reproduced the problem on Ubuntu distro 3.13 and a
mainline 3.15-rc (although in the 3.13/3.15 cases using netem to inject
delay).  I've been gathering data mostly with systemtap, but I should be
able to get some packet captures as well, although not until tomorrow.
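
	(The capture will most likely be nothing fancier than something
like "tcpdump -s 128 -w trace.pcap port 5001" on A, 5001 being the
iperf default port; I'll post a link once I have it.)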

	The test I'm using right now is pretty simple. I have three
machines: two, A and B, are separated by about 80 ms RTT; the third
machine, C, is about 1 ms from B, so:

	A --- 80ms --- B --- 1ms ---- C
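
	(In the 3.13/3.15 reproductions the 80 ms leg is injected with
netem; something like "tc qdisc add dev eth0 root netem delay 40ms" on
each of A and B gives the 80 ms round trip, with the interface name
depending on the machine.)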

	On A, I run an "iperf -i 1" to B ("-i 1" reports at one-second
intervals) and let its cwnd max out; then, on C, I run an "iperf -t 1"
to B ("-t 1" means run for only one second and then exit).  The iperf
results on A look like this:

[ ID] Interval       Transfer     Bandwidth
[  3]  0.0- 1.0 sec   896 KBytes  7.34 Mbits/sec
[  3]  1.0- 2.0 sec  1.50 MBytes  12.6 Mbits/sec
[  3]  2.0- 3.0 sec  4.62 MBytes  38.8 Mbits/sec
[  3]  3.0- 4.0 sec  13.5 MBytes   113 Mbits/sec
[  3]  4.0- 5.0 sec  27.8 MBytes   233 Mbits/sec
[  3]  5.0- 6.0 sec  39.0 MBytes   327 Mbits/sec
[  3]  6.0- 7.0 sec  36.9 MBytes   309 Mbits/sec
[  3]  7.0- 8.0 sec  34.8 MBytes   292 Mbits/sec
[  3]  8.0- 9.0 sec  39.0 MBytes   327 Mbits/sec
[  3]  9.0-10.0 sec  36.9 MBytes   309 Mbits/sec
[  3] 10.0-11.0 sec  36.9 MBytes   309 Mbits/sec
[  3] 11.0-12.0 sec  11.1 MBytes  93.3 Mbits/sec
[  3] 12.0-13.0 sec  4.50 MBytes  37.7 Mbits/sec
[  3] 13.0-14.0 sec  2.88 MBytes  24.1 Mbits/sec
[  3] 14.0-15.0 sec  5.50 MBytes  46.1 Mbits/sec
[  3] 15.0-16.0 sec  6.38 MBytes  53.5 Mbits/sec
[  3] 16.0-17.0 sec  6.50 MBytes  54.5 Mbits/sec
[  3] 17.0-18.0 sec  6.38 MBytes  53.5 Mbits/sec
[  3] 18.0-19.0 sec  4.25 MBytes  35.7 Mbits/sec
[  3] 19.0-20.0 sec  6.38 MBytes  53.5 Mbits/sec
[  3] 20.0-21.0 sec  6.38 MBytes  53.5 Mbits/sec
[  3] 21.0-22.0 sec  6.38 MBytes  53.5 Mbits/sec
[  3] 22.0-23.0 sec  6.38 MBytes  53.5 Mbits/sec
[  3] 23.0-24.0 sec  6.50 MBytes  54.5 Mbits/sec
[  3] 24.0-25.0 sec  8.62 MBytes  72.4 Mbits/sec
[  3] 25.0-26.0 sec  6.38 MBytes  53.5 Mbits/sec
[  3] 26.0-27.0 sec  8.50 MBytes  71.3 Mbits/sec
[  3] 27.0-28.0 sec  8.62 MBytes  72.4 Mbits/sec
[  3] 28.0-29.0 sec  10.6 MBytes  89.1 Mbits/sec
[  3] 29.0-30.0 sec  12.9 MBytes   108 Mbits/sec
[  3] 30.0-31.0 sec  15.0 MBytes   126 Mbits/sec
[  3] 31.0-32.0 sec  15.0 MBytes   126 Mbits/sec
[  3] 32.0-33.0 sec  21.8 MBytes   182 Mbits/sec
[  3] 33.0-34.0 sec  21.4 MBytes   179 Mbits/sec
[  3] 34.0-35.0 sec  27.8 MBytes   233 Mbits/sec
[  3] 35.0-36.0 sec  32.6 MBytes   274 Mbits/sec
[  3] 36.0-37.0 sec  36.6 MBytes   307 Mbits/sec
[  3] 37.0-38.0 sec  36.6 MBytes   307 Mbits/sec

	The second iperf starts at about time 10.  Each line covers one
second, so the Transfer column is effectively that second's throughput;
the flat stretch between roughly time 13 and time 23 is the slow cwnd
recovery.
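
	(As a rough cross-check: 53.5 Mbits/sec at ~80 ms RTT works out
to about 53.5e6 / 8 * 0.080 / 1448 ~= 370 segments in flight, and the
~327 Mbits/sec plateau to roughly 2250, assuming ~1448-byte segments;
that lines up reasonably well with the cwnd values mentioned above.)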

	I've got one graph prepared already that I can post:

http://people.canonical.com/~jvosburgh/t-vs-cwnd-ssthresh.jpg

	This shows cwnd (green) and ssthresh (red) vs. time.  In this
case, the second (low RTT) iperf started at the first big drop at around
time 22 and ran for 30 seconds (its data is not on the graph).  The big
cwnd drop is actually a series of drops, but that's hard to see at this
scale.  This graph shows two of the slow recoveries, and was done on a
3.13 kernel using netem to add delay.  The cwnd and ssthresh data was
captured by systemtap when exiting tcp_ack.

>One thing you could try would be to disable CUBIC's "fast convergence" feature:
>
>  echo 0 > /sys/module/tcp_cubic/parameters/fast_convergence
>
>We have noticed that this feature can hurt performance when there is a
>high rate of random packet drops (packet drops that are not correlated
>with the sending rate of the flow in question).

	The iperf results above were taken with this disabled; it does
not appear to have any effect.
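
	For what it's worth, my reading of net/ipv4/tcp_cubic.c (on the
3.13-ish kernels here) is that fast_convergence only changes how
last_max_cwnd (the Wmax used for the next epoch) is recorded at a loss,
while the ssthresh reduction itself is cwnd * beta / 1024 either way
(beta defaults to 717, so roughly 0.7 * cwnd).  A standalone paraphrase
of that arithmetic, so treat the details with appropriate suspicion:

/* Userspace paraphrase of the loss-time arithmetic in
 * bictcp_recalc_ssthresh() (net/ipv4/tcp_cubic.c) as I read it;
 * constants are the module defaults, epoch bookkeeping omitted.
 * Illustrative only, not kernel code. */
#include <stdio.h>

#define BICTCP_BETA_SCALE	1024
static const unsigned int beta = 717;		/* ~0.7 */

static unsigned int cubic_loss(unsigned int cwnd,
			       unsigned int *last_max_cwnd,
			       int fast_convergence)
{
	unsigned int ssthresh = (cwnd * beta) / BICTCP_BETA_SCALE;

	if (cwnd < *last_max_cwnd && fast_convergence)
		/* remember a point below the old Wmax (~0.85 * cwnd) */
		*last_max_cwnd = (cwnd * (BICTCP_BETA_SCALE + beta)) /
				 (2 * BICTCP_BETA_SCALE);
	else
		*last_max_cwnd = cwnd;

	return ssthresh > 2 ? ssthresh : 2;	/* ~0.7 * cwnd, floor 2 */
}

int main(void)
{
	unsigned int wmax_on = 2200, wmax_off = 2200;
	unsigned int cwnd = 1500;	/* a loss below the previous Wmax */
	unsigned int ss_on, ss_off;

	ss_on = cubic_loss(cwnd, &wmax_on, 1);
	ss_off = cubic_loss(cwnd, &wmax_off, 0);

	printf("fast_convergence on:  ssthresh=%u Wmax=%u\n", ss_on, wmax_on);
	printf("fast_convergence off: ssthresh=%u Wmax=%u\n", ss_off, wmax_off);
	return 0;
}

	If that reading is right, the knob only shifts where the cubic
curve plateaus in the next epoch; the immediate cwnd/ssthresh cut is
identical either way.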

	-J

---
	-Jay Vosburgh, jay.vosburgh@...onical.com