netdev - Re: TCP connection stalls under 2.6.24.7

Message-ID: <Pine.LNX.4.64.0807181617550.24938@wrl-59.cs.helsinki.fi>
Date:	Fri, 18 Jul 2008 16:55:22 +0300 (EEST)
From:	"Ilpo Järvinen" <ilpo.jarvinen@...sinki.fi>
To:	Thomas Jarosch <thomas.jarosch@...ra2net.com>
cc:	Jozsef Kadlecsik <kadlec@...ckhole.kfki.hu>,
	Netdev <netdev@...r.kernel.org>,
	Patrick McHardy <kaber@...sh.net>,
	Sven Riedel <sr@...urenet.de>,
	Netfilter Developer Mailing List 
	<netfilter-devel@...r.kernel.org>,
	"Dâniel Fraga" <fragabr@...il.com>,
	David Miller <davem@...emloft.net>
Subject: Re: TCP connection stalls under 2.6.24.7

On Fri, 18 Jul 2008, Thomas Jarosch wrote:

> On Thursday, 17. July 2008 17:53:01 Ilpo Järvinen wrote:
> > > > One option would be to disable reentry to FRTO when some progress was
> > > > made... Please try with the patch below...
> >
> > Ah, I just forgot that the situation might persist... Try with this
> > one instead...
> 
> Good news everyone: Two connections made it to the finish line.
> 
> The bad part: One transfer took four minutes, the other sixteen minutes.
> A colleague commented it's still much faster than carrying the message
> by plane ;-) A session without FRTO takes around 84 seconds.

...I guess if you limited ssthresh to some small value, you might beat
that time even without FRTO.

> I've added debug printks() to every return path in tcp_use_frto(),
> so you can see what's going on. They look like this:
> 
> Jul 18 10:20:40 intratest131 kernel: [  957.318006] tcp_use_frto: ENTER: frto_counter: 0, icsk->icsk_ca_state: 0
> Jul 18 10:20:40 intratest131 kernel: [  957.318011] tcp_use_frto: DEFAULT RETURN 1;
> Jul 18 10:21:08 intratest131 kernel: [  984.446006] tcp_use_frto: ENTER: frto_counter: 3, icsk->icsk_ca_state: 0
> Jul 18 10:21:08 intratest131 kernel: [  984.446011] tcp_use_frto: RETURN in "tp->frto_counter > 1 || icsk->icsk_ca_state == TCP_CA_Loss"
> Jul 18 10:21:14 intratest131 kernel: [  991.058006] tcp_use_frto: ENTER: frto_counter: 0, icsk->icsk_ca_state: 0
> Jul 18 10:21:14 intratest131 kernel: [  991.058011] tcp_use_frto: DEFAULT RETURN 1;
> 
> Here are two new dumps and the corresponding debug traces:
> http://www.intra2net.com/de/download/tcpdump/tcp_frto_second_patch.tar.bz2

It seems that with FRTO the retransmission timeout grows much higher, which
causes longer delays whenever things continue by RTO. This might simply be
due to the fact that some of the timeouts do indeed seem spurious, and with
FRTO we can take RTT measurements from those. I'll keep digging deeper... The
receiver is definitely doing something crazy as well, e.g.:

6.1.131.56060: . ack 1995587 win 65535
152.31.131.25: . 1998387:1999787(1400) ack 562 win 7504 (DF)
152.31.131.25: . 1999787:2001187(1400) ack 562 win 7504 (DF)
152.31.131.25: . 2001187:2002587(1400) ack 562 win 7504 (DF)
6.1.131.56060: . ack 1995587 win 8192 (DF)
6.1.131.56060: . ack 1996987 win 8192 (DF)
6.1.131.56060: . ack 1996987 win 8192 (DF)
6.1.131.56060: . ack 1996987 win 8192 (DF)

...The receiver shrank the window here (it's not the only example) :-),
though on the bright side, those are duplicate ACKs... :-D
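
Spotting these by eye in a long dump is tedious, so here is a minimal
sketch of how the check could be scripted. It is only an illustration and
rests on a few assumptions: the input uses the abbreviated tcpdump text
format quoted above, the printed "win" values are the effective window
(no window scaling applied), and the first token of each line is good
enough to tell the flows apart. The names ACK_RE and find_window_shrinks
are made up for this example. It flags every pure ACK whose right window
edge (ack + win) lies to the left of the previous one from the same source.

```python
import re

# Pure-ACK lines as they appear in the excerpt, e.g.
#   "6.1.131.56060: . ack 1995587 win 8192 (DF)"
# Data segments ("... 1998387:1999787(1400) ack 562 ...") do not match.
ACK_RE = re.compile(r"^(\S+): \. ack (\d+) win (\d+)")


def find_window_shrinks(lines):
    """Return (previous_edge, new_edge, line) for each pure ACK whose
    right window edge (ack + win) moved left.

    Assumption: "win" is taken at face value, i.e. no window scaling.
    """
    last_edge = {}  # flow key (first token of the line) -> last right edge
    shrinks = []
    for raw in lines:
        line = raw.strip()
        match = ACK_RE.match(line)
        if match is None:
            continue  # not a pure ACK in the assumed format
        key = match.group(1)
        edge = int(match.group(2)) + int(match.group(3))
        previous = last_edge.get(key)
        if previous is not None and edge < previous:
            shrinks.append((previous, edge, line))
        last_edge[key] = edge
    return shrinks


if __name__ == "__main__":
    import sys

    # Usage: feed the textual dump on stdin.
    for previous, edge, line in find_window_shrinks(sys.stdin):
        print(f"right edge {previous} -> {edge}: {line}")
```

Comparing the right edge rather than the bare "win" value matters: a
smaller "win" combined with a larger "ack" is perfectly normal, only a
right edge that moves backwards counts as shrinking the window.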

Btw, which kernel did you run these on (I hope it wasn't 2.6.24.7,
which has FRTO-related bugs anyway that the patches I've sent so far won't
fix)?

-- 
 i.