netdev - Re: TCP stack bug related to F-RTO?

Date:	Fri, 25 Sep 2009 16:09:38 +0300 (EEST)
From:	"Ilpo Järvinen" <ilpo.jarvinen@...sinki.fi>
To:	Ray Lee <ray-lk@...rabbit.org>
cc:	Joe Cao <caoco2002@...oo.com>, Netdev <netdev@...r.kernel.org>,
	LKML <linux-kernel@...r.kernel.org>, jcaoco2002@...oo.com
Subject: Re: TCP stack bug related to F-RTO?

On Thu, 24 Sep 2009, Ray Lee wrote:

> [adding netdev cc:]
> 
> On Thu, Sep 24, 2009 at 10:43 AM, Joe Cao <caoco2002@...oo.com> wrote:
> >
> > Hello,
> >
> > I have found the following behavior with different versions of linux 
> > kernel. The attached pcap trace is collected with server 
> > (192.168.0.13) running 2.6.24 and shows the problem. Basically the 
> > behavior is like this: 
> >
> > 1. The client opens up a big window,
> > 2. the server sends 19 packets in a row (pkt #14- #32 in the trace), but all of them are dropped due to some congestion.
> > 3. The server hits RTO and retransmits pkt #14 in #33
> > 4. The client immediately acks #33 (=#14), and the server (which seems to enter F-RTO) expands the window and sends *NEW* pkts #35 & #36. The timeout is doubled to 2*RTO. The client immediately sends two dup-acks in response to #35 and #36.
> > 5. After 2*RTO, pkt #15 is retransmitted in #39.
> > 6. The client immediately acks #39 (=#15) in #40, and the server continues to expand the window and sends two *NEW* pkts #41 & #42. Now the timeout is doubled to 4*RTO.
> > 8. After the 4*RTO timeout, #16 is retransmitted.
> > 9....
> > 10. The above steps repeat for retransmitting pkts #16-#32, and each time the timeout is doubled.
> > 11. It takes a very long time to retransmit all the lost packets, and before that is done, the client sends a RST because of a timeout.
> >
> > The above behavior looks like F-RTO is in effect.  And there seems to 
> > be a bug in the TCP's congestion control and retransmission algorithm. 
> > Why doesn't the TCP on server (running 2.6.24) enter the slow start? 
> > Why should the server take that long to recover from a short period 
> > of packet loss?
> >
> > Has anyone else noticed a similar problem before?  If my analysis is 
> > wrong, can anyone give me some pointers to what's really wrong and 
> > how to fix it?
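
As a rough illustration of why the recovery described above takes so long, here is a minimal sketch that adds up the waiting time when each lost packet is retransmitted only after its own timeout and the timeout doubles every time. The function name and the initial RTO values are assumptions chosen for illustration, not values read from the trace; the 120 second cap corresponds to Linux's TCP_RTO_MAX.

```python
TCP_RTO_MAX = 120.0  # seconds; Linux's upper bound on the retransmission timer


def recovery_time(lost_packets, initial_rto, rto_max=TCP_RTO_MAX):
    """Total time spent waiting if every lost packet needs its own timeout
    and the timeout doubles after each expiry, capped at rto_max."""
    total = 0.0
    rto = initial_rto
    for _ in range(lost_packets):
        total += rto                 # wait for the current timeout to expire
        rto = min(rto * 2, rto_max)  # exponential backoff for the next one
    return total


if __name__ == "__main__":
    # The initial RTO values below are hypothetical, not taken from the trace.
    for rto in (0.2, 1.0, 3.0):
        total = recovery_time(19, rto)
        print(f"initial RTO {rto}s: {total:.1f}s to retransmit 19 packets")
```

Even with the cap in place, the total runs to many minutes for 19 lost packets, which is consistent with the client giving up and sending a RST first.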

Yes, 2.6.24 is an obsolete version with known bugs in its FRTO 
implementation. The fixes never went into the 2.6.24 stable series, as it was 
_already_ obsolete when the problems were reported and found. The 
correct fixes can be found in 2.6.25.7 (.7 iirc) and are included from 
2.6.26 onward too.
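
A quick way to apply the version rule above to a mainline release string could look like the sketch below. The helper names are hypothetical, the 2.6.25.7 boundary is the one recalled from memory above, and the check says nothing about vendor kernels that may carry backported fixes under an older version number.

```python
import platform


def parse_version(release):
    """Turn a release string such as '2.6.25.7-foo' into a tuple of ints,
    ignoring anything after the first '-'."""
    parts = []
    for piece in release.split("-")[0].split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def has_frto_fixes(release):
    """True if a mainline kernel with this version should contain the FRTO
    fixes: 2.6.25.7 or later within 2.6.25.y, or 2.6.26 and newer."""
    version = parse_version(release)
    if version[:3] == (2, 6, 25):
        return version >= (2, 6, 25, 7)
    return version >= (2, 6, 26)


if __name__ == "__main__":
    release = platform.release()
    print(release, "has FRTO fixes (by version):", has_frto_fixes(release))
```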

Just in case you happen to run an Ubuntu-based kernel from that era (of 
course you should be reporting the bug here then...), a word of warning: 
it seemed nearly impossible for them to get a simple thing like that 
fixed. I haven't checked whether they eventually came to some sensible 
conclusion on the matter or whether it is still unresolved (or e.g., closed 
without real resolution).

-- 
 i.