netdev - Re: scp stalls mysteriously

Message-ID: <alpine.DEB.2.00.0912040006280.776@melkinpaasi.cs.helsinki.fi>
Date:	Fri, 4 Dec 2009 12:41:32 +0200 (EET)
From:	"Ilpo Järvinen" <ilpo.jarvinen@...sinki.fi>
To:	Frederic Leroy <fredo@...rox.org>
cc:	Damian Lukowski <damian@....rwth-aachen.de>,
	Netdev <netdev@...r.kernel.org>,
	David Miller <davem@...emloft.net>,
	Eric Dumazet <eric.dumazet@...il.com>,
	Herbert Xu <herbert@...dor.apana.org.au>,
	Greg KH <gregkh@...e.de>
Subject: Re: scp stalls mysteriously

On Thu, 3 Dec 2009, Frederic Leroy wrote:

> On Thu, 03 Dec 2009 21:34:00 +0100,
> Damian Lukowski <damian@....rwth-aachen.de> wrote:
> 
> > Frederic Leroy wrote:
> > > On Thu, 03 Dec 2009 15:10:11 +0100,
> > > Damian Lukowski <damian@....rwth-aachen.de> wrote:
> > >>> I suppose adding || !tp->retrans_stamp into the false condition is
> > >>> fine as long as we don't then have a connection that can cause a
> > >>> connection to hang there forever for some reason (this needs to be
> > >>> understood well enough, not just test driven in stables :-)).
> > >>>
> > >>>> Unluckily, I still cannot reproduce the scp stalls here, so it
> > >>>> would be nice if Frederic printed retrans_stamp together with
> > >>>> icsk_ca_state and icsk_retransmits, please.
> > >>> It wouldn't hurt to know tp->packets_out and tp->retrans_out too,
> > >>> that might have some significance w.r.t. what happens because of
> > >>> FRTO.
> > >> I made a patch for Frederic with Codebase
> > >> git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git
> > >>
> > >> Thanks for testing.
> > > 
> > > I made a new .11 trace with Damian's patch.
> > > The copy ran to completion.
> > 
> > Ok, at least this "fix" seems to work at first glance, but the printk
> > is quite useless now. Could you run another test with the printk's but
> > without the retrans_stamp == 0 check, please?
> 
> Done, it's .13
> 
> It stalled.
> 
> I managed to produce a failing stream only once. It seems that
> after 23h the network no longer suffers from interference.

...Your neighbors go to sleep early :-).

It seems that TCP was in recovery (ca_state 3) when the timeout triggered,
but for some reason there were no retransmissions in flight. Also,
TCPCB_LOST was marked for the skbs, which makes them available for
retransmission. I've read through tcp_xmit_retransmit_queue (and also
tcp_mark_head_lost) multiple times but cannot find what would prevent the
rexmit loop from sending the retransmissions. They do seem to work for the
opposite end, though.
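
To make the question concrete, here is a toy userspace model in Python of
the selection step in the rexmit loop. It is a sketch under assumptions,
not the kernel code: the flag values mirror TCPCB_SACKED_ACKED,
TCPCB_SACKED_RETRANS and TCPCB_LOST, while the names Skb, rexmit_loop and
send are made up for illustration, and forward retransmissions and the
retransmit hints are left out entirely.

```python
from dataclasses import dataclass

# Flag bits mirroring the kernel's TCP_SKB_CB(skb)->sacked values.
TCPCB_SACKED_ACKED = 0x01
TCPCB_SACKED_RETRANS = 0x02
TCPCB_LOST = 0x04


@dataclass
class Skb:
    """Hypothetical stand-in for one queued segment."""
    seq: int
    sacked: int = 0


def rexmit_loop(queue, in_flight, cwnd, send):
    """Simplified model: retransmit LOST segments while cwnd allows.

    `send(skb)` is a hypothetical callback standing in for
    tcp_retransmit_skb; it returns True on success.
    Returns the sequence numbers that were actually retransmitted.
    """
    sent = []
    for skb in queue:
        if in_flight >= cwnd:
            break  # congestion window is full, nothing more may go out
        if skb.sacked & (TCPCB_SACKED_ACKED | TCPCB_SACKED_RETRANS):
            continue  # already SACKed or already retransmitted
        if not skb.sacked & TCPCB_LOST:
            continue  # not marked lost (forward rexmits omitted here)
        if not send(skb):
            break  # the retransmit call itself failed, stop the loop
        skb.sacked |= TCPCB_SACKED_RETRANS
        in_flight += 1
        sent.append(skb.seq)
    return sent
```

In this model a LOST skb stays unsent in only three cases: the window
check fails, the skb already carries the RETRANS bit, or the send
callback fails. The third case is the one that lies outside the loop
itself.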

But there is more: also in the working case with RTOs (.12) we see that
rexmits do not always go out, see this:

t, sacked_out: 208867, 208867, 1, 1, 6, 1, 1, 1, 4
t, sacked_out: 209426, 0, 0, 3, 6, 0, 0, 1, 5
t, sacked_out: 209426, 0, 0, 3, 6, 0, 0, 1, 5
t, sacked_out: 209426, 0, 1, 1, 6, 0, 1, 1, 5
t, sacked_out: 210256, 0, 1, 1, 6, 0, 1, 1, 5
t, sacked_out: 210256, 0, 1, 1, 6, 0, 1, 1, 5
t, sacked_out: 210256, 210256, 2, 1, 6, 1, 1, 1, 5
t, sacked_out: 210855, 0, 0, 3, 6, 0, 0, 1, 4

The fourth row has an increased icsk_retransmits (done in
tcp_retransmit_timer, near its end), yet there are no retransmissions in
flight (and the previous line is printed right before that, so there is no
room for races or the like)? ...I suspect we will find the problem in
tcp_retransmit_skb or in the code it calls, if we look carefully enough.
Besides, that is the only common denominator for the fast recovery
retransmissions and the RTO retransmission.
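
The inconsistency described above can be written down as a small check
for post-processing such traces. This is a Python sketch under
assumptions: Snapshot and rexmit_missing are hypothetical names, the
fields are simply the kernel counters mentioned in this thread, and
mapping the printed columns onto them is left to whoever knows the printk
format.

```python
from dataclasses import dataclass

TCP_CA_RECOVERY = 3  # the "ca_state 3" referred to in this mail


@dataclass
class Snapshot:
    """Hypothetical record of the counters printed per trace line."""
    icsk_ca_state: int
    icsk_retransmits: int
    packets_out: int
    retrans_out: int


def rexmit_missing(s):
    """True if a retransmission is expected but none is in flight."""
    expecting = s.icsk_retransmits > 0 or s.icsk_ca_state == TCP_CA_RECOVERY
    return expecting and s.packets_out > 0 and s.retrans_out == 0
```

A hit is not proof of a bug on its own, only a row worth a closer look,
since retrans_out can legitimately drop to zero for a moment during
recovery.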

-- 
 i.