netdev - Re: scp stalls mysteriously

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <alpine.DEB.2.00.0912061206490.19391@melkinpaasi.cs.helsinki.fi>
Date:	Sun, 6 Dec 2009 12:29:20 +0200 (EET)
From:	"Ilpo Järvinen" <ilpo.jarvinen@...sinki.fi>
To:	Damian Lukowski <damian@....rwth-aachen.de>
cc:	Frederic Leroy <fredo@...rox.org>, Netdev <netdev@...r.kernel.org>,
	David Miller <davem@...emloft.net>,
	Eric Dumazet <eric.dumazet@...il.com>,
	Herbert Xu <herbert@...dor.apana.org.au>,
	Greg KH <gregkh@...e.de>
Subject: Re: scp stalls mysteriously

On Sat, 5 Dec 2009, Damian Lukowski wrote:

> Frederic Leroy schrieb:
> > Le Thu, 03 Dec 2009 15:10:11 +0100,
> > Damian Lukowski <damian@....rwth-aachen.de> a écrit :
> >>> I suppose adding || !tp->retrans_stamp into the false condition is
> >>> fine as long as we don't then have a connection that can cause a
> >>> connection to hang there forever for some reason (this needs to be
> >>> understood well enough, not just test driven in stables :-)).
> >>>
> >>>> Unluckily, I still cannot reproduce the scp stalls here, so it
> >>>> would be nice if Frederic printed retrans_stamp together with
> >>>> icsk_ca_state and icsk_retransmits, please.
> >>> It wouldn't hurt to know tp->packets_out and tp->retrans_out too,
> >>> that might have some significant w.r.t what happens because of FRTO.
> >> I made a patch for Frederic with Codebase
> >> git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git
> >>
> >> Thanks for testing.
> > 
> > I made a new .11 trace with damian patch.
> > The copy went to the end.
> 
> Could you please make another test and unplug the cable or drop ACKs for
> several seconds, so that some RTO retransmissions are performed?
> I'd like to see if retrans_stamp remains constant. In the dmesg output of
> the 11th run, it seems to change while icsk_retransmits also increases.
> This is kind of bad for connection timeout calculation in the RTO case ...

...Good point, I think that's another bug in this area. We should prevent 
retrans_stamp update when the RTO itself caused tp->retrans_out to become
zero. This bug affects also other things that are based on retrans_stamp, 
not only your code but there the effect is just less devastating.

...Anyway, we know already what happens by reading code when we know the 
case what to look for. If tcp_retransmit_skb encounters any error no 
retrans_stamp is changed (besides, cable might not be enough if you don't 
lose the route, depends on configuration what happens). And if retrans_out 
is zero, the stamp gets updated (assuming the rexmit was successfully 
made without some error condition).

What I think we need is something like this to handle all error cases 
cleanly (plus to fix that another bug I mentioned above):

	if (!icsk->icsk_retransmits)
		return false;

	if (unlikely(!tp->retrans_stamp)) {
		start_ts = TCP_SKB_CB(tcp_write_queue_head(sk))->when;
	} else {
		start_ts = tp->retrans_stamp;
	}

...and then use that timestamp in the substraction. It handles the case 
where retrans_stamp was never updated (agreeably there are corner cases 
even then where the connection doesn't die exactly when we would want when 
retrans_stamp got set on some high icsk_retransmits but still it would 
only be less than double of the specified time). ...We just need to audit 
that tcp_write_queue_head and when are always valid when call happens (I 
think they should be but that has to checked).

-- 
 i.