[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-id: <4B1C33C5.8030309@tvk.rwth-aachen.de>
Date: Sun, 06 Dec 2009 23:44:21 +0100
From: Damian Lukowski <damian@....rwth-aachen.de>
To: Ilpo Järvinen <ilpo.jarvinen@...sinki.fi>
Cc: Frederic Leroy <fredo@...rox.org>, Netdev <netdev@...r.kernel.org>,
David Miller <davem@...emloft.net>,
Eric Dumazet <eric.dumazet@...il.com>,
Herbert Xu <herbert@...dor.apana.org.au>,
Greg KH <gregkh@...e.de>
Subject: Re: scp stalls mysteriously
Ilpo Järvinen schrieb:
> On Sat, 5 Dec 2009, Damian Lukowski wrote:
>
>> Frederic Leroy schrieb:
>>> Le Thu, 03 Dec 2009 15:10:11 +0100,
>>> Damian Lukowski <damian@....rwth-aachen.de> a écrit :
>>>>> I suppose adding || !tp->retrans_stamp into the false condition is
>>>>> fine as long as we don't then have a connection that can cause a
>>>>> connection to hang there forever for some reason (this needs to be
>>>>> understood well enough, not just test driven in stables :-)).
>>>>>
>>>>>> Unluckily, I still cannot reproduce the scp stalls here, so it
>>>>>> would be nice if Frederic printed retrans_stamp together with
>>>>>> icsk_ca_state and icsk_retransmits, please.
>>>>> It wouldn't hurt to know tp->packets_out and tp->retrans_out too,
>>>>> that might have some significant w.r.t what happens because of FRTO.
>>>> I made a patch for Frederic with Codebase
>>>> git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git
>>>>
>>>> Thanks for testing.
>>> I made a new .11 trace with damian patch.
>>> The copy went to the end.
>> Could you please make another test and unplug the cable or drop ACKs for
>> several seconds, so that some RTO retransmissions are performed?
>> I'd like to see if retrans_stamp remains constant. In the dmesg output of
>> the 11th run, it seems to change while icsk_retransmits also increases.
>> This is kind of bad for connection timeout calculation in the RTO case ...
>
>
> ...Good point, I think that's another bug in this area. We should prevent
> retrans_stamp update when the RTO itself caused tp->retrans_out to become
> zero. This bug affects also other things that are based on retrans_stamp,
> not only your code but there the effect is just less devastating.
>
> ...Anyway, we know already what happens by reading code when we know the
> case what to look for. If tcp_retransmit_skb encounters any error no
> retrans_stamp is changed (besides, cable might not be enough if you don't
> lose the route, depends on configuration what happens). And if retrans_out
> is zero, the stamp gets updated (assuming the rexmit was successfully
> made without some error condition).
>
> What I think we need is something like this to handle all error cases
> cleanly (plus to fix that another bug I mentioned above):
>
> if (!icsk->icsk_retransmits)
> return false;
>
> if (unlikely(!tp->retrans_stamp)) {
> start_ts = TCP_SKB_CB(tcp_write_queue_head(sk))->when;
> } else {
> start_ts = tp->retrans_stamp;
> }
>
> ...and then use that timestamp in the substraction. It handles the case
> where retrans_stamp was never updated (agreeably there are corner cases
> even then where the connection doesn't die exactly when we would want when
> retrans_stamp got set on some high icsk_retransmits but still it would
> only be less than double of the specified time). ...We just need to audit
> that tcp_write_queue_head and when are always valid when call happens (I
> think they should be but that has to checked).
retransmits_timed_out() is called directly by tcp_retransmit_timer() and by
tcp_write_timeout() which itself is only called by tcp_retransmit_timer().
As tcp_retransmit_timer() calls tcp_write_queue_head() at two points,
I think it is safe to assume, that it will be valid in
retransmits_timed_out() as well.
Also I have checked, that any call of tcp_transmit_skb() is preceded by
an update of cb->when. So it should be set and valid in
retransmits_timed_out(). Do you agree?
Anyway, how to proceed now? Should I make a patch and post it? If yes, on
which mailing list, which codebase? Or maybe it should be posted in a
series of patches, which also fixes the ENOMEM issue (which I do not really
have a clue about)?
Damian
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Powered by blists - more mailing lists