netdev - Re: scp stalls mysteriously

Message-id: <4B17A791.80808@tvk.rwth-aachen.de>
Date:	Thu, 03 Dec 2009 12:57:05 +0100
From:	Damian Lukowski <damian@....rwth-aachen.de>
To:	Ilpo Järvinen <ilpo.jarvinen@...sinki.fi>
Cc:	Frederic Leroy <fredo@...rox.org>, Netdev <netdev@...r.kernel.org>,
	Asdo <asdo@...ftmail.org>, David Miller <davem@...emloft.net>,
	Eric Dumazet <eric.dumazet@...il.com>,
	Herbert Xu <herbert@...dor.apana.org.au>,
	Greg KH <gregkh@...e.de>
Subject: Re: scp stalls mysteriously

Ilpo Järvinen wrote:
> I've added Greg to the CC to make him aware of this issue early, as it now 
> affects 2.6.32 too (it is rather important to get this dealt with quickly 
> in stable once we have a tested solution, since TCP is pretty broken with 
> the silent deaths this problem seems to cause). ...One possibility would be 
> to just queue the tested revert to stable and sort this thing out for 
> 2.6.33 in net-2.6.
> 
> Opinions, Dave?, Greg?
> 
> Now back to the issue...
> 
> You said in the other mail that "All further test are on linus-stable 
> tree.", which has this contradiction that Linus does not maintain stable 
> trees. Which exactly was the tree used for the .9. test, Linus' tree or 
> the 2.6.31 stable tree? I suppose the former since the revert wouldn't 
> apply to 2.6.31 so I just want to confirm.
> 
> 
> On Thu, 3 Dec 2009, Frederic Leroy wrote:
>> On Wed, Dec 02, 2009 at 08:17:44PM +0100, Damian Lukowski wrote:
>>> could you please printk retrans_stamp just before the return in 
>>> include/net/tcp.h:retransmits_timed_out()?
>>> If the value is not monotonically increasing but is reset to 0 at some
>>> point, this might lead to problems in tcp_write_timeout().
>>> It's the only idea I have now.
>> Your idea is good.
>> Only one out of 4 values is non-zero.
>>
>> The logs at http://wwW.starox.org/pub/scp_stall correspond to .10
>>
>> I made 2 attempts. The printk output corresponding to .10 is what follows 
>> the line "wlan1 enter promiscuous mode"
> 
> Nice thinking indeed Damian, thanks. ...But where exactly did you 
> print? ...There are multiple returns, and the return false branch is 
> expected to have a zero retrans_stamp in the typical case, but that is not 
> a problem because we never use the value there.

Yes, it's the retrans_stamp in the subtraction that I suspected to be 0.
I also suspect this happens only in the ca_state < CA_Loss case,
so a first solution might be to return true whenever retrans_stamp == 0.
Unfortunately, I still cannot reproduce the scp stalls here, so it would be
nice if Frederic could print retrans_stamp together with icsk_ca_state and
icsk_retransmits, please.
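
To make the concern about the subtraction concrete, here is a small
user-space sketch of the arithmetic, loosely modelled on
retransmits_timed_out(). It is not the kernel code: all model_* names are
made up for illustration, and HZ = 1000 with RTO bounds of 200 ms and 120 s
is assumed. It only shows that with retrans_stamp == 0 the elapsed time
becomes the whole timestamp value, so the comparison against the timeout
succeeds at once.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed values for illustration only (jiffies at HZ = 1000). */
#define MODEL_HZ       1000u
#define MODEL_RTO_MIN  (MODEL_HZ / 5u)     /* 200 ms */
#define MODEL_RTO_MAX  (120u * MODEL_HZ)   /* 120 s  */

/* Floor of log2(v) for v > 0. */
static unsigned int model_ilog2(uint32_t v)
{
    unsigned int r = 0;

    while (v >>= 1)
        r++;
    return r;
}

/* Time budget in jiffies for `boundary` backed-off retransmissions. */
static uint32_t model_timeout(unsigned int boundary)
{
    unsigned int thresh = model_ilog2(MODEL_RTO_MAX / MODEL_RTO_MIN);

    assert(boundary < 64);  /* keep the arithmetic inside 32 bits */
    if (boundary <= thresh)
        return ((2u << boundary) - 1u) * MODEL_RTO_MIN;
    return ((2u << thresh) - 1u) * MODEL_RTO_MIN +
           (boundary - thresh) * MODEL_RTO_MAX;
}

/*
 * Hypothetical model of the check: no retransmits means "not timed out";
 * otherwise compare the time since retrans_stamp against the budget.
 * The unsigned subtraction wraps the same way a 32-bit timestamp would.
 */
static bool model_timed_out(uint32_t now, uint32_t retrans_stamp,
                            unsigned int retransmits, unsigned int boundary)
{
    if (!retransmits)
        return false;
    return (uint32_t)(now - retrans_stamp) >= model_timeout(boundary);
}
```

With now = 5000000 and boundary = 3, a stamp taken 1000 jiffies ago is still
within the budget, whereas a stamp of 0 is reported as timed out even though
no real time has passed since the first retransmission.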

Damian

> ...Anyway, if I'm wrong about my suspicion and it still holds that we have 
> a zero retrans_stamp in the subtraction too, it could have something to do 
> with this snippet:
> 
> static void tcp_try_to_open(struct sock *sk, int flag)
> {
>         struct tcp_sock *tp = tcp_sk(sk);
> 
>         tcp_verify_left_out(tp);
> 
>         if (!tp->frto_counter && tp->retrans_out == 0)
>                 tp->retrans_stamp = 0;
> 
> ...It bit me last time, when FRTO was enabled after a very small 
> modification (without running a full verification after the trivial-looking 
> change). ...So I've worked around this clearing for FRTO, as you 
> can see :-).
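
For readers following along, the condition in the quoted tcp_try_to_open()
fragment can be exercised in isolation. The sketch below is a user-space
stand-in, not kernel code: struct model_tcp_sock and model_try_to_open() are
hypothetical and carry only the three fields the quoted lines touch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in holding only the fields used by the quoted lines. */
struct model_tcp_sock {
    uint32_t retrans_stamp;     /* time of the first retransmission */
    unsigned int retrans_out;   /* retransmitted segments still in flight */
    unsigned int frto_counter;  /* non-zero while F-RTO is in progress */
};

/* Mirrors only the quoted condition: clear the stamp unless F-RTO is active
 * or retransmitted segments are still outstanding. */
static void model_try_to_open(struct model_tcp_sock *tp)
{
    assert(tp != NULL);
    if (!tp->frto_counter && tp->retrans_out == 0)
        tp->retrans_stamp = 0;
}
```

Starting from a non-zero retrans_stamp, the stamp survives the call only if
frto_counter or retrans_out is non-zero; otherwise it is reset to 0.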
> 
> 
> Also, we have another mystery to be solved: the fast retransmission is 
> not triggered for some reason (or alternatively not captured in the 
> log), even in the working .9. case. It would be easy to see whether it 
> works at all from the TCP point of view by looking into the MIBs once you 
> have some transfers in a working configuration:
> 
> grep -A1 TCP /proc/net/netstat
> 
> ...luckily this fast retransmit issue is less crucial as almost all people 
> are pretty happy already if their RTO-based recovery works even if the 
> fast recovery would not. So figuring it out can be postponed (if one has 
> to prioritize) until the silent death issue is out of the way.
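
The output of the grep suggested above comes as pairs of lines: a header
line with counter names and a line with the matching values. A minimal
sketch for pulling one counter out of such a pair follows. The function name
netstat_lookup() is made up, and which counter names are present depends on
the kernel in question.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Look up counter `name` in a header/value line pair as found in
 * /proc/net/netstat. Tokens are matched by position. Returns 0 and stores
 * the value in *out on success, -1 if the name is absent or not numeric.
 */
static int netstat_lookup(const char *header, const char *values,
                          const char *name, unsigned long long *out)
{
    const char *sep = " \t\n";
    size_t name_len;

    assert(header && values && name && out);
    name_len = strlen(name);

    for (;;) {
        size_t hlen, vlen;

        /* Skip separators, then measure the next token on each line. */
        header += strspn(header, sep);
        values += strspn(values, sep);
        hlen = strcspn(header, sep);
        vlen = strcspn(values, sep);
        if (hlen == 0 || vlen == 0)
            return -1;              /* ran out of tokens */

        if (hlen == name_len && strncmp(header, name, hlen) == 0)
            return sscanf(values, "%llu", out) == 1 ? 0 : -1;

        header += hlen;
        values += vlen;
    }
}
```

Feeding it the two TcpExt lines read from the file and a name such as
TCPFastRetrans (assuming that counter is exported) gives the number to
compare before and after a transfer.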
> 
> 
