[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <alpine.DEB.2.00.0911302113390.9826@melkinpaasi.cs.helsinki.fi>
Date: Mon, 30 Nov 2009 22:18:22 +0200 (EET)
From: "Ilpo Järvinen" <ilpo.jarvinen@...sinki.fi>
To: Frederic Leroy <fredo@...rox.org>
cc: Netdev <netdev@...r.kernel.org>, Asdo <asdo@...ftmail.org>
Subject: Re: scp stalls mysteriously
On Mon, 30 Nov 2009, Frederic Leroy wrote:
> Le Sat, 28 Nov 2009 13:31:14 +0200 (EET),
> "Ilpo Järvinen" <ilpo.jarvinen@...sinki.fi> a écrit :
>
> > On Sat, 28 Nov 2009, Frederic Leroy wrote:
> >
> > > Le Sat, 28 Nov 2009 00:12:23 +0200 (EET),
> > > "Ilpo Järvinen" <ilpo.jarvinen@...sinki.fi> a écrit :
> > >
> > > > On Fri, 27 Nov 2009, Frederic Leroy wrote:
> > > >
> > > > > I put traces of stall here :
> > > > > http://www.starox.org/pub/scp_stall/
> > [...]
> > Perhaps having the /proc/net/tcp would at least tell what state the
> > timer is (if I cannot reproduce right away). ...It is rather strange
> > that two independent mechanisms for loss recovery seem both to fail
> > to get triggered here, no traces of retransmission whatsoever. I
> > think it is for now enough to concentrate on what happens on
> > 192.168.1.15 (=houba?) and get tcpdump and proc/net/tcp from there,
> > the other end/direction has very little significance here (except for
> > the fact that bidirectionality might be needed to actually trigger
> > it). You could even think of getting proc/net/tcp a bit more often,
> > right from the start:
> >
> > while [ : ]; do grep ":0016" /proc/net/tcp; sleep 0.1; done | tee
> > scp_stall-houba.x.proc_net_tcp
> >
> > ...Please wait at least 2 minutes before hitting ctrl-c or otherwise
> > artificially intervening.
>
> You're right, 192.168.1.15 is houba.
> The faulty behaviour comes back today, I did new captures on
> 192.168.1.15 only :
> http://www.starox.org/pub/scp_stall/scp_stall-houba.2.pcap
> http://www.starox.org/pub/scp_stall/scp_stall-houba.2.proc_net_tcp
> http://www.starox.org/pub/scp_stall/scp_stall-houba.3.pcap
> http://www.starox.org/pub/scp_stall/scp_stall-houba.3.proc_net_tcp
>
> Which tool do you use to analyze /proc/net/tcp ?
Thanks, at least ....3.... shows that cwnd (congestion window, the 2nd to
last field in /proc/net/tcp) goes to zero which it never ever should.
...I'll start reading some code to figure out the cause for that.
How much is your RTT btw? The RTO timeout easily goes up to 120 secods
similar to what I found myself already. Thus I'm not sure if this stall
would be "infinite" after all.
I've some problem in .3. case in matching timing of the proc/net/tcp dump
and the tcpdump ... did you perhaps cut it somewhere? This might point out
to yet another problem, however, I'm not sure what I can atm trust in the
logs. ...Timestamps would probably help me a bit (I've changed also sleep
to do 0.01 granularity):
while [ : ]; do date "+%s.%N"; grep ":0016" /proc/net/tcp; sleep 0.01; done | tee scp_stall-houba.x.proc_net_tcp
(Please run start while loop before starting the transfer and all the way
until the connection is already killed and only then stop the loop).
What happened in .2. case I've very little idea, proc/net/tcp does claim
retransmission would have happened on TCP layer but that never went into
tcpdump log (the kind of thing might have happened in the .3. case too
which is why I asked for the timestamps for matching captures more exactly
but there were some retransmissions earlier in the log so I cannot be
sure). Can you also check that tcpdump doesn't lose any packets, it
should show: "0 packets dropped by kernel" when you ctrl-c it.
--
i.
Powered by blists - more mailing lists