netdev - Re: [bug] stuck localhost TCP connections, v2.6.26-rc3+

Message-ID: <Pine.LNX.4.64.0805261550340.16829@wrl-59.cs.helsinki.fi>
Date:	Mon, 26 May 2008 16:28:18 +0300 (EEST)
From:	"Ilpo Järvinen" <ilpo.jarvinen@...sinki.fi>
To:	Ingo Molnar <mingo@...e.hu>
cc:	LKML <linux-kernel@...r.kernel.org>,
	Netdev <netdev@...r.kernel.org>,
	"David S. Miller" <davem@...emloft.net>,
	"Rafael J. Wysocki" <rjw@...k.pl>,
	Andrew Morton <akpm@...ux-foundation.org>
Subject: Re: [bug] stuck localhost TCP connections, v2.6.26-rc3+

On Mon, 26 May 2008, Ingo Molnar wrote:

> in an overnight -tip testruns that is based on recent -git i got two 
> stuck TCP connections:
> 
> Active Internet connections (w/o servers)
> Proto Recv-Q Send-Q Local Address               Foreign Address             State      
> tcp        0 174592 10.0.1.14:58015             10.0.1.14:3632              ESTABLISHED 
> tcp    72134      0 10.0.1.14:3632              10.0.1.14:58015             ESTABLISHED 
> 
> on a previously reliable machine. That connection has been stuck for 9 
> hours so it does not time out, etc. - and the distcc run that goes over 
> that connection is stuck as well.
> 
> kernel config is attached.
> 
> in terms of debugging there's not much i can do i'm afraid. It's not 
> possible to get a tcpdump of this incident, given the extreme amount of 
> load these testboxes handle.

...but you can still tcpdump that particular flow once the situation is 
discovered, to see if TCP still tries to do something, no? The capture 
needs to run for a couple of minutes at minimum. Also, please get the 
/proc/net/tcp entries for that flow around the same time.
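
To make the request above concrete, here is a minimal sketch in Python of 
how the /proc/net/tcp entries for the stuck flow could be picked out and 
decoded, plus a helper that builds a tcpdump command line limited to that 
flow. The helper names (decode_endpoint, parse_tcp_line, find_flow, 
tcpdump_argv) are made up for illustration. It assumes a little-endian 
host such as x86, where the address column is printed as a native-endian 
32-bit hex value, and it assumes that traffic to the box's own address 
travels over "lo".

```python
import socket
import struct

TCP_ESTABLISHED = 0x01  # value of the "st" column for ESTABLISHED


def decode_endpoint(hex_endpoint):
    """Turn 'AAAAAAAA:PPPP' from /proc/net/tcp into (dotted_ip, port)."""
    addr_hex, port_hex = hex_endpoint.split(":")
    # Assumption: little-endian host, so the hex word is byte-swapped
    # relative to network order.
    packed = struct.pack("<I", int(addr_hex, 16))
    return socket.inet_ntoa(packed), int(port_hex, 16)


def parse_tcp_line(line):
    """Decode the leading columns of one /proc/net/tcp entry."""
    fields = line.split()
    tx_hex, rx_hex = fields[4].split(":")
    return {
        "local": decode_endpoint(fields[1]),
        "remote": decode_endpoint(fields[2]),
        "state": int(fields[3], 16),
        "tx_queue": int(tx_hex, 16),  # corresponds to netstat's Send-Q
        "rx_queue": int(rx_hex, 16),  # corresponds to netstat's Recv-Q
    }


def find_flow(lines, ports):
    """Return entries whose local/remote port pair equals `ports`."""
    wanted = set(ports)
    matches = []
    for line in lines:
        fields = line.split()
        # Skip the header row and anything that is not a socket entry.
        if len(fields) < 5 or ":" not in fields[1]:
            continue
        entry = parse_tcp_line(line)
        if {entry["local"][1], entry["remote"][1]} == wanted:
            matches.append(entry)
    return matches


def tcpdump_argv(host, port_a, port_b, outfile, iface="lo"):
    """Build a tcpdump command line that captures only the one flow."""
    flt = f"host {host} and tcp port {port_a} and tcp port {port_b}"
    return ["tcpdump", "-i", iface, "-n", "-w", outfile, flt]
```

On the affected box, something like 
find_flow(open("/proc/net/tcp"), (58015, 3632)) should return both ends 
of the connection, and tcpdump_argv("10.0.1.14", 58015, 3632, 
"stuck.pcap") can be handed to subprocess.run() as root and left running 
for a few minutes. The output file name is a placeholder.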

> This problem started sometime around rc3 
> and it occurred on two boxes (on a laptop and on a desktop), both are SMP 
> Core2Duo based systems. I never saw this problem before on thousands of 
> similar bootups, so i'm 99.9% sure the bug is either new or became 
> easier to trigger.
>
> It's not possible to bisect it as it needs up to 12 hours of heavy 
> workload to trigger. The incident happened about 5 times since the first 
> incident a couple of days ago - 4 times on one box and once on another 
> box. The first failing head i became aware of was 78b58e549a3098. (-tip 
> has other changes beyond -git but changes nothing in networking.)
> 
> One clue (which might or might not matter) is that distcc is one of the 
> very few applications that makes use of sendfile().

Can you please try with /proc/sys/net/ipv4/tcp_frto set to zero? The 
recv-q symptom would be weird if it were related to FRTO, but there were 
some recent fixes to FRTO, and the retrans_stamp change could have some 
significance here.
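
For reference, a small sketch of flipping that knob from Python while 
remembering the old value, so it can be restored after the test run. The 
function names are illustrative only. Writing the file requires root, and 
the setting does not persist across reboots.

```python
FRTO_PATH = "/proc/sys/net/ipv4/tcp_frto"


def read_sysctl(path=FRTO_PATH):
    """Return the current value of a sysctl file as a stripped string."""
    with open(path) as f:
        return f.read().strip()


def write_sysctl(value, path=FRTO_PATH):
    """Write a new value to a sysctl file (needs root for /proc/sys)."""
    with open(path, "w") as f:
        f.write(f"{value}\n")


def disable_frto(path=FRTO_PATH):
    """Set tcp_frto to 0 and return the previous value for later restore."""
    previous = read_sysctl(path)
    write_sysctl(0, path)
    return previous
```

Calling old = disable_frto() before starting the workload and 
write_sysctl(old) afterwards keeps the box's original setting intact.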

Other than that, nothing since -rc1 seems suspicious to me (though
I hardly understand every part of networking).


-- 
 i.