netdev - Re: [bug] stuck localhost TCP connections, v2.6.26-rc3+

Date:	Mon, 2 Jun 2008 11:23:53 +0200
From:	Ingo Molnar <mingo@...e.hu>
To:	Eric Dumazet <dada1@...mosbay.com>
Cc:	Patrick McManus <mcmanus@...ksong.com>,
	Ilpo Järvinen <ilpo.jarvinen@...sinki.fi>,
	Peter Zijlstra <peterz@...radead.org>,
	LKML <linux-kernel@...r.kernel.org>,
	Netdev <netdev@...r.kernel.org>,
	"David S. Miller" <davem@...emloft.net>,
	"Rafael J. Wysocki" <rjw@...k.pl>,
	Andrew Morton <akpm@...ux-foundation.org>,
	Evgeniy Polyakov <johnpol@....mipt.ru>
Subject: Re: [bug] stuck localhost TCP connections, v2.6.26-rc3+

* Eric Dumazet <dada1@...mosbay.com> wrote:

>>>  22003 write(3, "distccd[22003] (dcc_listen_by_ad"..., 62) = 62
>>>  22003 listen(4, 10)                     = 0
>>>  22003 setsockopt(4, SOL_TCP, TCP_DEFER_ACCEPT, [1], 4) = 0
>>>
>>> i'll queue up your reverts for testing in -tip.

I turned off localhost distcc two days ago and there has not been a 
single hung socket since then, so we now know for sure that without 
localhost distcc connections, -tip's QA does not produce any hung 
sockets in about 1000 random-kernel-build+boot iterations.

i've added those reverts this morning and added back the localhost 
distcc rules - we'll see whether the hung sockets are back.

> I believe Ingo's problems occur on long-lived sockets (where many 
> bytes were exchanged between the peers), so I don't think 
> DEFER_ACCEPT is the culprit.
>
> I suggest enabling CONFIG_TIMER_STATS and checking the timers, because 
> /proc/net/tcp can display apparently large timer values when the timer 
> has elapsed (jiffies > icsk->icsk_timeout) and 
> jiffies_to_clock_t(timer_expires - jiffies) then overflows while doing 
> a multiply and a divide.

i'm wondering whether your suspicion of broken TCP timers is consistent 
with the symptoms i've seen: the hung sockets clearly produced periodic 
packet activity every 180 seconds, for up to 8 hours, without ever 
changing their receive or send queue. So at least a part of the TCP 
timer mechanism for that specific stuck socket was working fine.
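
For reference, the display artifact described in the quoted text can be 
sketched outside the kernel. This is only a rough model, not the kernel 
code: the 32-bit counter width, HZ=1000, USER_HZ=100 and all function 
names below are assumptions made for illustration. It shows how an 
already-elapsed timer turns into a huge "remaining" value when the 
subtraction is done unsigned, and how clamping avoids that.

```python
MASK32 = 0xFFFFFFFF  # assumption: 32-bit unsigned tick counter
HZ = 1000            # assumption: kernel ticks per second
USER_HZ = 100        # assumption: ticks per second reported to user space


def remaining_ticks_unsigned(expires, now):
    """Unsigned 32-bit subtraction; wraps around when expires < now."""
    return (expires - now) & MASK32


def remaining_ticks_clamped(expires, now):
    """Treat the difference as signed first; an elapsed timer yields 0."""
    delta = (expires - now) & MASK32
    if delta >= 0x80000000:  # negative when read as a signed 32-bit value
        return 0
    return delta


def to_user_ticks(ticks):
    """Simplified stand-in for a tick conversion: multiply, then divide."""
    return ticks * USER_HZ // HZ


if __name__ == "__main__":
    now = 1_000_000
    expired = now - 5  # timer elapsed 5 ticks ago
    print(to_user_ticks(remaining_ticks_unsigned(expired, now)))  # huge
    print(to_user_ticks(remaining_ticks_clamped(expired, now)))   # 0
```

A large value in the timer column is therefore not, by itself, proof 
that the timer is armed far in the future.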

is there no sysctl or other debug mechanism to somehow get its full TCP 
state and the reasons for why it is stuck? I'm wondering how you debug 
broken TCP state machines without enabling testers to be able to dump 
all state and passing it to developers.
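
The one per-socket source already mentioned in this thread is 
/proc/net/tcp, which exposes only part of the state (queues, timer 
kind, retransmit count), not the full picture asked for above. A 
minimal sketch of decoding one of its lines follows. The sample line is 
made up for illustration, the function names are hypothetical, and the 
address decoding assumes a little-endian host.

```python
import socket
import struct

TCP_STATES = {
    0x01: "ESTABLISHED", 0x02: "SYN_SENT", 0x03: "SYN_RECV",
    0x04: "FIN_WAIT1", 0x05: "FIN_WAIT2", 0x06: "TIME_WAIT",
    0x07: "CLOSE", 0x08: "CLOSE_WAIT", 0x09: "LAST_ACK",
    0x0A: "LISTEN", 0x0B: "CLOSING",
}
TIMER_KINDS = {
    0: "none", 1: "retransmit", 2: "other (e.g. keepalive)",
    3: "time_wait", 4: "zero-window probe",
}


def decode_addr(field):
    """Decode 'HEXIP:HEXPORT' (IPv4, little-endian host assumed)."""
    ip_hex, port_hex = field.split(":")
    ip = socket.inet_ntoa(struct.pack("<I", int(ip_hex, 16)))
    return ip, int(port_hex, 16)


def parse_proc_net_tcp_line(line):
    """Extract the queue and timer fields from one /proc/net/tcp line."""
    f = line.split()
    tx_hex, rx_hex = f[4].split(":")
    timer_hex, when_hex = f[5].split(":")
    return {
        "local": decode_addr(f[1]),
        "remote": decode_addr(f[2]),
        "state": TCP_STATES.get(int(f[3], 16), "UNKNOWN"),
        "tx_queue": int(tx_hex, 16),
        "rx_queue": int(rx_hex, 16),
        "timer": TIMER_KINDS.get(int(timer_hex, 16), "unknown"),
        "timer_when": int(when_hex, 16),  # in clock ticks, as displayed
        "retransmits": int(f[6], 16),
    }


# Made-up sample line, not taken from the stuck box.
SAMPLE = ("   1: 0100007F:0E30 0100007F:D431 01 00000000:0000004A "
          "02:000AFC8B 00000000  1000        0 123456")

if __name__ == "__main__":
    print(parse_proc_net_tcp_line(SAMPLE))
```

Sampling these fields for a stuck socket at intervals would at least 
show whether the queues and the timer column change over time.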

I have a clearly reproducible testcase and i'd like to help out, but the 
whole effort is stalled on 'not enough information' it appears. Doing 
random reverts might help in truly helpless situations where a bug has 
no debuggable state - but this situation seems really routine to me: 
it's very difficult to trigger the bug but once it triggers the bug 
scenario is stable and analyzable. I'd be glad to test any 
instrumentation patch that makes similar scenarios more analyzable.

	Ingo