netdev - Re: [TCP]: TCP_DEFER_ACCEPT causes leak sockets

Message-ID: <20080617080958.GC12535@elte.hu>
Date:	Tue, 17 Jun 2008 10:09:58 +0200
From:	Ingo Molnar <mingo@...e.hu>
To:	David Miller <davem@...emloft.net>
Cc:	kuznet@....inr.ac.ru, vgusev@...nvz.org, mcmanus@...ksong.com,
	xemul@...nvz.org, netdev@...r.kernel.org,
	ilpo.jarvinen@...sinki.fi, linux-kernel@...r.kernel.org,
	e1000-devel@...ts.sourceforge.net
Subject: Re: [TCP]: TCP_DEFER_ACCEPT causes leak sockets


* David Miller <davem@...emloft.net> wrote:

> From: Ingo Molnar <mingo@...e.hu>
> Date: Tue, 17 Jun 2008 09:26:58 +0200
> 
> > So since there's no clear bug pattern and no sure reproducibility on 
> > my side i'd suggest we track this problem separately and "do 
> > nothing" right now. I've excluded this warning from my 'is the 
> > freshly booted kernel buggy' list of conditions of -tip testing so 
> > it's not holding me up.
> 
> I'm going to push the revert through just to be safe and I think it's 
> a good idea to do so because all of those defer accept changes should 
> be resubmitted as a group for 2.6.27

okay - in that case the full revert is well-tested on my side as well, 
fwiw.

Tested-by: Ingo Molnar <mingo@...e.hu>

> > and i can apply any test-patch if that would be helpful - if it does 
> > a WARN_ON() i'll notice it. (pure extra debug printks with no stack 
> > trace are much harder to notice in automated tests)
> 
> I don't have time to work on your bug, sorry.  Someone else will have 
> to step forward and help you with it.

it's not really "my bug" - i just offered help to debug someone else's 
bug :-) This is pretty common hw so i guess there will be such reports.

Let me describe what i'm doing exactly: i do a lot of randomized testing 
on about a dozen real systems (all across the x86 spectrum) so i tend to 
trigger a lot of mainline bugs pretty early on.

My collection of kernel bugs for the last 8 months shows 1285 bugs 
(kernel crashes or build failures - about 50%/50%) triggered. One 
test-system alone has a serial log of 15 gigabytes - and there's a dozen 
of them. That's about 5 kernel bugs a day handled by me, on average.

These systems have about 10 times the hardware variability of your 
Niagara system for example, and many of them are rather difficult to 
debug (laptops without serial port, etc.). So i physically cannot avoid 
and debug all bugs on all my test-systems, like you do on the Niagara. I 
will report bugs, i'll bisect anything that is bisectable (on average i 
bisect once a day), and i can add patches and report any test-results, 
and i'll of course debug any bugs that look like heavy mainline 
showstoppers.

> FWIW I don't think your TX timeout problem has anything to do with 
> packet ordering.  The TX element of the network device is totally 
> stateless, but it's hanging under some set of circumstances to the 
> point where we timeout and reset the hardware to get it going again.

ok. That's e1000 then. Cc:s added. Stock T60 laptop, 32-bit:

02:00.0 Ethernet controller: Intel Corporation 82573L Gigabit Ethernet Controller
        Subsystem: Lenovo ThinkPad T60
        Flags: bus master, fast devsel, latency 0, IRQ 16
        Memory at ee000000 (32-bit, non-prefetchable) [size=128K]
        I/O ports at 2000 [size=32]
        Capabilities: <access denied>
        Kernel driver in use: e1000

the problem is this non-fatal warning showing up after bootup, 
sporadically, in a non-reproducible way:

[  173.354049] NETDEV WATCHDOG: eth0: transmit timed out
[  173.354148] ------------[ cut here ]------------
[  173.354221] WARNING: at net/sched/sch_generic.c:222 dev_watchdog+0x9a/0xec()
[  173.354298] Modules linked in:
[  173.354421] Pid: 13452, comm: cc1 Tainted: G        W 2.6.26-rc6-00273-g81ae43a-dirty #2573
[  173.354516]  [<c01250ca>] warn_on_slowpath+0x46/0x76
[  173.354641]  [<c011d428>] ? try_to_wake_up+0x1d6/0x1e0
[  173.354815]  [<c01411e9>] ? trace_hardirqs_off+0xb/0xd
[  173.357370]  [<c011d43d>] ? default_wake_function+0xb/0xd
[  173.357370]  [<c014112a>] ? trace_hardirqs_off_caller+0x15/0xc9
[  173.357370]  [<c01411e9>] ? trace_hardirqs_off+0xb/0xd
[  173.357370]  [<c0142c83>] ? trace_hardirqs_on+0xb/0xd
[  173.357370]  [<c0142b33>] ? trace_hardirqs_on_caller+0x16/0x15b
[  173.357370]  [<c0142c83>] ? trace_hardirqs_on+0xb/0xd
[  173.357370]  [<c06bb3c9>] ? _spin_unlock_irqrestore+0x5b/0x71
[  173.357370]  [<c0133d46>] ? __queue_work+0x2d/0x32
[  173.357370]  [<c0134023>] ? queue_work+0x50/0x72
[  173.357483]  [<c0134059>] ? schedule_work+0x14/0x16
[  173.357654]  [<c05c59b8>] dev_watchdog+0x9a/0xec
[  173.357783]  [<c012d456>] run_timer_softirq+0x13d/0x19d
[  173.357905]  [<c05c591e>] ? dev_watchdog+0x0/0xec
[  173.358073]  [<c05c591e>] ? dev_watchdog+0x0/0xec
[  173.360804]  [<c0129ad7>] __do_softirq+0xb2/0x15c
[  173.360804]  [<c0129a25>] ? __do_softirq+0x0/0x15c
[  173.360804]  [<c0105526>] do_softirq+0x84/0xe9
[  173.360804]  [<c0129996>] irq_exit+0x4b/0x88
[  173.360804]  [<c010ec7a>] smp_apic_timer_interrupt+0x73/0x81
[  173.360804]  [<c0103ddd>] apic_timer_interrupt+0x2d/0x34
[  173.360804]  =======================
[  173.360804] ---[ end trace a7919e7f17c0a725 ]---
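
To illustrate why a WARN_ON() is easier to notice in automated tests than a bare printk: the "WARNING: at" and "end trace" lines in the trace above are fixed markers that a log scanner can match. What follows is only a minimal sketch, assuming the serial log is available as lines of text in the format shown above. The name find_warnings and the returned dict layout are hypothetical and are not taken from any existing test setup.

```python
import re

# Timestamp prefix as printed in the log above, e.g. "[  173.354298] ".
TS = r"^\[\s*(\d+\.\d+)\]\s+"

# Markers copied from the trace above; everything else here is illustrative.
WARN_RE = re.compile(TS + r"WARNING: at (\S+):(\d+) (\S+?)\(\)")
END_RE = re.compile(TS + r"---\[ end trace ([0-9a-f]+) \]---")


def find_warnings(lines):
    """Return one dict per WARNING block found in an iterable of log lines."""
    found = []
    current = None
    for line in lines:
        m = WARN_RE.match(line)
        if m:
            current = {
                "time": float(m.group(1)),
                "file": m.group(2),
                "line": int(m.group(3)),
                "where": m.group(4),
                "trace_id": None,  # filled in when the end marker is seen
            }
            found.append(current)
            continue
        m = END_RE.match(line)
        if m and current is not None:
            current["trace_id"] = m.group(2)
            current = None
    return found
```

A test harness could call find_warnings() on the boot log and fail the run if the list is non-empty. A plain debug printk has no such fixed marker, so it would need a per-message pattern.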

full report can be found at:

   http://lkml.org/lkml/2008/6/13/224

i have 3 other test-systems with e1000 (with a similar CPU) which are 
_not_ showing this symptom, so this could be some model-specific e1000 
issue.

	Ingo