Message-ID: <20090324130202.GA32469@elte.hu>
Date: Tue, 24 Mar 2009 14:02:02 +0100
From: Ingo Molnar <mingo@...e.hu>
To: Linus Torvalds <torvalds@...ux-foundation.org>,
Herbert Xu <herbert@...dor.apana.org.au>,
Frank Blaschka <blaschka@...ux.vnet.ibm.com>,
"David S. Miller" <davem@...emloft.net>,
Thomas Gleixner <tglx@...utronix.de>,
Peter Zijlstra <a.p.zijlstra@...llo.nl>
Cc: Linux Kernel Mailing List <linux-kernel@...r.kernel.org>
Subject: Revert "gro: Fix legacy path napi_complete crash", (was: Re: Linux
2.6.29)
Yesterday about half of my testboxes (3 out of 7) started getting
weird networking failures: their network interface just got stuck
completely - no rx and no tx at all. Restarting the interface did
not help.
The failures were highly sporadic and not reproducible - they
triggered in distcc workloads, and on random kernels and seemingly
random .config's.
After spending most of today trying to find a good reproducer (my
regular tests weren't specific enough to catch it in any bisectable
manner), i settled on 4 parallel instances of TCP traffic:
nohup ssh testbox yes &
nohup ssh testbox yes &
nohup ssh testbox yes &
nohup ssh testbox yes &
[ over gigabit, forcedeth driver. ]
If the box hung within 15 minutes, the kernel was deemed bad. Using
that method i arrived at this upstream networking fix which was
merged yesterday:
| 303c6a0251852ecbdc5c15e466dcaff5971f7517 is first bad commit
| commit 303c6a0251852ecbdc5c15e466dcaff5971f7517
| Author: Herbert Xu <herbert@...dor.apana.org.au>
| Date: Tue Mar 17 13:11:29 2009 -0700
|
| gro: Fix legacy path napi_complete crash
Applying the straight revert below cured the problem - i now have 10
million packets and 30 minutes of uptime and the box is still fine.
bisection log:
[ 10 iterations ] good: 73bc6e1: Merge branch 'linus'
[ 3 iterations ] bad: 4eac7d0: Merge branch 'irq/threaded'
[ 6.0m packets ] good: e17bbdb: Merge branch 'tracing/core'
[ 0.1m packets ] bad: 8e0ee43: Linux 2.6.29
[ 0.1m packets ] bad: e2fc4d1: dca: add missing copyright/license headers
[ 0.2m packets ] bad: 4783256: virtio_net: Make virtio_net support carrier detection
[ 0.4m packets ] bad: 4ada810: Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/kaber/nf
[ 7.0m packets ] good: ec8d540: netfilter: conntrack: fix dropping packet after l4proto->packet()
[ 4.0m packets ] good: d1238d5: netfilter: conntrack: check for NEXTHDR_NONE before header sanity checking
[ 0.1m packets ] bad: 303c6a0: gro: Fix legacy path napi_complete crash
(the first column is the amount of testing each kernel got: test
iterations for the first two entries, millions of packets for the
rest.)
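The good/bad verdict used above (still passing traffic after 15
minutes vs. interface stuck) could be automated roughly as follows.
This is only a sketch, not the script used for this bisection: the
function names verdict and read_counter are made up here, and it
assumes the test box exposes a per-interface packet counter as a
plain file, such as /sys/class/net/eth0/statistics/rx_packets.

```shell
#!/bin/sh
# Hypothetical helper: print the current value of a packet counter file,
# e.g. /sys/class/net/eth0/statistics/rx_packets (assumed path).
read_counter() {
    cat "$1"
}

# verdict COUNTER_FILE INTERVAL_SECONDS ROUNDS
# Samples the counter once per interval. Prints "bad" and returns 1 as soon
# as two consecutive samples are equal (no packets moved, interface looks
# stuck). Prints "good" and returns 0 if the counter changed in every round.
verdict() {
    file=$1
    interval=$2
    rounds=$3
    prev=$(read_counter "$file") || return 2
    i=0
    while [ "$i" -lt "$rounds" ]; do
        sleep "$interval"
        cur=$(read_counter "$file") || return 2
        if [ "$cur" -eq "$prev" ]; then
            echo bad
            return 1
        fi
        prev=$cur
        i=$((i + 1))
    done
    echo good
    return 0
}
```

For example, "verdict /sys/class/net/eth0/statistics/rx_packets 10 90"
samples every 10 seconds for 15 minutes. It only makes sense while
the ssh load is running, and it has to run on the test box itself,
since a stuck interface also takes down any remote session.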
Looking at this commit also explains the asymmetric test pattern i
found amongst boxes: all boxes with a new-style NAPI driver (e1000e)
work - the others (forcedeth, 3c59x/vortex) have stuck interfaces.
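To sort a set of test boxes by driver like this, the driver bound to
an interface can be read from sysfs. The helper below is a sketch
only: the name nic_driver and the optional root argument are made up
for illustration, and it assumes the usual
/sys/class/net/<iface>/device/driver symlink is present.

```shell
#!/bin/sh
# nic_driver IFACE [SYSFS_ROOT]
# Prints the name of the kernel driver bound to a network interface by
# resolving the sysfs "driver" symlink. SYSFS_ROOT defaults to /sys; it is
# a parameter only so the function can be tried against a fake tree.
nic_driver() {
    iface=$1
    root=${2:-/sys}
    link="$root/class/net/$iface/device/driver"
    if [ ! -L "$link" ]; then
        echo "no driver link for $iface" >&2
        return 1
    fi
    # The link target ends in the driver name, e.g. .../drivers/forcedeth
    basename "$(readlink "$link")"
}
```

Running "nic_driver eth0" on each box gives the name to compare
against the working and failing groups above.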
I've attached the reproducer (non-SMP) .config. The system has:
00:0a.0 Bridge: nVidia Corporation CK804 Ethernet Controller (rev a3)
[ 34.722154] forcedeth: Reverse Engineered nForce ethernet driver. Version 0.62.
[ 34.729406] forcedeth 0000:00:0a.0: setting latency timer to 64
[ 34.735320] nv_probe: set workaround bit for reversed mac addr
[ 35.265783] PM: Adding info for No Bus:eth0
[ 35.270877] forcedeth 0000:00:0a.0: ifname eth0, PHY OUI 0x5043 @ 1, addr 00:13:d4:dc:41:12
[ 35.279086] forcedeth 0000:00:0a.0: highdma csum timirq gbit lnktim desc-v3
[ 35.286273] initcall init_nic+0x0/0x16 returned 0 after 550966 usecs
( but the bug does not seem to be driver specific - old-style NAPI
seems to be enough to trigger it. )
Please let me know if you need more info or if i can help with
testing a different patch. Bisecting it was hard, but testing
whether a fix patch does the trick will be a lot easier, as all
the testboxes are back in working order now.
Thanks,
Ingo
Signed-off-by: Ingo Molnar <mingo@...e.hu>
---
net/core/dev.c | 5 ++---
1 file changed, 2 insertions(+), 3 deletions(-)
Index: linux2/net/core/dev.c
===================================================================
--- linux2.orig/net/core/dev.c
+++ linux2/net/core/dev.c
@@ -2588,9 +2588,9 @@ static int process_backlog(struct napi_s
 		local_irq_disable();
 		skb = __skb_dequeue(&queue->input_pkt_queue);
 		if (!skb) {
+			__napi_complete(napi);
 			local_irq_enable();
-			napi_complete(napi);
-			goto out;
+			break;
 		}
 		local_irq_enable();
 
@@ -2599,7 +2599,6 @@ static int process_backlog(struct napi_s
 
 	napi_gro_flush(napi);
 
-out:
 	return work;
 }
 
View attachment "config" of type "text/plain" (67720 bytes)