Message-ID: <20071004202431.542b4caf@freepuppy.rosehill>
Date: Thu, 4 Oct 2007 20:24:31 -0700
From: Stephen Hemminger <shemminger@...ux-foundation.org>
To: Tina Yang <tina.yang@...cle.com>
Cc: Andrew Morton <akpm@...ux-foundation.org>,
bugme-daemon@...zilla.kernel.org, netdev@...r.kernel.org
Subject: Re: [Bugme-new] [Bug 9124] New: Netconsole race crashed the system
On Thu, 04 Oct 2007 18:27:04 -0700
Tina Yang <tina.yang@...cle.com> wrote:
> Andrew Morton wrote:
> > (Please respond by emailed reply-to-all, not via the bugzilla web interface)
> >
> > On Thu, 4 Oct 2007 16:24:18 -0700 (PDT)
> > bugme-daemon@...zilla.kernel.org wrote:
> >
> >
> >> http://bugzilla.kernel.org/show_bug.cgi?id=9124
> >>
> >> Summary: Netconsole race crashed the system
> >> Product: Networking
> >> Version: 2.5
> >> KernelVersion: 2.6.9, 2.6.18, 2.6.23
> >> Platform: All
> >> OS/Version: Linux
> >> Tree: Mainline
> >> Status: NEW
> >> Severity: high
> >> Priority: P1
> >> Component: Other
> >> AssignedTo: acme@...stprotocols.net
> >> ReportedBy: tina.yang@...cle.com
> >>
> >>
> >> Most recent kernel where this bug did not occur:
> >> Think the problem has always been there.
> >> Distribution:
> >> Hardware Environment:
> >> DELL PowerEdge 2650 (x86)
> >> DELL PowerEdge 2850(x86_64)
> >> HP ProLiant DL380 G5 (x86_64)
> >> with various NICs - e1000, tg3, bnx2
> >> Software Environment:
> >> 2.6.9, 2.6.18, 2.6.23
> >> Problem Description:
> >> On 2.6.18 we found this issue on e1000 and tg3. On mainline 2.6.23-rc* we
> >> found this issue on e100, tg3 and bnx2. It either panicked
> >> at netdevice.h:890 or hung the system, and sometimes, depending
> >> on which NICs are used, printed the following console messages:
> >> e1000:
> >> "e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang"
> >> tg3:
> >> "NETDEV WATCHDOG: eth4: transmit timed out"
> >> "tg3: eth4: transmit timed out, resetting"
> >>
> >> Steps to reproduce:
> >> 1. On 2.6.18 (both x86 and x86_64) insert the netconsole module (NIC: e1000 and tg3).
> >> 2. Run a moderate I/O load, preferably fio: one process doing async + direct I/O
> >> using libaio.
> >>
> >> fio jobfile:
> >> [global]
> >> iodepth=1024
> >> iodepth_batch=60
> >> randrepeat=1
> >> size=1024m
> >> directory=/home/oracle
> >> numjobs=2
> >> [job1]
> >> bs=8k
> >> direct=1
> >> ioengine=libaio
> >> rw=randrw
> >> filename=file1:file2
> >>
> >> 3. From a second console, as root, run "echo t > /proc/sysrq-trigger"
> >>
> >> Machine will instantly hang.
> >>
> >>
> >> Crash stack captured on 2.6.9
> >> PANIC: "kernel BUG at include/linux/netdevice.h:888!"
> >> #0 [ 23c5e60] disk_dump at f9ca71a2
> >> #1 [ 23c5e64] printk at 21228d6
> >> #2 [ 23c5e70] freeze_other_cpus at f9ca6ef5
> >> #3 [ 23c5e80] start_disk_dump at f9ca6fa0
> >> #4 [ 23c5e90] try_crashdump at 2133766
> >> #5 [ 23c5e98] die at 2106354
> >> #6 [ 23c5ecc] do_invalid_op at 210672f
> >> #7 [ 23c5f7c] error_code (via invalid_op) at fffecede
> >> EAX: 00000006 EBX: 00200202 ECX: 00000000 EDX: df287000 EBP: e05ca000
> >> DS: 007b ESI: 00000001 ES: 007b EDI: e05ca240
> >> CS: 0060 EIP: f8c82a08 ERR: ffffffff EFLAGS: 00210046
> >> #8 [ 23c5fb8] tg3_poll at f8c82a08
> >> #9 [ 23c5fd0] net_rx_action at 227a8da
> >> #10 [ 23c5fe8] __do_softirq at 2126422
> >> --- <soft IRQ> ---
> >> #0 [25c71cac] do_softirq at 2108460
> >> #1 [25c71cb4] dev_queue_xmit at 227a0d2
> >> #2 [25c71ccc] ip_finish_output at 229288d
> >> #3 [25c71ce4] ip_queue_xmit at 2292fa9
> >> #4 [25c71dac] tcp_transmit_skb at 22a0ff7
> >> #5 [25c71dec] tcp_write_xmit at 22a1901
> >> #6 [25c71e10] tcp_sendmsg at 2297d6d
> >> #7 [25c71e80] sock_aio_write at 2272512
> >> #8 [25c71eec] do_sync_write at 215a444
> >> #9 [25c71f88] vfs_write at 215a53a
> >> #10 [25c71fa4] sys_write at 215a5f4
> >> #11 [25c71fc0] system_call at fffec219
> >>
> >> net_device in memory,
> >> name = "eth0\000\000\000\000\000\000\000\000\000\000\000",
> >> ...
> >>
> >>
> >> Crash stack captured on 2.6.18
> >> PANIC: "kernel BUG at include/linux/netdevice.h:890!"
> >> #0 [c072ce30] crash_kexec at c044418a
> >> #1 [c072ce74] die at c04054d0
> >> #2 [c072cea4] do_invalid_op at c0405c20
> >> #3 [c072cf54] error_code (via invalid_op) at c0404ab3
> >> EAX: 00000007 EBX: 00000202 ECX: 00000000 EDX: f6d9c000 EBP: f6d9c400
> >> DS: 007b ESI: 00000001 ES: 007b EDI: cb02b280
> >> CS: 0060 EIP: f8927791 ERR: ffffffff EFLAGS: 00010046
> >> #4 [c072cf88] tg3_poll at f8927791
> >> --- <soft IRQ> ---
> >> #0 [f7e54f60] do_softirq at c0406433
> >> #1 [f7e54f6c] do_IRQ at c0406425
> >> #2 [f7e54fb4] cpu_idle at c0402c8e
> >>
> >> net_device in memory,
> >> name = "eth4\000\000\000\000\000\000\000\000\000\000\000",
> >> name_hlist = {
> >> next = 0x0,
> >> pprev = 0xc07d0148
> >> },
> >> ...
> >>
> >>
> >
> > OK, but in my 2.6.18, include/linux/netdevice.h:890 is a
> > local_irq_restore() in netif_rx_complete(). I don't see how that can go
> > BUG.
> >
> > Does your 2.6.18 have any patches applied?
> >
> > Please tell us what is at include/linux/netdevice.h:890 in your 2.6.18
> > tree.
> >
> > -
> > To unsubscribe from this list: send the line "unsubscribe netdev" in
> > the body of a message to majordomo@...r.kernel.org
> > More majordomo info at http://vger.kernel.org/majordomo-info.html
> >
>
> netdevice.h attached.
> 890 BUG_ON(!test_bit(__LINK_STATE_RX_SCHED, &dev->state));
>
Comparing your version with the original 2.6.18 from kernel.org git shows:
--- 2.6.18/include/linux/netdevice.h 2007-10-04 20:14:51.000000000 -0700
+++ tina/include/linux//netdevice.h 2007-10-04 20:16:19.000000000 -0700
@@ -342,6 +342,9 @@
/* Instance data managed by the core of Wireless Extensions. */
struct iw_public_data * wireless_data;
+ /* pending config used by cfg80211/wext compat code only */
+ void *cfg80211_wext_pending_config;
+
struct ethtool_ops *ethtool_ops;
/*
@@ -386,6 +389,7 @@
void *ip6_ptr; /* IPv6 specific data */
void *ec_ptr; /* Econet specific data */
void *ax25_ptr; /* AX.25 specific data */
+ void *ieee80211_ptr; /* IEEE 802.11 specific data */
/*
* Cache line mostly used on receive path (including eth_type_trans())
So you are not using a "pure" v2.6.18 kernel from kernel.org but more likely
a distribution kernel that had already integrated the mac80211 stuff.
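
For anyone following along, the check quoted at line 890 can be modeled in
user space. This is only an illustrative sketch: model_dev, model_rx_schedule()
and model_rx_complete() are hypothetical names, not kernel functions, and a
plain bit mask stands in for test_bit()/clear_bit() on dev->state. It shows the
invariant that the BUG_ON asserts: completing a poll is only valid while the
RX_SCHED bit is set, so a second completion of the same scheduling round trips
the check. It makes no claim about which path cleared the bit in the reported
crash.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the kernel's __LINK_STATE_RX_SCHED bit number. */
enum { MODEL_RX_SCHED = 0 };

/* Minimal model of the one net_device field the check looks at. */
struct model_dev {
	unsigned long state;	/* models dev->state */
};

/*
 * Model of scheduling a device for polling: only the caller that
 * actually sets the bit gets "true" back.
 */
static bool model_rx_schedule(struct model_dev *dev)
{
	unsigned long mask = 1UL << MODEL_RX_SCHED;

	if (dev->state & mask)
		return false;	/* already scheduled by someone else */
	dev->state |= mask;
	return true;
}

/*
 * Model of completing a poll. Returns false in the situation where
 * BUG_ON(!test_bit(__LINK_STATE_RX_SCHED, &dev->state)) would fire,
 * i.e. the bit is already clear when completion is attempted.
 */
static bool model_rx_complete(struct model_dev *dev)
{
	unsigned long mask = 1UL << MODEL_RX_SCHED;

	if (!(dev->state & mask))
		return false;	/* the kernel would BUG here */
	dev->state &= ~mask;
	return true;
}
```

Scheduling a zero-initialized model_dev once and then completing it twice makes
the second model_rx_complete() return false, which is the condition under which
the quoted BUG_ON would trigger.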
--
Stephen Hemminger <shemminger@...ux-foundation.org>