Date:	Thu, 04 Oct 2007 20:56:50 -0700
From:	Tina Yang <tina.yang@...cle.com>
To:	Stephen Hemminger <shemminger@...ux-foundation.org>
CC:	Andrew Morton <akpm@...ux-foundation.org>,
	bugme-daemon@...zilla.kernel.org, netdev@...r.kernel.org
Subject: Re: [Bugme-new] [Bug 9124] New: Netconsole race crashed the system

Stephen Hemminger wrote:
> On Thu, 04 Oct 2007 18:27:04 -0700
> Tina Yang <tina.yang@...cle.com> wrote:
>
>   
>> Andrew Morton wrote:
>>     
>>> (Please respond by emailed reply-to-all, not via the bugzilla web interface)
>>>
>>> On Thu,  4 Oct 2007 16:24:18 -0700 (PDT)
>>> bugme-daemon@...zilla.kernel.org wrote:
>>>
>>>   
>>>       
>>>> http://bugzilla.kernel.org/show_bug.cgi?id=9124
>>>>
>>>>            Summary: Netconsole race crashed the system
>>>>            Product: Networking
>>>>            Version: 2.5
>>>>      KernelVersion: 2.6.9, 2.6.18, 2.6.23
>>>>           Platform: All
>>>>         OS/Version: Linux
>>>>               Tree: Mainline
>>>>             Status: NEW
>>>>           Severity: high
>>>>           Priority: P1
>>>>          Component: Other
>>>>         AssignedTo: acme@...stprotocols.net
>>>>         ReportedBy: tina.yang@...cle.com
>>>>
>>>>
>>>> Most recent kernel where this bug did not occur:
>>>> Think the problem has always been there.
>>>> Distribution:
>>>> Hardware Environment:
>>>> DELL PowerEdge 2650 (x86)
>>>> DELL PowerEdge 2850(x86_64)
>>>> HP ProLiant DL380 G5 (x86_64) 
>>>> with various NICs - e1000, tg3, bnx2
>>>> Software Environment:
>>>> 2.6.9, 2.6.18, 2.6.23
>>>> Problem Description:
>>>> On 2.6.18 found this issue on e1000 and tg3. On mainline 2.6.23-rc* found this
>>>>  issue on e100, tg3 and bnx2.  It either panicked
>>>> at netdevice.h:890 or hung the system, and sometimes, depending
>>>> on which NIC is used, printed the following console message:
>>>>  e1000:
>>>>       "e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang"
>>>>  tg3:
>>>>       "NETDEV WATCHDOG: eth4: transmit timed out"
>>>>       "tg3: eth4: transmit timed out, resetting"
>>>>
>>>> Steps to reproduce:
>>>> 1. On 2.6.18 (both x86 and x86_64) insert the netconsole module. (NIC: e1000 and tg3)
>>>> 2. Run a moderate I/O load, preferably fio: one process doing async + direct I/O
>>>> using libaio
>>>>
>>>> fio jobfile:
>>>> [global]
>>>> iodepth=1024
>>>> iodepth_batch=60
>>>> randrepeat=1
>>>> size=1024m
>>>> directory=/home/oracle
>>>> numjobs=2
>>>> [job1]
>>>> bs=8k
>>>> direct=1
>>>> ioengine=libaio
>>>> rw=randrw
>>>> filename=file1:file2
>>>>
>>>> 3. From a second console, as root, run "echo t > /proc/sysrq-trigger"
>>>>
>>>> Machine will instantly hang.
>>>>
>>>>
>>>> Crash stack captured on 2.6.9
>>>>        PANIC: "kernel BUG at include/linux/netdevice.h:888!"
>>>> #0 [ 23c5e60] disk_dump at f9ca71a2
>>>> #1 [ 23c5e64] printk at 21228d6
>>>> #2 [ 23c5e70] freeze_other_cpus at f9ca6ef5
>>>> #3 [ 23c5e80] start_disk_dump at f9ca6fa0
>>>> #4 [ 23c5e90] try_crashdump at 2133766
>>>> #5 [ 23c5e98] die at 2106354
>>>> #6 [ 23c5ecc] do_invalid_op at 210672f
>>>> #7 [ 23c5f7c] error_code (via invalid_op) at fffecede
>>>>    EAX: 00000006  EBX: 00200202  ECX: 00000000  EDX: df287000  EBP: e05ca000
>>>>    DS:  007b      ESI: 00000001  ES:  007b      EDI: e05ca240 
>>>>    CS:  0060      EIP: f8c82a08  ERR: ffffffff  EFLAGS: 00210046 
>>>> #8 [ 23c5fb8] tg3_poll at f8c82a08
>>>> #9 [ 23c5fd0] net_rx_action at 227a8da
>>>> #10 [ 23c5fe8] __do_softirq at 2126422
>>>> --- <soft IRQ> ---
>>>> #0 [25c71cac] do_softirq at 2108460
>>>> #1 [25c71cb4] dev_queue_xmit at 227a0d2
>>>> #2 [25c71ccc] ip_finish_output at 229288d
>>>> #3 [25c71ce4] ip_queue_xmit at 2292fa9
>>>> #4 [25c71dac] tcp_transmit_skb at 22a0ff7
>>>> #5 [25c71dec] tcp_write_xmit at 22a1901
>>>> #6 [25c71e10] tcp_sendmsg at 2297d6d
>>>> #7 [25c71e80] sock_aio_write at 2272512
>>>> #8 [25c71eec] do_sync_write at 215a444
>>>> #9 [25c71f88] vfs_write at 215a53a
>>>> #10 [25c71fa4] sys_write at 215a5f4
>>>> #11 [25c71fc0] system_call at fffec219 
>>>>
>>>> net_device in memory,
>>>>   name = "eth0\000\000\000\000\000\000\000\000\000\000\000", 
>>>>  ...
>>>>
>>>>
>>>> Crash stack captured on 2.6.18
>>>>        PANIC: "kernel BUG at include/linux/netdevice.h:890!"
>>>>  #0 [c072ce30] crash_kexec at c044418a
>>>>  #1 [c072ce74] die at c04054d0
>>>>  #2 [c072cea4] do_invalid_op at c0405c20
>>>>  #3 [c072cf54] error_code (via invalid_op) at c0404ab3
>>>>     EAX: 00000007  EBX: 00000202  ECX: 00000000  EDX: f6d9c000  EBP: f6d9c400 
>>>>     DS:  007b      ESI: 00000001  ES:  007b      EDI: cb02b280 
>>>>     CS:  0060      EIP: f8927791  ERR: ffffffff  EFLAGS: 00010046 
>>>>  #4 [c072cf88] tg3_poll at f8927791
>>>> --- <soft IRQ> ---
>>>>  #0 [f7e54f60] do_softirq at c0406433
>>>>  #1 [f7e54f6c] do_IRQ at c0406425
>>>>  #2 [f7e54fb4] cpu_idle at c0402c8e
>>>>
>>>> net_device in memory,
>>>>   name = "eth4\000\000\000\000\000\000\000\000\000\000\000", 
>>>>   name_hlist = {
>>>>     next = 0x0, 
>>>>     pprev = 0xc07d0148
>>>>   }, 
>>>>   ...
>>>>
>>>>     
>>>>         
>>> OK, but in my 2.6.18, include/linux/netdevice.h:890 is a
>>> local_irq_restore() in netif_rx_complete().  I don't see how that can go
>>> BUG.
>>>
>>> Does your 2.6.18 have any patches applied?
>>>
>>> Please tell us what is at include/linux/netdevice.h:890 in your 2.6.18
>>> tree.
>>>
>>>   
>>>       
>>     netdevice.h attached.
>>     890         BUG_ON(!test_bit(__LINK_STATE_RX_SCHED, &dev->state));
>>    
>>     
>
> Comparing your version with the original 2.6.18 from kernel.org git shows:
>
> --- 2.6.18/include/linux/netdevice.h	2007-10-04 20:14:51.000000000 -0700
> +++ tina/include/linux//netdevice.h	2007-10-04 20:16:19.000000000 -0700
> @@ -342,6 +342,9 @@
>  	/* Instance data managed by the core of Wireless Extensions. */
>  	struct iw_public_data *	wireless_data;
>  
> +	/* pending config used by cfg80211/wext compat code only */
> +	void *cfg80211_wext_pending_config;
> +
>  	struct ethtool_ops *ethtool_ops;
>  
>  	/*
> @@ -386,6 +389,7 @@
>  	void                    *ip6_ptr;       /* IPv6 specific data */
>  	void			*ec_ptr;	/* Econet specific data	*/
>  	void			*ax25_ptr;	/* AX.25 specific data */
> +	void			*ieee80211_ptr;	/* IEEE 802.11 specific data */
>  
>  /*
>   * Cache line mostly used on receive path (including eth_type_trans())
>
>
> So you are not using a "pure" v2.6.18 kernel from kernel.org but more likely
> a distribution kernel that had already integrated the mac80211 stuff.
>
>
>   
Yes, it's RHEL5 2.6.18-8.  Attached is the 2.6.9-42 version that doesn't
have 802.11 and crashed at the same spot - netdevice.h:888.  Also crashed
are 2.6.23-rc2 and rc4.
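
For readers following the check quoted above at line 890,
BUG_ON(!test_bit(__LINK_STATE_RX_SCHED, &dev->state)), here is a minimal
user-space sketch of what that assertion guards. It is not kernel code.
The names model_dev, MODEL_RX_SCHED, model_rx_schedule, model_rx_complete
and demo_double_complete are hypothetical, and the scenario (two contexts
completing the same poll) is only an assumption about how such a race
could trip the check, not a confirmed diagnosis of this bug.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the __LINK_STATE_RX_SCHED bit in dev->state. */
#define MODEL_RX_SCHED 0x1u

/* Hypothetical, heavily reduced model of a net_device's state word. */
struct model_dev {
    unsigned int state;
};

/* Models scheduling a poll: returns true only if this caller set the bit. */
static bool model_rx_schedule(struct model_dev *dev)
{
    if (dev->state & MODEL_RX_SCHED)
        return false;               /* already scheduled by someone else */
    dev->state |= MODEL_RX_SCHED;
    return true;
}

/*
 * Models completing a poll. Returns false in the situation where the
 * quoted BUG_ON would fire: the bit is already clear when we complete.
 */
static bool model_rx_complete(struct model_dev *dev)
{
    if (!(dev->state & MODEL_RX_SCHED))
        return false;               /* BUG_ON case in this model */
    dev->state &= ~MODEL_RX_SCHED;
    return true;
}

/* Assumed scenario: one schedule, then two completions of the same poll. */
static void demo_double_complete(void)
{
    struct model_dev dev = { 0 };

    assert(model_rx_schedule(&dev));
    assert(model_rx_complete(&dev));    /* first completion is fine */
    assert(!model_rx_complete(&dev));   /* second one would hit the check */
}
```

Calling demo_double_complete() from a main() runs the sequence. The model
is single-threaded on purpose: it shows only the state the assertion
rejects, not how two kernel contexts would actually interleave to get there.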



View attachment "netdevice.h" of type "text/plain" (29555 bytes)
