netdev - Re: [Bugme-new] [Bug 9124] New: Netconsole race crashed the system

Message-ID: <470592E8.7080809@oracle.com>
Date:	Thu, 04 Oct 2007 18:27:04 -0700
From:	Tina Yang <tina.yang@...cle.com>
To:	Andrew Morton <akpm@...ux-foundation.org>
CC:	bugme-daemon@...zilla.kernel.org, netdev@...r.kernel.org
Subject: Re: [Bugme-new] [Bug 9124] New: Netconsole race crashed the system

Andrew Morton wrote:
> (Please respond by emailed reply-to-all, not via the bugzilla web interface)
>
> On Thu,  4 Oct 2007 16:24:18 -0700 (PDT)
> bugme-daemon@...zilla.kernel.org wrote:
>
>   
>> http://bugzilla.kernel.org/show_bug.cgi?id=9124
>>
>>            Summary: Netconsole race crashed the system
>>            Product: Networking
>>            Version: 2.5
>>      KernelVersion: 2.6.9, 2.6.18, 2.6.23
>>           Platform: All
>>         OS/Version: Linux
>>               Tree: Mainline
>>             Status: NEW
>>           Severity: high
>>           Priority: P1
>>          Component: Other
>>         AssignedTo: acme@...stprotocols.net
>>         ReportedBy: tina.yang@...cle.com
>>
>>
>> Most recent kernel where this bug did not occur:
>> Think the problem has always been there.
>> Distribution:
>> Hardware Environment:
>> DELL PowerEdge 2650 (x86)
>> DELL PowerEdge 2850(x86_64)
>> HP ProLiant DL380 G5 (x86_64) 
>> with various NICs - e1000, tg3, bnx2
>> Software Environment:
>> 2.6.9, 2.6.18, 2.6.23
>> Problem Description:
>> On 2.6.18 we found this issue on e1000 and tg3. On mainline 2.6.23-rc* we
>> found it on e100, tg3 and bnx2. The system either panicked at
>> netdevice.h:890 or hung, and sometimes, depending on which NIC was in use,
>> printed the following console messages:
>>  e1000:
>>       "e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang"
>>  tg3:
>>       "NETDEV WATCHDOG: eth4: transmit timed out"
>>       "tg3: eth4: transmit timed out, resetting"
>>
>> Steps to reproduce:
>> 1. On 2.6.18 (both x86 and x86_64), insert the netconsole module (NICs: e1000 and tg3).
>> 2. Run a moderate I/O load, preferably fio: one process doing async + direct I/O
>> using libaio.
>>
>> fio jobfile:
>> [global]
>> iodepth=1024
>> iodepth_batch=60
>> randrepeat=1
>> size=1024m
>> directory=/home/oracle
>> numjobs=2
>> [job1]
>> bs=8k
>> direct=1
>> ioengine=libaio
>> rw=randrw
>> filename=file1:file2
>>
>> 3. From a second console, as root, run "echo t > /proc/sysrq-trigger".
>>
>> The machine will hang instantly.
>>
>>
>> Crash stack captured on 2.6.9
>>        PANIC: "kernel BUG at include/linux/netdevice.h:888!"
>> #0 [ 23c5e60] disk_dump at f9ca71a2
>> #1 [ 23c5e64] printk at 21228d6
>> #2 [ 23c5e70] freeze_other_cpus at f9ca6ef5
>> #3 [ 23c5e80] start_disk_dump at f9ca6fa0
>> #4 [ 23c5e90] try_crashdump at 2133766
>> #5 [ 23c5e98] die at 2106354
>> #6 [ 23c5ecc] do_invalid_op at 210672f
>> #7 [ 23c5f7c] error_code (via invalid_op) at fffecede
>>    EAX: 00000006  EBX: 00200202  ECX: 00000000  EDX: df287000  EBP: e05ca000
>>    DS:  007b      ESI: 00000001  ES:  007b      EDI: e05ca240 
>>    CS:  0060      EIP: f8c82a08  ERR: ffffffff  EFLAGS: 00210046 
>> #8 [ 23c5fb8] tg3_poll at f8c82a08
>> #9 [ 23c5fd0] net_rx_action at 227a8da
>> #10 [ 23c5fe8] __do_softirq at 2126422
>> --- <soft IRQ> ---
>> #0 [25c71cac] do_softirq at 2108460
>> #1 [25c71cb4] dev_queue_xmit at 227a0d2
>> #2 [25c71ccc] ip_finish_output at 229288d
>> #3 [25c71ce4] ip_queue_xmit at 2292fa9
>> #4 [25c71dac] tcp_transmit_skb at 22a0ff7
>> #5 [25c71dec] tcp_write_xmit at 22a1901
>> #6 [25c71e10] tcp_sendmsg at 2297d6d
>> #7 [25c71e80] sock_aio_write at 2272512
>> #8 [25c71eec] do_sync_write at 215a444
>> #9 [25c71f88] vfs_write at 215a53a
>> #10 [25c71fa4] sys_write at 215a5f4
>> #11 [25c71fc0] system_call at fffec219 
>>
>> net_device in memory,
>>   name = "eth0\000\000\000\000\000\000\000\000\000\000\000", 
>>  ...
>>
>>
>> Crash stack captured on 2.6.18
>>        PANIC: "kernel BUG at include/linux/netdevice.h:890!"
>>  #0 [c072ce30] crash_kexec at c044418a
>>  #1 [c072ce74] die at c04054d0
>>  #2 [c072cea4] do_invalid_op at c0405c20
>>  #3 [c072cf54] error_code (via invalid_op) at c0404ab3
>>     EAX: 00000007  EBX: 00000202  ECX: 00000000  EDX: f6d9c000  EBP: f6d9c400 
>>     DS:  007b      ESI: 00000001  ES:  007b      EDI: cb02b280 
>>     CS:  0060      EIP: f8927791  ERR: ffffffff  EFLAGS: 00010046 
>>  #4 [c072cf88] tg3_poll at f8927791
>> --- <soft IRQ> ---
>>  #0 [f7e54f60] do_softirq at c0406433
>>  #1 [f7e54f6c] do_IRQ at c0406425
>>  #2 [f7e54fb4] cpu_idle at c0402c8e
>>
>> net_device in memory,
>>   name = "eth4\000\000\000\000\000\000\000\000\000\000\000", 
>>   name_hlist = {
>>     next = 0x0, 
>>     pprev = 0xc07d0148
>>   }, 
>>   ...
>>
>>     
>
> OK, but in my 2.6.18, include/linux/netdevice.h:890 is a
> local_irq_restore() in netif_rx_complete().  I don't see how that can go
> BUG.
>
> Does your 2.6.18 have any patches applied?
>
> Please tell us what is at include/linux/netdevice.h:890 in your 2.6.18
> tree.
>

    netdevice.h is attached. Line 890 reads:
    890         BUG_ON(!test_bit(__LINK_STATE_RX_SCHED, &dev->state));
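
    For readers without the attachment, here is a minimal userspace sketch of
    what that check guards against. It is only a model: `model_dev`,
    `MODEL_RX_SCHED` and the `model_*` functions are hypothetical stand-ins, not
    kernel APIs, and the double-completion scenario is an assumed way the bit
    could already be clear, not a confirmed diagnosis of this bug.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the __LINK_STATE_RX_SCHED bit in dev->state. */
#define MODEL_RX_SCHED 1UL

/* Hypothetical, stripped-down stand-in for struct net_device. */
struct model_dev {
    unsigned long state;
};

/* Claim the device for polling; fails if a poll is already scheduled. */
static bool model_rx_schedule(struct model_dev *dev)
{
    if (dev->state & MODEL_RX_SCHED)
        return false;
    dev->state |= MODEL_RX_SCHED;
    return true;
}

/*
 * Finish a poll. Completion is only legal while the bit is set.
 * Returns false at the point where the kernel's BUG_ON would fire.
 */
static bool model_rx_complete(struct model_dev *dev)
{
    if (!(dev->state & MODEL_RX_SCHED))
        return false;               /* the real code would BUG here */
    dev->state &= ~MODEL_RX_SCHED;
    return true;
}

/*
 * Assumed scenario: one scheduled poll is completed twice, for example by
 * two different callers of the driver's poll routine. The first completion
 * clears the bit, so the second one finds it clear.
 */
static bool model_double_complete(void)
{
    struct model_dev dev = { 0 };
    bool claimed = model_rx_schedule(&dev);

    assert(claimed);                /* a fresh device is always claimable */

    bool first = model_rx_complete(&dev);
    bool second = model_rx_complete(&dev);

    return first && !second;
}
```

    In this model, model_double_complete() returns true: the first completion
    succeeds and the second finds the bit already clear, which is the condition
    under which the real BUG_ON would trigger.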
   

View attachment "netdevice.h" of type "text/plain" (32238 bytes)