netdev - Re: 2.6.33-rc5: (e1000): transmit queue 0 timed out

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <4807377b1001261712m9d161b6s700cc76b2f2edaf4@mail.gmail.com>
Date:	Tue, 26 Jan 2010 17:12:16 -0800
From:	Jesse Brandeburg <jesse.brandeburg@...il.com>
To:	Alexander Beregalov <a.beregalov@...il.com>,
	Jesse Brandeburg <jesse.brandeburg@...el.com>
Cc:	"Rafael J. Wysocki" <rjw@...k.pl>, netdev <netdev@...r.kernel.org>,
	e1000-devel@...ts.sourceforge.net
Subject: Re: 2.6.33-rc5: (e1000): transmit queue 0 timed out

I also just noticed something else.

On Sat, Jan 23, 2010 at 12:04 PM, Alexander Beregalov
<a.beregalov@...il.com> wrote:
>>> Pid: 5, comm: events/0 Tainted: G        W  2.6.33-rc5 #1
>>> NF7-S/NF7,NF7-V (nVidia-nForce2)/
>>> EIP: 0060:[<c1071c51>] EFLAGS: 00010282 CPU: 0
>>> EIP is at put_page+0x11/0x120
>>> EAX: 2e8ca4f3 EBX: 2e8ca4f3 ECX: 00000000 EDX: ee960640
>>> ESI: f6482620 EDI: 000016b0 EBP: f7065ea8 ESP: f7065e98
>>>  DS: 007b ES: 007b FS: 0000 GS: 0000 SS: 0068
>>> Process events/0 (pid: 5, ti=f7064000 task=f70553c0 task.ti=f7064000)
>>> Stack:
>>>  00000206 00000001 f6482620 000016b0 f7065eb8 c12d3100 f6482620 f71d9f50
>>> <0> f7065ec4 c12d2e32 f80376b0 f7065ecc c12d2ec5 f7065f00 c1276970 cccccccd
>>> <0> f7065f00 f711fafc f711fafc f711faa0 00000000 f702b440 000000f2 f702b440
>>> Call Trace:
>>>  [<c12d3100>] ? skb_release_data+0x90/0xa0
>>>  [<c12d2e32>] ? __kfree_skb+0x12/0x90
>>>  [<c12d2ec5>] ? consume_skb+0x15/0x30
>>>  [<c1276970>] ? e1000_clean_rx_ring+0x80/0x150
>>>  [<c127c743>] ? e1000_down+0x1b3/0x1d0
>>>  [<c127cf60>] ? e1000_reset_task+0x0/0x10
>>>  [<c127cd3b>] ? e1000_reinit_locked+0x4b/0x70
>>>  [<c127cf6d>] ? e1000_reset_task+0xd/0x10
>>>  [<c103a9ea>] ? worker_thread+0x14a/0x230
>>>  [<c103a989>] ? worker_thread+0xe9/0x230
>>>  [<c103e160>] ? autoremove_wake_function+0x0/0x40
>>>  [<c103a8a0>] ? worker_thread+0x0/0x230
>>>  [<c103de6c>] ? kthread+0x6c/0x80
>>>  [<c103de00>] ? kthread+0x0/0x80
>>>  [<c100303a>] ? kernel_thread_helper+0x6/0x1c
> WARNING: at net/sched/sch_generic.c:255 dev_watchdog+0x1c5/0x1d0()
> Hardware name:
> NETDEV WATCHDOG: eth0 (e1000): transmit queue 0 timed out

There are at least two problems here, not sure if they are related yet.
The first is the WATCHDOG (the kernel will print this only once per
e1000e driver load), we don't know what is causing that right now,
maybe we can mock up some ftrace magic to dump the transmit
descriptors that are formed, or you can run the e1000_dump patch code.
 Let me know if you want me to generate a version of it for you.

> <..>
> BUG kmalloc-2048: Poison overwritten
> -----------------------------------------------------------------------------
>
> INFO: 0xf4998022-0xf4998607. First byte 0x0 instead of 0x6b
> INFO: Allocated in __netdev_alloc_skb+0x1e/0x40 age=372 cpu=0 pid=1724
> INFO: Freed in skb_release_data+0x68/0xa0 age=292 cpu=0 pid=5
> INFO: Slab 0xc2283300 objects=15 used=0 fp=0xf499b950 flags=0x40004082
> INFO: Object 0xf4998000 @offset=0 fp=0xf499d1e0
>
>  Object 0xf4998000:  6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
> kkkkkkkkkkkkkkkk
>  Object 0xf4998010:  6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
> kkkkkkkkkkkkkkkk
>  Object 0xf4998020:  6b 6b 00 07 e9 09 d4 79 01 14 2b 09 0b 28 08 00
> kk..И.тy..+..(..
>  Object 0xf4998030:  45 20 05 d4 7d 09 40 00 75 06 3c e5 d5 b6 b0 b4
> E..т}.@.u.<Еу╤╟╢
>  Object 0xf4998040:  c1 a8 01 02 23 28 af a3 57 1f 82 b2 c7 a5 14 d5
> а╗..#(╞ёW..╡г╔.у
> <..>

hey, thats an ethernet/ipv4 packet!  The memory above is typically
allocated at a 2kB boundary and then the hardware would start DMAing
at 000+22...  how long was that packet corrupting data in memory (how
many bytes)?

00 07 e9 09 d4 79
dest mac
01 14 2b 09 0b 28
src mac address
08 00
ip header
45 20
header length 20 bytes, ip v4, DSCP = ECN capable
and on...

so this is the second issue, we've called kfree on a packet but
hardware still receives into it (probably that we didn't wait long
enough for receives to quit)

>  Object 0xf49987d0:  6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
> kkkkkkkkkkkkkkkk
>  Object 0xf49987e0:  6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
> kkkkkkkkkkkkkkkk
>  Object 0xf49987f0:  6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b a5
> kkkkkkkkkkkkkkk╔
>  Redzone 0xf4998800:  bb bb bb bb                                     ╩╩╩╩
>  Padding 0xf4998828:  5a 5a 5a 5a 5a 5a 5a 5a                         ZZZZZZZZ
> Pid: 1719, comm: rtorrent Tainted: G        W  2.6.33-rc5 #1
> Call Trace:
>  [<c108d1af>] print_trailer+0xcf/0x120
>  [<c108d844>] check_bytes_and_report+0xc4/0xf0
>  [<c108da1f>] check_object+0x1af/0x200
>  [<c108db4a>] __free_slab+0xda/0x100
>  [<c108db8c>] discard_slab+0x1c/0x30
>  [<c108f552>] __slab_free+0xd2/0x280
>  [<c11d7156>] ? copy_to_user+0x36/0x130
>  [<c108f9df>] kfree+0xdf/0x110
>  [<c12d30d8>] ? skb_release_data+0x68/0xa0
>  [<c12d30d8>] ? skb_release_data+0x68/0xa0
>  [<c12d30d8>] skb_release_data+0x68/0xa0
>  [<c12d2e32>] __kfree_skb+0x12/0x90
>  [<c13006a0>] tcp_recvmsg+0x6c0/0x8d0
>  [<c102f691>] ? local_bh_enable_ip+0x61/0xc0
>  [<c1350b75>] ? _raw_spin_unlock_bh+0x25/0x30
>  [<c10435e5>] ? T.324+0x15/0x1b0
>  [<c12cdc73>] sock_common_recvmsg+0x43/0x60
>  [<c12cbc87>] sock_recvmsg+0xb7/0xf0
>  [<c10435e5>] ? T.324+0x15/0x1b0
>  [<c12cc589>] sys_recvfrom+0x79/0xe0
>  [<c104d66b>] ? trace_hardirqs_off+0xb/0x10
>  [<c104398e>] ? cpu_clock+0x4e/0x60
>  [<c104d6a7>] ? lock_release_holdtime+0x37/0x1b0
>  [<c1052121>] ? lock_release_non_nested+0x301/0x340
>  [<c104d6a7>] ? lock_release_holdtime+0x37/0x1b0
>  [<c107bd0a>] ? might_fault+0x4a/0xa0
>  [<c12cc626>] sys_recv+0x36/0x40
>  [<c12cd74c>] sys_socketcall+0x1ac/0x270
>  [<c11d68f4>] ? trace_hardirqs_on_thunk+0xc/0x10
>  [<c1002b10>] sysenter_do_call+0x12/0x36
> FIX kmalloc-2048: Restoring 0xf4998022-0xf4998607=0x6b

I'll reply here with a patch for checking if the receive unit is still
not stopped, but I still don't know what is causing the NETDEV
WATCHDOG and we need some more debug information for that (from the
e1000_dump patch)

can you answer any of my other questions too?
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html