Message-ID: <20091229103310.46b8c670@pluto.restena.lu>
Date: Tue, 29 Dec 2009 10:33:10 +0100
From: Bruno Prémont <bonbons@...ux-vserver.org>
To: "Benjamin Li" <benli@...adcom.com>
Cc: "netdev@...r.kernel.org" <netdev@...r.kernel.org>,
"Michael Chan" <mchan@...adcom.com>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: Re: BNX2: Kernel crashes with 2.6.31 and 2.6.31.9
Hi Benjamin,
On Tue, 29 Dec 2009 01:05:40 "Benjamin Li" <benli@...adcom.com> wrote:
> Hi Bruno,
>
> It looks like the NULL dereference is happening at a0fc.
>
> a0f8: 48 8b 42 70 mov 0x70(%rdx),%rax
> a0fc: 0f b7 10 movzwl (%rax),%edx
> a0ff: 31 c0 xor %eax,%eax
Thanks for confirming my guess.
> The offset of 0x70 (112) is the hw_tx_cons_ptr field in the bnx2_napi
> structure (see the bnx2_napi structure dump below). These lines come
> from the routine bnx2_get_hw_tx_cons(), which looks like it was inlined
> by the compiler. More specifically, it looks like the dereference of
> hw_tx_cons_ptr failed.
>
> cons = *bnapi->hw_tx_cons_ptr;
>
> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=blob;f=drivers/net/bnx2.c;h=06b901152d4487fa04164437cc179661b44657fe;hb=74fca6a42863ffacaf7ba6f1936a9f228950f657#l2761
>
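As a rough illustration of the reasoning above, here is a user-space sketch of what the inlined helper appears to do, reconstructed only from the quoted source line and the three disassembled instructions. The struct name, function name and the mask macro are hypothetical stand-ins, and the 0xff mask is inferred from the `cmp $0xff,%dl` in the listing rather than taken from the driver headers.

```c
#include <stdint.h>

/* Assumption: mask value inferred from `cmp $0xff,%dl` in the listing. */
#define TX_CONS_SKIP_MASK 0xffu

/* Hypothetical stand-in holding only the field that matters here. */
struct napi_sketch {
    const volatile uint16_t *hw_tx_cons_ptr; /* normally points into the status block */
};

static uint16_t get_hw_tx_cons_sketch(const struct napi_sketch *bnapi)
{
    /* Corresponds to `movzwl (%rax),%edx`: faults if the pointer is NULL. */
    uint16_t cons = *bnapi->hw_tx_cons_ptr;

    /* Corresponds to `cmp $0xff,%dl; sete %al; add %eax,%edx`. */
    if ((cons & TX_CONS_SKIP_MASK) == TX_CONS_SKIP_MASK)
        cons++;
    return cons;
}
```

If this reading is right, the oops at address (null) means the pointer stored at offset 0x70 was itself NULL when bnx2_poll_work() ran, not that the status block contents were bad.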
> To be sure this is the case, could you send the .config file you are
> using, or the bnx2 kernel module built with the CFLAG '-g'? Then we
> can verify exactly where in the code it is crashing.
See attached .config. If needed I can recompile the module with '-g',
but the original instance does not contain debugging info.
> Did you see anything suspicious in the system kernel logs? If you
> could isolate the logs from when the machine booted to when it crashed
> and send them to us it would be very helpful.
Unfortunately there is nothing suspicious in there. All I have is the
attached dmesg (with IP and MAC addresses replaced by '*'s). I have not
appended the crash dump gathered via netconsole, which didn't make it to
the affected system's disk (see my previous mail for it).
Regards,
Bruno
> Thanks again for your time.
>
> -Ben
>
>
> <--snip snip structure dump from pahole-->
> struct bnx2_napi {
>     struct napi_struct         napi;              /*     0    96 */
>     /* --- cacheline 1 boundary (64 bytes) was 32 bytes ago --- */
>     struct bnx2 *              bp;                /*    96     8 */
>     union {
>         struct status_block *      msi;           /*           8 */
>         struct status_block_msix * msix;          /*           8 */
>     } status_blk;                                 /*   104     8 */
>     u16 *                      hw_tx_cons_ptr;    /*   112     8 */
>     u16 *                      hw_rx_cons_ptr;    /*   120     8 */
>     /* --- cacheline 2 boundary (128 bytes) --- */
>     u32                        last_status_idx;   /*   128     4 */
>     u32                        int_num;           /*   132     4 */
>     struct bnx2_rx_ring_info   rx_ring;           /*   136   360 */
>     /* --- cacheline 7 boundary (448 bytes) was 48 bytes ago --- */
>     struct bnx2_tx_ring_info   tx_ring;           /*   496    48 */
>     /* --- cacheline 8 boundary (512 bytes) was 32 bytes ago --- */
>
>     /* size: 576, cachelines: 9 */
>     /* padding: 32 */
> };
> <--snip snip-->
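To cross-check the 0x70 offset against the dump above without the kernel headers, the layout can be mirrored in a small stand-alone struct. This is only a sketch: the struct name is hypothetical, napi is treated as an opaque 96-byte blob, and pointers are modeled as uint64_t on the assumption that the dump comes from an x86_64 build.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical mirror of the dumped layout, not the driver's definition.
 * Pointers are modeled as uint64_t so the offsets match the x86_64 dump
 * regardless of the host the sketch is compiled on.
 */
struct bnx2_napi_layout {
    unsigned char napi[96];        /* offset   0, size 96 */
    uint64_t      bp;              /* offset  96 (0x60)   */
    uint64_t      status_blk;      /* offset 104 (0x68), union of two pointers */
    uint64_t      hw_tx_cons_ptr;  /* offset 112 (0x70)   */
    uint64_t      hw_rx_cons_ptr;  /* offset 120 (0x78)   */
    uint32_t      last_status_idx; /* offset 128 (0x80)   */
    uint32_t      int_num;         /* offset 132 (0x84)   */
};

/* Compile-time checks: the load from 0x70(%rdx) hits hw_tx_cons_ptr, not bp. */
_Static_assert(offsetof(struct bnx2_napi_layout, bp) == 0x60,
               "bp is expected at 0x60");
_Static_assert(offsetof(struct bnx2_napi_layout, hw_tx_cons_ptr) == 0x70,
               "hw_tx_cons_ptr is expected at 0x70");
```

With real debug info available, running pahole on the module built with '-g' would be the authoritative check; the sketch only confirms the arithmetic in the dump.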
>
> On Mon, 2009-12-28 at 23:49 -0800, Bruno Prémont wrote:
> > On a system that had been running 2.6.31 since last September I got
> > two crashes this December at night (cause unknown). Yesterday, after
> > the second crash, I updated the kernel to 2.6.31.9 and enabled
> > netconsole in the hope of getting some information about the cause
> > of the crash.
> >
> > Today the system crashed once again, and all I got is the following
> > incomplete trace on the receiving side of netconsole:
> >
> > [24701.841185] BUG: unable to handle kernel NULL pointer dereference at (null)
> > [24701.841188] IP: [<ffffffffa00610fc>] bnx2_poll_work+0x2c/0x12d0 [bnx2]
> > [24701.841197] PGD 16509067 PUD 4e776067 PMD 0
> > [24701.841199] Oops: 0000 [#1] SMP
> > [24701.841202] last sysfs file: /sys/kernel/uevent_seqnum
> > [24701.841204] CPU 0
> > [24701.841205] Modules linked in: ipmi_devintf squashfs ext2 zlib_inflate netconsole configfs loop dm_round_robin scsi_dh_rdac dm_multipath scsi_dh dm_mod sg sr_mod cdrom ata_piix ipmi_si ipmi_msghandler qla2xxx ahci bnx2 hpwdt uhci_hcd ehci_hcd libata
> > [24701.841218] Pid: 11273, comm: php-cgi Not tainted 2.6.31.9-x86_64 #1 ProLiant DL360 G5
> > [24701.841220] RIP: 0010:[<ffffffffa00610fc>] [<ffffffffa00610fc>] bnx2_poll_work+0x2c/0x12d0 [bnx2]
> >
> >
> > Running objdump on the bnx2.ko module I get the following:
> > 000000000000a0d0 <bnx2_poll_work>:
> > a0d0: 41 57 push %r15
> > a0d2: 41 56 push %r14
> > a0d4: 41 55 push %r13
> > a0d6: 41 54 push %r12
> > a0d8: 55 push %rbp
> > a0d9: 53 push %rbx
> > a0da: 48 81 ec 28 01 00 00 sub $0x128,%rsp
> > a0e1: 48 89 7c 24 18 mov %rdi,0x18(%rsp)
> > a0e6: 48 89 74 24 10 mov %rsi,0x10(%rsp)
> > a0eb: 89 54 24 0c mov %edx,0xc(%rsp)
> > a0ef: 89 4c 24 08 mov %ecx,0x8(%rsp)
> > a0f3: 48 8b 54 24 10 mov 0x10(%rsp),%rdx
> > a0f8: 48 8b 42 70 mov 0x70(%rdx),%rax
> > a0fc: 0f b7 10 movzwl (%rax),%edx
> > a0ff: 31 c0 xor %eax,%eax
> > a101: 48 8b 4c 24 10 mov 0x10(%rsp),%rcx
> > a106: 80 fa ff cmp $0xff,%dl
> > a109: 0f 94 c0 sete %al
> > a10c: 01 c2 add %eax,%edx
> > a10e: 66 39 91 1a 02 00 00 cmp %dx,0x21a(%rcx)
> > a115: 0f 84 78 01 00 00 je a293 <bnx2_poll_work+0x1c3>
> > a11b: 48 8b 57 08 mov 0x8(%rdi),%rdx
> > a11f: 48 89 f8 mov %rdi,%rax
> > a122: 48 8b 9a 00 03 00 00 mov 0x300(%rdx),%rbx
> > a129: 48 83 c0 40 add $0x40,%rax
> > a12d: 48 29 c1 sub %rax,%rcx
> > a130: 48 89 c8 mov %rcx,%rax
> > a133: 48 c1 f8 06 sar $0x6,%rax
> > a137: 69 c0 39 8e e3 38 imul $0x38e38e39,%eax,%eax
> > a13d: 48 c1 e0 07 shl $0x7,%rax
> > a141: 48 01 d8 add %rbx,%rax
> > a144: 48 89 44 24 20 mov %rax,0x20(%rsp)
> > a149: 48 8b 7c 24 10 mov 0x10(%rsp),%rdi
> > a14e: 48 8b 47 70 mov 0x70(%rdi),%rax
> > a152: 44 0f b7 30 movzwl (%rax),%r14d
> > a156: 31 c0 xor %eax,%eax
> > a158: 0f b7 9f 18 02 00 00 movzwl 0x218(%rdi),%ebx
> > a15f: 41 80 fe ff cmp $0xff,%r14b
> > a163: 0f 94 c0 sete %al
> > a166: 45 31 ff xor %r15d,%r15d
> > a169: 41 01 c6 add %eax,%r14d
> > a16c: 66 44 39 f3 cmp %r14w,%bx
> > a170: 0f 84 ee 00 00 00 je a264 <bnx2_poll_work+0x194>
> > a176: 66 2e 0f 1f 84 00 00 nopw %cs:0x0(%rax,%rax,1)
> > a17d: 00 00 00
> > a180: 0f b6 cb movzbl %bl,%ecx
> > a183: 48 8b 44 24 10 mov 0x10(%rsp),%rax
> > a188: 44 0f b7 e1 movzwl %cx,%r12d
> > a18c: 49 c1 e4 04 shl $0x4,%r12
> > a190: 4c 03 a0 10 02 00 00 add 0x210(%rax),%r12
> > a197: 4d 8b 2c 24 mov (%r12),%r13
> > a19b: 66 41 83 7c 24 08 00 cmpw $0x0,0x8(%r12)
> > a1a2: 41 0f 18 8d bc 00 00 prefetcht0 0xbc(%r13)
> > a1a9: 00
> > ...
> >
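For readers matching the oops to the listing: the trace reports the fault at bnx2_poll_work+0x2c, and the listing places bnx2_poll_work at a0d0, so the faulting instruction sits at a0fc. The small sketch below just spells out that arithmetic. The function names are hypothetical, and the second helper assumes the objdump offsets are relative to the start of the section the module text was loaded into, which is not confirmed anywhere in this thread.

```c
#include <stdint.h>

/* Offset of the faulting instruction inside the objdump listing. */
static uint64_t fault_offset_in_listing(uint64_t symbol_start,
                                        uint64_t offset_in_symbol)
{
    return symbol_start + offset_in_symbol;
}

/*
 * Load address implied by the runtime RIP.
 * Assumption: listing offsets are relative to the loaded section start.
 */
static uint64_t implied_load_base(uint64_t runtime_ip, uint64_t listing_offset)
{
    return runtime_ip - listing_offset;
}
```

Calling fault_offset_in_listing(0xa0d0, 0x2c) gives 0xa0fc, which is the `movzwl (%rax),%edx` line in the listing. Under the stated assumption, implied_load_base(0xffffffffa00610fc, 0xa0fc) gives 0xffffffffa0057000.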
> >
> > Kernel is compiled on Gentoo (64bit):
> > Linux version 2.6.31.9-x86_64 () (gcc version 4.3.4 (Gentoo 4.3.4 p1.0, pie-10.1.5) ) #1 SMP Mon Dec 28 15:49:16 CET 2009
> >
> > The affected server (HP DL360 G5) is running OpenSuSE-11.1 with a
> > 32bit userspace.
> >
> > Any idea if there is a recent patch that could fix this issue? At
> > the time of the crash the server was not particularly loaded and
> > was handling around 200 packets/s of network traffic.
> >
> > Regards,
> > Bruno
View attachment "dmesg" of type "text/plain" (50098 bytes)
View attachment ".config" of type "text/plain" (51368 bytes)