linux-kernel - Re: BNX2: Kernel crashes with 2.6.31 and 2.6.31.9

Message-ID: <20100223131508.4c6cb866@neptune.home>
Date:	Tue, 23 Feb 2010 13:15:08 +0100
From:	Bruno Prémont <bonbons@...ux-vserver.org>
To:	"Benjamin Li" <benli@...adcom.com>
Cc:	NetDEV <netdev@...r.kernel.org>,
	"Michael Chan" <mchan@...adcom.com>,
	Linux-Kernel <linux-kernel@...r.kernel.org>
Subject: Re: BNX2: Kernel crashes with 2.6.31 and 2.6.31.9

Hi Benjamin,

On Fri, 19 February 2010 "Benjamin Li" <benli@...adcom.com> wrote:
> From your logs it looks like the device came up using MSI, but the
> MSI-X poll routine was being called:
> 
> [    9.836673] bnx2: eth0: using MSI
> ...
> 
> [  134.643459]  [<ffffffffa004019e>] bnx2_poll_msix+0x3e/0xd0 [bnx2]
> [  134.643465]  [<ffffffff8135bcd1>] netpoll_poll+0xe1/0x3c0
> 
> which is incorrect.  If we are in MSI mode, the bnx2_poll() routine
> should be used.
> 
> I think what is going on here is that during the bnx2 driver
> initialization the current bnx2 driver adds all possible NAPI
> structures that map to all the hardware vectors (BNX2_MAX_MSIX_VEC=9)
> to the NAPI list in the net_device structure regardless if they are
> used or not (Seen in drivers/net/bnx2.c:bnx2_init_napi()).  This can
> cause uninitialized NAPI structures to be placed on the napi_list.
> Because this device is in MSI mode, only 1 vector is initialized.
> Now, the problem is triggered when net/core/netpoll.c:poll_napi() is
> called. This is because this routine will run through the entire
> napi_list calling all the poll routines.  In your particular case, it
> is calling the poll routine on an uninitialized vector causing the
> kernel panic.
> 
> Please try the patch below to see if it solves your problem.  Note,
> this has only been compile tested and tested against basic traffic
> runs. Unfortunately, I could not reproduce the kernel panic with the
> instructions below to verify the patch.
> 
> Thanks again for all your help in helping us track this down.
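
For readers following the thread, the failure mode described in the quoted
analysis can be sketched in userspace C. This is only a simplified model,
not driver code: sim_napi, sim_dev, sim_open() and sim_poll_napi() are
hypothetical names, the poll routine returns -1 where the real driver
would touch uninitialized state, and registering only the vectors in use
is an assumption about the kind of fix, not a description of the actual
patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define SIM_MAX_MSIX_VEC 9  /* mirrors BNX2_MAX_MSIX_VEC=9 from the quoted mail */

/* Simplified stand-in for a NAPI context (not the kernel's napi_struct). */
struct sim_napi {
    bool ring_ready;                      /* true once this vector is set up */
    int (*poll)(struct sim_napi *napi);
};

/* Simplified stand-in for the device and its list of registered contexts. */
struct sim_dev {
    struct sim_napi vec[SIM_MAX_MSIX_VEC];
    struct sim_napi *napi_list[SIM_MAX_MSIX_VEC];
    size_t napi_count;
};

/* Returns 0 for a prepared vector, -1 where the real driver would crash. */
static int sim_poll(struct sim_napi *napi)
{
    return napi->ring_ready ? 0 : -1;
}

/*
 * Bring the device up with nvecs vectors in use (1 models MSI mode).
 * register_all = true models the behaviour described in the mail: every
 * possible vector is put on the list whether or not it was initialized.
 */
static void sim_open(struct sim_dev *dev, size_t nvecs, bool register_all)
{
    size_t i, nreg;

    assert(nvecs >= 1 && nvecs <= SIM_MAX_MSIX_VEC);
    memset(dev, 0, sizeof(*dev));

    for (i = 0; i < nvecs; i++)
        dev->vec[i].ring_ready = true;

    nreg = register_all ? SIM_MAX_MSIX_VEC : nvecs;
    for (i = 0; i < nreg; i++) {
        dev->vec[i].poll = sim_poll;
        dev->napi_list[dev->napi_count++] = &dev->vec[i];
    }
}

/* Walk the whole list, as the netpoll path is described as doing, and
 * count how many polls landed on an unprepared context. */
static size_t sim_poll_napi(struct sim_dev *dev)
{
    size_t i, bad = 0;

    for (i = 0; i < dev->napi_count; i++) {
        struct sim_napi *napi = dev->napi_list[i];

        if (napi->poll(napi) != 0)
            bad++;
    }
    return bad;
}
```

With one vector in use, sim_open(&dev, 1, true) followed by
sim_poll_napi(&dev) reports 8 polls on unprepared contexts, while
sim_open(&dev, 1, false) reports none.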

I applied the patch today and tried to reproduce with my showcases.

Seems that it's harder to trigger now but I still end up being able to
crash the box. Don't know if it's the same cause or not (could also
be the tcp-retransmit ghost)...

This time I had to run a few parallel scp's (8Mb/s each) to the box and
'echo t > /proc/sysrq-trigger' multiple times via ssh session for it to
happen. It didn't trigger with my netbomb, though I will try some more
and see.

I don't know if it's the same reason or not; hopefully something
reached disk, as the serial console is dead and pings are no longer
answered.
It's probably some printk/bug/warn that triggers in network stack and
deadlocks with netconsole.

Regards,
Bruno