netdev - weird bug with 2.6.28.9/10 tg3 and bcm5722 nic's


Message-ID: <alpine.LRH.2.00.0906090954060.6437@ionlinux.tower-research.com>
Date:	Tue, 9 Jun 2009 10:29:21 -0400 (EDT)
From:	Ion Badulescu <ionut@...ula.org>
To:	netdev@...r.kernel.org
Subject: weird bug with 2.6.28.9/10 tg3 and bcm5722 nic's

Hi,

I'm hitting the weirdest bug with 2.6.28.9/10 and bcm5722 interfaces on 
a bunch of servers we use here. The interfaces are normally stable and 
working well. However, if hald gets restarted, the interfaces stop 
receiving packets in a matter of seconds, and require a down/up to get 
them back to working.

When this happens, the rx_discard interface stat gets incremented, and I'm 
pretty sure interrupts stop occurring for the interface, at least for rx.
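
One way to check the "interrupts stop occurring" suspicion is to take two snapshots of /proc/interrupts a few seconds apart and compare the summed per-CPU counts for the interface. The following is only a minimal sketch: the function names parse_irq_total and irq_stalled are made up for illustration, and it assumes the interface name is the last field on its line (so it will not match a shared IRQ line that lists several devices).

```python
def parse_irq_total(text, ifname):
    """Sum the per-CPU interrupt counts for the line ending in ifname.

    Returns None if no line for ifname is found.
    Assumes the /proc/interrupts layout shown in this message:
    "<irq>: <count per CPU>... <type> <device>".
    """
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 3 or fields[-1] != ifname:
            continue
        total = 0
        # fields[0] is the IRQ number ("758:"); counts follow until
        # the first non-numeric field (the interrupt type).
        for field in fields[1:]:
            if not field.isdigit():
                break
            total += int(field)
        return total
    return None


def irq_stalled(before, after, ifname):
    """True if the interrupt total did not change between two snapshots."""
    b = parse_irq_total(before, ifname)
    a = parse_irq_total(after, ifname)
    if a is None or b is None:
        raise ValueError("no /proc/interrupts line found for " + ifname)
    return a == b
```

On a live system, before and after would be the contents of /proc/interrupts read a few seconds apart while traffic is being sent to the interface. An unchanged total under load would support the theory, although an idle link would look the same.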

A little background:

-- we're running a RHEL4 distribution with the minimal changes required to 
use newer kernels
-- hal version is 0.4.7 (from FC3 with a couple of fixes to prevent it 
from segfaulting on newer kernels)
-- the problem does not occur when running kernel 2.6.23.16
-- the problem occurs very predictably when running kernel 2.6.28.10 on 
the affected hardware
-- the affected hardware is the on-board interfaces on Dell R300 servers 
and HP ProLiant DL160 G5p servers:

Dell R300# ethtool -i eth1
driver: tg3
version: 3.94
firmware-version: 5722-v3.08, ASFIPMI v6.02
bus-info: 0000:02:00.0

Dell R300# cat /proc/interrupts | grep eth1
758:     953465      39500      41436      41224   PCI-MSI-edge      eth1

HP DL160# ethtool -i eth1
driver: tg3
version: 3.94
firmware-version: 5722-v3.07, ASFIPMI v6.02
bus-info: 0000:04:00.0

HP DL160# cat /proc/interrupts | grep eth1
760:       2133         19         16  430865463         15         15         14         14   PCI-MSI-edge      eth1

-- other hardware using the tg3 driver that's not affected includes 5721 
and 5751 interfaces, some of them on-board on Dell R200 and PE-860 servers, 
others on add-on NICs:

Dell PE-860# ethtool -i eth1
driver: tg3
version: 3.94
firmware-version: 5721-v3.61, ASFIPMI v6.21

Dell PE-860# cat /proc/interrupts | grep eth1
  17:    1416666          0   IO-APIC-fasteoi   eth1

Add-on NIC# ethtool -i eth2
driver: tg3
version: 3.94
firmware-version: 5751-v3.29a

Add-on NIC# cat /proc/interrupts | grep eth2
  16:       1592         20         21 2050734806   IO-APIC-fasteoi   eth2

Perhaps the other common ground here is that all affected interfaces use 
MSI interrupts, whereas all the not-affected interfaces use APIC 
interrupts, as seen above? But 2.6.23.16 uses the same kind of interrupts 
and doesn't have this problem...
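
To make the MSI vs. IO-APIC comparison repeatable across many hosts, the interrupt type can be pulled out of each /proc/interrupts line. This is a rough sketch only: irq_type and uses_msi are hypothetical helper names, and the parsing assumes the column layout seen in the output quoted above.

```python
def irq_type(line):
    """Return the interrupt type field (e.g. "PCI-MSI-edge") or None.

    Assumes the layout "<irq>: <count per CPU>... <type> <device>".
    """
    fields = line.split()
    i = 1  # skip the IRQ number in fields[0]
    while i < len(fields) and fields[i].isdigit():
        i += 1  # skip the per-CPU counters
    # A type field must be followed by at least a device name.
    if i >= len(fields) - 1:
        return None
    return fields[i]


def uses_msi(line):
    """True if the interrupt type on this line mentions MSI."""
    kind = irq_type(line)
    return kind is not None and "MSI" in kind


# Lines copied from the output quoted in this message.
samples = {
    "Dell R300 eth1": "758:     953465      39500      41436      41224   PCI-MSI-edge      eth1",
    "Dell PE-860 eth1": "  17:    1416666          0   IO-APIC-fasteoi   eth1",
    "Add-on NIC eth2": "  16:       1592         20         21 2050734806   IO-APIC-fasteoi   eth2",
}

for host, line in samples.items():
    print(host, irq_type(line), "MSI" if uses_msi(line) else "not MSI")
```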

As far as I can tell from a strace of 'service haldaemon restart', all 
hald is doing is reading from a bunch of files from /sys. However, I tried 
replicating those actions externally and the problem did not occur, so 
maybe something else is going on there.
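
For reference, "replicating those actions externally" could look roughly like the sketch below, which simply reads every regular file under a directory tree. It is an illustration under assumptions, not a record of what hald actually does: the helper name read_tree is made up, and which /sys paths hald touches would have to come from the strace output.

```python
import os


def read_tree(root, max_bytes=4096):
    """Read up to max_bytes from every file under root.

    Returns a dict mapping path -> bytes read, or None where the
    read failed (many sysfs attributes are write-only or return
    an error on read). Symlinked directories are not followed,
    which avoids looping through sysfs cross-links.
    """
    results = {}
    for dirpath, _dirnames, filenames in os.walk(root, followlinks=False):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, "rb") as f:
                    results[path] = f.read(max_bytes)
            except OSError:
                results[path] = None
    return results
```

A call such as read_tree("/sys/class/net/eth1") would exercise the attributes of one interface. The path here is only an example starting point.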

Anyway, I'm stumped. Any help or insight would be appreciated...

Thanks,
-Ion