netdev - Re: tg3: Occassional death on 3.3

Message-ID: <CAKb7Uvj4VpW+TYFKdK8174k5QbECBPjMZg3njgnMZHAcRJ2cWw@mail.gmail.com>
Date:	Wed, 18 Apr 2012 13:30:01 -0400
From:	Ilia Mirkin <imirkin@...m.mit.edu>
To:	Matt Carlson <mcarlson@...adcom.com>
Cc:	Michael Chan <mchan@...adcom.com>, netdev@...r.kernel.org
Subject: Re: tg3: Occassional death on 3.3

On Wed, Apr 18, 2012 at 1:19 PM, Matt Carlson <mcarlson@...adcom.com> wrote:
> On Tue, Apr 17, 2012 at 05:55:11PM -0700, Matt Carlson wrote:
>> On Tue, Apr 17, 2012 at 08:22:31PM -0400, Ilia Mirkin wrote:
>> > Hello,
>> >
>> > I'm observing an issue where it appears that tg3 gets wedged into a
>> > bad state every so often, and never recovers. Doing a sequence of
>> >
>> > ifconfig eth0 down
>> > rmmod broadcom
>> > rmmod tg3
>> >
>> > modprobe broadcom
>> > modprobe tg3
>> >
>> > Makes everything work again. The card I have:
>> >
>> > 03:00.0 Ethernet controller [0200]: Broadcom Corporation NetLink BCM57788 Gigabit Ethernet PCIe [14e4:1691] (rev 01)
>> >         Subsystem: Dell XPS 8300 [1028:04aa]
>> >         Flags: bus master, fast devsel, latency 0, IRQ 45
>> >         Memory at fb100000 (64-bit, non-prefetchable) [size=64K]
>> >         Capabilities: [48] Power Management version 3
>> >         Capabilities: [60] Vendor Specific Information: Len=6c <?>
>> >         Capabilities: [50] MSI: Enable+ Count=1/1 Maskable- 64bit+
>> >         Capabilities: [cc] Express Endpoint, MSI 00
>> >         Capabilities: [100] Advanced Error Reporting
>> >         Capabilities: [13c] Virtual Channel
>> >         Capabilities: [160] Device Serial Number [the mac address]
>> >         Capabilities: [16c] Power Budgeting <?>
>> >         Kernel driver in use: tg3
>> >         Kernel modules: tg3
>> >
>> > I'm attaching the log for the full data dump, since my mailer will
>> > wrap it horribly otherwise, but the interesting lines are:
>> >
>> > [2234380.228971] tg3 0000:03:00.0: eth0: transmit timed out, resetting
>> > ... some kind of data dump ...
>> > [2234382.216682] tg3 0000:03:00.0: eth0: 0: Host status block [00000001:00000080:(0000:01ef:0000):(01ef:00b2)]
>> > [2234382.216685] tg3 0000:03:00.0: eth0: 0: NAPI info [00000080:00000080:(00cb:00b2:01ff):01ef:(00b7:0000:0000:0000)]
>> > [2234382.319550] tg3 0000:03:00.0: tg3_stop_block timed out, ofs=1400 enable_bit=2
>> > [2234382.421931] tg3 0000:03:00.0: tg3_stop_block timed out, ofs=c00 enable_bit=2
>> > [2234382.427199] tg3 0000:03:00.0: eth0: Link is down
>> > [2234382.438168] tg3 0000:03:00.0: eth0: Link is down
>> >
>> > Any further attempts to use the NIC, like ifconfig down/up, result in a
>> > similar error log sequence. Also, while it's happening, the
>> > computer feels extremely laggy for a short period of time (~1s),
>> > leading me to believe there's an uninterruptible sleep of some kind
>> > going on.
>> >
>> > This has happened twice, and at least the second time, there was no
>> > unusual traffic on the network. It's linked at 100M, and was probably
>> > doing 10K/s at most when the error happened. If this is insufficient
>> > information, please let me know what I should collect next time this
>> > might happen.
>> >
>> > Thanks,
>>
>> Thanks for the report Ilia.  I'm guessing the lag is coming from the
>> driver as it logs the register dump to the syslog.
>>
>> The register dump shows that interrupts are enabled, but the tag value in
>> the interrupt mailbox register doesn't match the value in the status block
>> or the driver's private tag value.  I'll see where that leads me.
>
> O.K.  The interrupt mailbox issue I mentioned above can happen when in
> one-shot MSI mode, but it still smells strange.  If the tag stored in
> the device structure matches the tag value in the status block, there
> should be no more work to do.  In that case, the driver writes the new
> tag to the interrupt mailbox register.
>
> You may be encountering some type of hardware problem.  PCIe register
> 0x104 is the Uncorrectable Error Status register.  This register shows
> that the "Unsupported Request Error Status" bit is set.  This definitely
> shouldn't happen.
>
> Do you know if there are any firmware updates available for your system?
>

I just checked on Dell's website, and I'm pretty sure I have the
latest BIOS. I'll double-check later, but the website says version
A06, as does my dmidecode. So unless they have multiple releases with
the same version number... :) In case it's of interest, the system has
an i7-2600 CPU and 18G of RAM (memtested).

Are there some PCI(e) debug options I should be turning on to perhaps
get a better picture of what's going on?
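
In case it helps with comparing dumps, here is a rough sketch of how the
Uncorrectable Error Status value at 0x104 could be decoded (the AER
capability is at 0x100 on this card, and the status register is at offset
0x04 within it). The bit masks follow the PCI_ERR_UNC_* definitions in the
kernel's pci_regs.h as far as I can tell. The function name and the sample
value are made up for illustration; the sample is not copied from my
register dump.

```python
# Bit masks for the AER Uncorrectable Error Status register
# (AER capability offset 0x04, i.e. config offset 0x104 on this device).
# Masks are assumed to match PCI_ERR_UNC_* in the kernel's pci_regs.h.
UNCOR_BITS = {
    0x00000010: "Data Link Protocol Error",
    0x00000020: "Surprise Down Error",
    0x00001000: "Poisoned TLP",
    0x00002000: "Flow Control Protocol Error",
    0x00004000: "Completion Timeout",
    0x00008000: "Completer Abort",
    0x00010000: "Unexpected Completion",
    0x00020000: "Receiver Overflow",
    0x00040000: "Malformed TLP",
    0x00080000: "ECRC Error",
    0x00100000: "Unsupported Request Error",
    0x00200000: "ACS Violation",
}


def decode_uncor_status(value):
    """Return the names of all known error bits set in value, lowest bit first."""
    return [name for mask, name in sorted(UNCOR_BITS.items()) if value & mask]


if __name__ == "__main__":
    # Hypothetical sample: only bit 20 (Unsupported Request) is set.
    for name in decode_uncor_status(0x00100000):
        print(name)
```

The raw value can be read as root with something like
"setpci -s 03:00.0 104.l" and passed to the function as an integer.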