linux-kernel - Re: Strange errors from e1000 driver (2.6.18)

Message-ID: <4807377b0610221515m42e638c3pfb461fbb7133845e@mail.gmail.com>
Date:	Sun, 22 Oct 2006 15:15:27 -0700
From:	"Jesse Brandeburg" <jesse.brandeburg@...il.com>
To:	"Martin J. Bligh" <mbligh@...gle.com>
Cc:	"Martin J. Bligh" <mbligh@...igh.org>,
	"Linux Kernel Mailing List" <linux-kernel@...r.kernel.org>,
	netdev@...r.kernel.org
Subject: Re: Strange errors from e1000 driver (2.6.18)

Analysis follows, but first I wanted to ask you to bisect back, if you
can, to find the patch that made the difference.  Basically, at this
point I'd say it's not likely to be an e1000 issue, but I'd like to
follow up and make sure.

On 10/22/06, Martin J. Bligh <mbligh@...gle.com> wrote:
> 0000:00:0a.0 Ethernet controller: Intel Corporation 82546EB Gigabit Ethernet Controller (Copper) (rev 01)
>          Latency: 32 (63750ns min), Cache Line Size: 0x08 (32 bytes)
>          Interrupt: pin A routed to IRQ 5
>          Region 0: Memory at ef020000 (64-bit, non-prefetchable) [size=128K]
>          Region 4: I/O ports at a000 [size=64]
>          Capabilities: [dc] Power Management version 2
>                  Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
>                  Status: D0 PME-Enable- DSel=0 DScale=1 PME-
>          Capabilities: [e4]      Capabilities: [f0] Message Signalled Interrupts: 64bit+ Queue=0/0 Enable-
>                  Address: 0000000000000000  Data: 0000
>
> 0000:00:0a.1 Ethernet controller: Intel Corporation 82546EB Gigabit Ethernet Controller (Copper) (rev 01)
>          Subsystem: Intel Corporation PRO/1000 MT Dual Port Server Adapter
>          Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
>          Status: Cap+ 66MHz+ UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
>          Latency: 32 (63750ns min), Cache Line Size: 0x08 (32 bytes)
>          Interrupt: pin B routed to IRQ 11
>          Region 0: Memory at ef000000 (64-bit, non-prefetchable) [size=128K]
>          Region 4: I/O ports at a400 [size=64]
>          Capabilities: [dc] Power Management version 2
>                  Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
>                  Status: D0 PME-Enable- DSel=0 DScale=1 PME-
>          Capabilities: [e4]      Capabilities: [f0] Message Signalled Interrupts: 64bit+ Queue=0/0 Enable-
>                  Address: 0000000000000000  Data: 0000

Nothing seems out of order, but the latency may be low; I'd be curious
what these looked like before, with the old kernel.  Another thing to
compare would be the lspci -vv output for your chipset under the old
and new kernels, in case the bridge/system configuration changed.
There are no known problems right now with this chipset (82546EB).
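
For anyone who wants to do that old/new comparison mechanically, here is
a minimal sketch in Python.  It assumes the two `lspci -vv` dumps have
already been saved as text; the function name and the sample "old" value
are hypothetical and not taken from this report.

```python
import difflib


def diff_lspci(old_text, new_text):
    """Return unified-diff lines between two saved `lspci -vv` dumps."""
    return list(difflib.unified_diff(
        old_text.splitlines(),
        new_text.splitlines(),
        fromfile="lspci-old-kernel",
        tofile="lspci-new-kernel",
        lineterm="",
    ))


if __name__ == "__main__":
    # Hypothetical sample: the old-kernel latency value is made up.
    old = "Latency: 64 (63750ns min)\nInterrupt: pin A routed to IRQ 5\n"
    new = "Latency: 32 (63750ns min)\nInterrupt: pin A routed to IRQ 5\n"
    for line in diff_lspci(old, new):
        print(line)
```

Any line starting with a single `-` or `+` marks a setting that differs
between the two boots, which is the part worth posting.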

> > cat /proc/interrupts,
>
>             CPU0
>    5:    1975991          XT-PIC  ehci_hcd:usb2, VIA8237, eth0
> NMI:          0
> LOC:  146264664
> ERR:      52805

Shared interrupt is fine, but what's with the ERR: count?
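
To watch whether that ERR count keeps climbing, something like the
following rough sketch could be used.  It assumes a single-CPU
/proc/interrupts layout like the one quoted above (one count column);
the function names are hypothetical.

```python
def parse_interrupts(text):
    """Map each row label (e.g. "5", "ERR") to (count, description).

    Assumes one CPU column, as in the output quoted above.
    """
    rows = {}
    for line in text.splitlines():
        label, sep, rest = line.partition(":")
        fields = rest.split(None, 1)
        # Skip the "CPU0" header and anything without a numeric count.
        if not sep or not fields or not fields[0].isdigit():
            continue
        desc = fields[1].strip() if len(fields) > 1 else ""
        rows[label.strip()] = (int(fields[0]), desc)
    return rows


def devices_on_irq(desc):
    """Split "XT-PIC  a, b, c" into the list of handlers sharing the line."""
    parts = desc.split(None, 1)
    if len(parts) < 2:
        return []
    return [d.strip() for d in parts[1].split(",")]


if __name__ == "__main__":
    sample = """\
           CPU0
  5:    1975991          XT-PIC  ehci_hcd:usb2, VIA8237, eth0
NMI:          0
LOC:  146264664
ERR:      52805
"""
    rows = parse_interrupts(sample)
    print("ERR count:", rows["ERR"][0])
    print("sharing IRQ 5:", devices_on_irq(rows["5"][1]))
```

Running it twice a few seconds apart and comparing the ERR values shows
whether the errors are still accumulating or are left over from boot.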

> > dmesg
>
> Did that bit already.

Except you didn't include any of the e1000 load information, nor the
system's boot messages as it came up.

> > ethtool -e eth0,
>
> root@...us:/usr/local/autotest/bin # ethtool -e eth0
> Offset          Values
> ------          ------
> 0x0000          00 07 e9 09 0b 08 30 05 ff ff ff ff ff ff ff ff
> 0x0010          44 a9 03 98 0b 46 11 10 86 80 10 10 86 80 68 34
> 0x0020          0c 00 10 10 00 00 02 21 c8 18 ff ff ff ff ff ff
> 0x0030          ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
> 0x0040          0c c3 61 78 04 50 02 21 c8 08 ff ff ff ff ff ff
> 0x0050          ff ff ff ff ff ff ff ff ff ff ff ff ff ff 02 06
> 0x0060          2c 00 00 40 07 11 00 00 2c 00 00 40 ff ff ff ff

Nothing out of order here that I can see immediately.
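
One quick sanity check on a dump like this is to read the MAC address
back out of it.  The sketch below assumes the usual e1000 layout, in
which the first three EEPROM words (the first six bytes printed by
`ethtool -e`) hold the MAC, and that the dump starts at offset 0; the
function names are hypothetical.

```python
def parse_ethtool_dump(text):
    """Collect the data bytes of an `ethtool -e` hex dump, in order."""
    data = bytearray()
    for line in text.splitlines():
        fields = line.split()
        # Only rows that begin with an offset such as 0x0010 carry data.
        if not fields or not fields[0].lower().startswith("0x"):
            continue
        data.extend(int(b, 16) for b in fields[1:])
    return bytes(data)


def mac_from_eeprom(data):
    """Format the first six EEPROM bytes as a MAC address (assumed layout)."""
    return ":".join(f"{b:02x}" for b in data[:6])


if __name__ == "__main__":
    dump = """\
Offset          Values
------          ------
0x0000          00 07 e9 09 0b 08 30 05 ff ff ff ff ff ff ff ff
0x0010          44 a9 03 98 0b 46 11 10 86 80 10 10 86 80 68 34
"""
    print(mac_from_eeprom(parse_ethtool_dump(dump)))
```

If the address printed here does not match what `ifconfig` reports for
the interface, or comes back as all ff, the EEPROM contents would be
worth a closer look.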

> > and maybe output of
> > dmidecode, etc.
>
> Attached.
>
> > only a little.  There are so many different pieces of e1000 hardware
> > and so few specifics in this report that I'll be able to tell you lots
> > more when you get us the info requested.
>
> Thanks. Not sure if the bug wasn't there in earlier kernels, or if we
> just weren't printing anything.

I think it may not have been in earlier kernels, but I also don't
think this is an e1000 problem, at least initially.

<snip>
> BIOS Information
>         Vendor: Phoenix Technologies, LTD
>         Version: 6.00 PG
>         Release Date: 09/13/2004
>         Address: 0xE0000
>         Runtime Size: 128 kB
>         ROM Size: 256 kB
<snip>

> Handle 0x0001, DMI type 1, 25 bytes.
> System Information
>         Manufacturer: VIA Technologies, Inc.
>         Product Name: KT600-8237
>
> Handle 0x0002, DMI type 2, 8 bytes.
> Base Board Information
>         Manufacturer:
>         Product Name: KT600-8237
>         Version:
>         Serial Number:

This chipset is one of the most common elements in problem reports of
TX hangs for e1000.  My current theory (we've bought a bunch of these
systems and never reproduced the issue) is that there is something
either design-specific or BIOS-specific that causes this chipset to
interact very badly with e1000 hardware.  Some systems have the issue
and some don't.  If you could bisect back to a working point, it would
be interesting to see where that pointed.
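
For reference, the idea behind bisecting can be sketched as a plain
binary search.  This is a simplification that assumes a linear list of
revisions with a known-good first entry and a known-bad last entry; real
kernel history is not linear, and `first_bad` and `is_bad` are
hypothetical names (in practice `is_bad` means "build, boot, and check
whether the errors show up").

```python
def first_bad(revisions, is_bad):
    """Return the first bad revision.

    revisions: ordered oldest to newest; revisions[0] must be known good
    and revisions[-1] known bad.
    is_bad: callable that tests one revision and returns True if broken.
    """
    lo, hi = 0, len(revisions) - 1  # lo is always good, hi is always bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(revisions[mid]):
            hi = mid
        else:
            lo = mid
    return revisions[hi]


if __name__ == "__main__":
    # Hypothetical history of 100 revisions where number 37 broke things.
    tested = []

    def is_bad(rev):
        tested.append(rev)
        return rev >= 37

    print("first bad revision:", first_bad(list(range(100)), is_bad))
    print("revisions tested:", len(tested))
```

The point is that each test halves the remaining range, so even a long
stretch of history needs only a handful of build-and-boot cycles.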

> Handle 0x0004, DMI type 4, 32 bytes.
> Processor Information
>         Socket Designation: Socket A
>         Type: Central Processor
>         Family: Athlon XP
>         Manufacturer: AMD
>         ID: A0 06 00 00 FF FB 83 03
>         Signature: Family 6, Model A, Stepping 0
>         Version: AMD Athlon(tm) XP
>         Voltage: 1.5 V
>         External Clock: 100 MHz
>         Max Speed: 3000 MHz
>         Current Speed: 1100 MHz

Doesn't seem like you're overclocked.  Good.