Message-ID: <CAJtEV7ZtWVg7tLuznPmBce56ARLn7hzk6F6wwXnJ9vo-_AdLtQ@mail.gmail.com>
Date: Wed, 23 Jul 2014 12:59:00 +0800
From: Andrew Cooks <acooks@...il.com>
To: Alexander Duyck <alexander.h.duyck@...el.com>
Cc: "Fujinaka, Todd" <todd.fujinaka@...el.com>,
Dmitry Lifshitz <lifshitz@...pulab.co.il>,
netdev <netdev@...r.kernel.org>,
"e1000-devel@...ts.sf.net" <e1000-devel@...ts.sf.net>,
Igor Grinberg <grinberg@...pulab.co.il>,
Linux NICS <linux.nics@...el.com>
Subject: Re: [E1000-devel] [linux-nics] Problem: 82574L device (e1000e
driver): Reset adapter unexpectedly / transmit queue 0 timed out
On Tue, Jul 22, 2014 at 11:25 PM, Alexander Duyck
<alexander.h.duyck@...el.com> wrote:
>>>>> # lspci -vvnnk:
>>>>> 01:00.0 Ethernet controller [0200]: Intel Corporation 82574L Gigabit Network Connection [8086:10d3]
>>>>> Subsystem: Intel Corporation Device [8086:0000]
>>>>> Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
>>>>> Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
>>>>> Interrupt: pin A routed to IRQ 16
>>>>> Region 0: [virtual] Memory at c1900000 (32-bit, non-prefetchable) [size=128K]
>>>>> Region 1: [virtual] Memory at c1800000 (32-bit, non-prefetchable) [size=1M]
>>>>> Region 2: I/O ports at 7000 [size=32]
>>>>> Region 3: [virtual] Memory at c1920000 (32-bit, non-prefetchable) [size=16K]
>>>>> [virtual] Expansion ROM at c1940000 [disabled] [size=256K]
>>>>> Capabilities: [c8] Power Management version 2
>>>>> Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
>>>>> Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=1 PME-
>>>>> Capabilities: [d0] MSI: Enable- Count=1/1 Maskable- 64bit+
>>>>> Address: 0000000000000000 Data: 0000
>>>>> Capabilities: [e0] Express (v1) Endpoint, MSI 00
>>>>> DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <512ns, L1 <64us
>>>>> ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
>>>>> DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
>>>>> RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
>>>>> MaxPayload 128 bytes, MaxReadReq 512 bytes
>>>>> DevSta: CorrErr+ UncorrErr+ FatalErr- UnsuppReq+ AuxPwr+ TransPend-
>>>>> LnkCap: Port #1, Speed 2.5GT/s, Width x1, ASPM L0s L1, Latency L0 <128ns, L1 <64us
>>>>> ClockPM- Surprise- LLActRep- BwNot-
>>>>> LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk-
>>>>> ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
>>>>> LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
>>>>> Capabilities: [a0] MSI-X: Enable- Count=5 Masked-
>>>>> Vector table: BAR=3 offset=00000000
>>>>> PBA: BAR=3 offset=00002000
>>>>> Capabilities: [100 v1] Advanced Error Reporting
>>>>> UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq+ ACSViol-
>>>>> UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
>>>>> UESvrt: DLP+ SDES- TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
>>>>> CESta: RxErr+ BadTLP+ BadDLLP+ Rollover- Timeout- NonFatalErr+
>>>>> CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
>>>>> AERCap: First Error Pointer: 14, GenCap- CGenEn- ChkCap- ChkEn-
>>>>> Capabilities: [140 v1] Device Serial Number 00-01-c0-ff-ff-12-8a-64
>>>>> Kernel driver in use: e1000e
>>>>>
>>>>>
>
> It looks like something bad happened on the PCIe bus based on the RxErr,
> BadTLP, BadDLLP, and NonFatalERR indicators all being set. This could
> be an indication of a possible problem with the PCIe link on the system.
Thanks very much for explaining this. Is it correct to think that this
is likely to be a hardware problem?
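For anyone wanting to pull the same indicators out of lspci output, here is a minimal Python sketch. The helper name set_flags is made up for this example, and it assumes the line has already had the mail quoting (">>>>> ") stripped. It only lists which flags carry a "+"; it does not say anything about the cause.

```python
def set_flags(lspci_line):
    """Return the names of flags marked '+' on an lspci status line."""
    # Drop the leading label (e.g. "CESta:"), then keep tokens ending in '+'.
    _, _, rest = lspci_line.partition(":")
    return [tok[:-1] for tok in rest.split() if tok.endswith("+")]


if __name__ == "__main__":
    cesta = "CESta: RxErr+ BadTLP+ BadDLLP+ Rollover- Timeout- NonFatalErr+"
    print(set_flags(cesta))
```

Run against the CESta line quoted above, this reports RxErr, BadTLP, BadDLLP and NonFatalErr, which are the four indicators mentioned.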
>
>>>>> # ethtool -d eth2
>>>>> MAC Registers
>>>>> -------------
>>>>> 0x00000: CTRL (Device control register) 0xFFFFFFFF
>>>>> Endian mode (buffers): big
>>>>> Link reset: reset
>>>>> Set link up: 1
>>>>> Invert Loss-Of-Signal: yes
>>>>> Receive flow control: enabled
>>>>> Transmit flow control: enabled
>>>>> VLAN mode: enabled
>>>>> Auto speed detect: enabled
>>>>> Speed select: not used
>>>>> Force speed: yes
>>>>> Force duplex: yes
>>>>> 0x00008: STATUS (Device status register) 0xFFFFFFFF
>>>>> Duplex: full
>>>>> Link up: link config
>>>>> TBI mode: enabled
>>>>> Link speed: not used
>>>>> Bus type: PCI-X
>>>>> Bus speed: 133MHz
>>>>> Bus width: 64-bit
>>>>> 0x00100: RCTL (Receive control register) 0xFFFFFFFF
>>>>> Receiver: enabled
>>>>> Store bad packets: enabled
>>>>> Unicast promiscuous: enabled
>>>>> Multicast promiscuous: enabled
>>>>> Long packet: enabled
>>>>> Descriptor minimum threshold size: reserved
>>>>> Broadcast accept mode: accept
>>>>> VLAN filter: enabled
>>>>> Canonical form indicator: enabled
>>>>> Discard pause frames: ignored
>>>>> Pass MAC control frames: pass
>>>>> Receive buffer size: 4096
>>>>> 0x02808: RDLEN (Receive desc length) 0xFFFFFFFF
>>>>> 0x02810: RDH (Receive desc head) 0xFFFFFFFF
>>>>> 0x02818: RDT (Receive desc tail) 0xFFFFFFFF
>>>>> 0x02820: RDTR (Receive delay timer) 0xFFFFFFFF
>>>>> 0x00400: TCTL (Transmit ctrl register) 0xFFFFFFFF
>>>>> Transmitter: enabled
>>>>> Pad short packets: enabled
>>>>> Software XOFF Transmission: enabled
>>>>> Re-transmit on late collision: enabled
>>>>> 0x03808: TDLEN (Transmit desc length) 0xFFFFFFFF
>>>>> 0x03810: TDH (Transmit desc head) 0xFFFFFFFF
>>>>> 0x03818: TDT (Transmit desc tail) 0xFFFFFFFF
>>>>> 0x03820: TIDV (Transmit delay timer) 0xFFFFFFFF
>>>>> PHY type: unknown
>>>>>
>>>>>
>
> The device doesn't appear to be responding to MMIO reads based on the
> fact that all of the registers are returning all 1's.
Yes, this was the output of ethtool when the device became
non-functional. I wasn't sure whether it would be useful.
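The "all 1's" condition could be checked automatically with something like the sketch below. It is only an illustration: the function names are hypothetical, it expects the unquoted text of "ethtool -d", and it only looks at lines shaped like "0x00000: CTRL (Device control register) 0xFFFFFFFF". A single register could legitimately read 0xFFFFFFFF, so the check only fires when every parsed register does.

```python
import re

# Matches register lines such as:
#   0x00000: CTRL (Device control register) 0xFFFFFFFF
REG_LINE = re.compile(r"^(0x[0-9A-Fa-f]+):\s+(\w+)\s+\(.*\)\s+(0x[0-9A-Fa-f]+)$")


def register_values(dump_text):
    """Map register name -> integer value for each register line found."""
    values = {}
    for line in dump_text.splitlines():
        match = REG_LINE.match(line.strip())
        if match:
            values[match.group(2)] = int(match.group(3), 16)
    return values


def looks_unreachable(dump_text):
    """True if at least one register was parsed and all read 0xFFFFFFFF."""
    values = register_values(dump_text)
    return bool(values) and all(v == 0xFFFFFFFF for v in values.values())
```

The decoded sub-lines ("Endian mode (buffers): big" and so on) are ignored, since they are derived from the same bogus register values.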
> You should be able to recover from this error by issuing a PCIe device
> reset request via the sysfs interface (echo 1 >
> /sys/bus/pci/devices/0000\:01\:00.0/reset). However that only resolves
> the issue after it has occurred.
Thanks.
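For scripting the reset suggested above, a minimal Python sketch follows. The function names and the sysfs_root parameter are assumptions made for this example (the parameter only exists so the path logic can be exercised without touching a real device). Writing to the reset file needs root, and as noted it only helps after the failure has already happened.

```python
from pathlib import Path


def reset_path(bdf, sysfs_root="/sys"):
    """Build the sysfs reset file path for a PCI address like 0000:01:00.0."""
    return Path(sysfs_root) / "bus" / "pci" / "devices" / bdf / "reset"


def reset_device(bdf, sysfs_root="/sys"):
    """Write '1' to the device's reset file, as 'echo 1 > .../reset' would."""
    path = reset_path(bdf, sysfs_root)
    if not path.exists():
        # No reset file: wrong address, or the kernel offers no reset method.
        raise FileNotFoundError(f"no reset file at {path}")
    path.write_text("1\n")


if __name__ == "__main__":
    reset_device("0000:01:00.0")
```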
> One thing that would probably be useful would be to provide an "lspci
> -vvv" for the entire system. That would at least give us an idea of the
> PCIe hierarchy and could help to tell us if the problem is something in
> the local PCIe hierarchy for the device, or if the problem is closer to
> the root complex.
I've attached the complete lspci output because it's too large to
include inline. I hope that's OK. It shows some interesting
differences between device 01:00.0 (the one that errored) and the
other 82574L devices.
# lspci -tvvvnn
-[0000:00]-+-00.0 Advanced Micro Devices, Inc. [AMD] Family 14h Processor Root Complex [1022:1510]
           +-01.0 Advanced Micro Devices, Inc. [AMD/ATI] Wrestler [Radeon HD 6320] [1002:9806]
           +-01.1 Advanced Micro Devices, Inc. [AMD/ATI] Wrestler HDMI Audio [1002:1314]
           +-04.0-[01]----00.0 Intel Corporation 82574L Gigabit Network Connection [8086:10d3]
           +-05.0-[02]----00.0 Intel Corporation 82574L Gigabit Network Connection [8086:10d3]
           +-06.0-[03]----00.0 Intel Corporation 82574L Gigabit Network Connection [8086:10d3]
           +-07.0-[04]----00.0 Intel Corporation 82574L Gigabit Network Connection [8086:10d3]
           +-11.0 Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 SATA Controller [AHCI mode] [1002:4391]
           +-12.0 Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB OHCI0 Controller [1002:4397]
           +-12.2 Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB EHCI Controller [1002:4396]
           +-13.0 Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB OHCI0 Controller [1002:4397]
           +-13.2 Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB EHCI Controller [1002:4396]
           +-14.0 Advanced Micro Devices, Inc. [AMD/ATI] SBx00 SMBus Controller [1002:4385]
           +-14.3 Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 LPC host controller [1002:439d]
           +-14.4-[05]--
           +-14.5 Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB OHCI2 Controller [1002:4399]
           +-15.0-[06-07]----00.0 Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller [10ec:8168]
           +-15.1-[08]----00.0 Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller [10ec:8168]
           +-15.2-[09]----00.0 Realtek Semiconductor Co., Ltd. RTL8723AE PCIe Wireless Network Adapter [10ec:8723]
           +-16.0 Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB OHCI0 Controller [1002:4397]
           +-16.2 Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB EHCI Controller [1002:4396]
           +-18.0 Advanced Micro Devices, Inc. [AMD] Family 12h/14h Processor Function 0 [1022:1700]
           +-18.1 Advanced Micro Devices, Inc. [AMD] Family 12h/14h Processor Function 1 [1022:1701]
           +-18.2 Advanced Micro Devices, Inc. [AMD] Family 12h/14h Processor Function 2 [1022:1702]
           +-18.3 Advanced Micro Devices, Inc. [AMD] Family 12h/14h Processor Function 3 [1022:1703]
           +-18.4 Advanced Micro Devices, Inc. [AMD] Family 12h/14h Processor Function 4 [1022:1704]
           +-18.5 Advanced Micro Devices, Inc. [AMD] Family 12h/14h Processor Function 6 [1022:1718]
           +-18.6 Advanced Micro Devices, Inc. [AMD] Family 12h/14h Processor Function 5 [1022:1716]
           \-18.7 Advanced Micro Devices, Inc. [AMD] Family 12h/14h Processor Function 7 [1022:1719]
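To see which root port each bus hangs off (for example, bus 01 with the failing 82574L sits behind 00:04.0 in the tree above), the tree can be parsed with a rough Python sketch like the one below. The names BRIDGE and bridges_by_bus are invented for this example, and the pattern only handles bridge entries of the form "04.0-[01]" or "15.0-[06-07]" as they appear in this output.

```python
import re

# A bridge entry looks like "04.0-[01]" or "15.0-[06-07]":
# device.function, then the secondary (and optionally subordinate) bus.
BRIDGE = re.compile(r"([0-9a-f]{2}\.[0-7])-\[([0-9a-f]{2})(?:-([0-9a-f]{2}))?\]")


def bridges_by_bus(tree_text):
    """Map each downstream bus number (hex string) to its bridge dev.fn."""
    result = {}
    for line in tree_text.splitlines():
        match = BRIDGE.search(line)
        if not match:
            continue
        port = match.group(1)
        first = int(match.group(2), 16)
        last = int(match.group(3) or match.group(2), 16)
        for bus in range(first, last + 1):
            result[f"{bus:02x}"] = port
    return result
```

With that mapping, the "lspci -vvv" section for the port behind a failing device can be compared against the ports of the devices that still work.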
Thanks!
a.
View attachment "lspci.txt" of type "text/plain" (55772 bytes)