Message-ID: <CAJtEV7ZtWVg7tLuznPmBce56ARLn7hzk6F6wwXnJ9vo-_AdLtQ@mail.gmail.com>
Date:	Wed, 23 Jul 2014 12:59:00 +0800
From:	Andrew Cooks <acooks@...il.com>
To:	Alexander Duyck <alexander.h.duyck@...el.com>
Cc:	"Fujinaka, Todd" <todd.fujinaka@...el.com>,
	Dmitry Lifshitz <lifshitz@...pulab.co.il>,
	netdev <netdev@...r.kernel.org>,
	"e1000-devel@...ts.sf.net" <e1000-devel@...ts.sf.net>,
	Igor Grinberg <grinberg@...pulab.co.il>,
	Linux NICS <linux.nics@...el.com>
Subject: Re: [E1000-devel] [linux-nics] Problem: 82574L device (e1000e
 driver): Reset adapter unexpectedly / transmit queue 0 timed out

On Tue, Jul 22, 2014 at 11:25 PM, Alexander Duyck
<alexander.h.duyck@...el.com> wrote:
>>>>> # lspci -vvnnk:
>>>>> 01:00.0 Ethernet controller [0200]: Intel Corporation 82574L Gigabit Network Connection [8086:10d3]
>>>>>         Subsystem: Intel Corporation Device [8086:0000]
>>>>>         Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
>>>>>         Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
>>>>>         Interrupt: pin A routed to IRQ 16
>>>>>         Region 0: [virtual] Memory at c1900000 (32-bit, non-prefetchable) [size=128K]
>>>>>         Region 1: [virtual] Memory at c1800000 (32-bit, non-prefetchable) [size=1M]
>>>>>         Region 2: I/O ports at 7000 [size=32]
>>>>>         Region 3: [virtual] Memory at c1920000 (32-bit, non-prefetchable) [size=16K]
>>>>>         [virtual] Expansion ROM at c1940000 [disabled] [size=256K]
>>>>>         Capabilities: [c8] Power Management version 2
>>>>>                 Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
>>>>>                 Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=1 PME-
>>>>>         Capabilities: [d0] MSI: Enable- Count=1/1 Maskable- 64bit+
>>>>>                 Address: 0000000000000000  Data: 0000
>>>>>         Capabilities: [e0] Express (v1) Endpoint, MSI 00
>>>>>                 DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <512ns, L1 <64us
>>>>>                         ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
>>>>>                 DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
>>>>>                         RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
>>>>>                         MaxPayload 128 bytes, MaxReadReq 512 bytes
>>>>>                 DevSta: CorrErr+ UncorrErr+ FatalErr- UnsuppReq+ AuxPwr+ TransPend-
>>>>>                 LnkCap: Port #1, Speed 2.5GT/s, Width x1, ASPM L0s L1, Latency L0 <128ns, L1 <64us
>>>>>                         ClockPM- Surprise- LLActRep- BwNot-
>>>>>                 LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk-
>>>>>                         ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
>>>>>                 LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
>>>>>         Capabilities: [a0] MSI-X: Enable- Count=5 Masked-
>>>>>                 Vector table: BAR=3 offset=00000000
>>>>>                 PBA: BAR=3 offset=00002000
>>>>>         Capabilities: [100 v1] Advanced Error Reporting
>>>>>                 UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq+ ACSViol-
>>>>>                 UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
>>>>>                 UESvrt: DLP+ SDES- TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
>>>>>                 CESta:  RxErr+ BadTLP+ BadDLLP+ Rollover- Timeout- NonFatalErr+
>>>>>                 CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
>>>>>                 AERCap: First Error Pointer: 14, GenCap- CGenEn- ChkCap- ChkEn-
>>>>>         Capabilities: [140 v1] Device Serial Number 00-01-c0-ff-ff-12-8a-64
>>>>>         Kernel driver in use: e1000e
>>>>>
>>>>>
>
> It looks like something bad happened on the PCIe bus based on the RxErr,
> BadTLP, BadDLLP, and NonFatalErr indicators all being set.  This could
> be an indication of a possible problem with the PCIe link on the system.

Thanks very much for explaining this. Is it correct to think that this
is likely to be a hardware problem?
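
For reference, the indicators named in the quoted reply come from the
CESta line of the Advanced Error Reporting capability, where a trailing
"+" marks a bit as set and "-" marks it as clear. A minimal sketch of
picking the set bits out of such a line, assuming the "lspci -vv" line
format shown in the quoted output; set_flags is a hypothetical helper
name, not part of any tool:

```python
def set_flags(line):
    """Return the flag names marked '+' (set) on one lspci status line."""
    # Drop everything up to the first colon (the "CESta:" style label,
    # plus any leading quote markers), then keep tokens ending in '+'.
    _, _, rest = line.partition(":")
    return [tok[:-1] for tok in rest.split() if tok.endswith("+")]


cesta = "CESta:  RxErr+ BadTLP+ BadDLLP+ Rollover- Timeout- NonFatalErr+"
print(set_flags(cesta))  # ['RxErr', 'BadTLP', 'BadDLLP', 'NonFatalErr']
```

This only handles simple flag lines such as CESta, UESta and DevSta;
lines that mix flags with other values would need more care.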

>
>>>>> # ethtool -d eth2
>>>>> MAC Registers
>>>>> -------------
>>>>> 0x00000: CTRL (Device control register)  0xFFFFFFFF
>>>>>       Endian mode (buffers):             big
>>>>>       Link reset:                        reset
>>>>>       Set link up:                       1
>>>>>       Invert Loss-Of-Signal:             yes
>>>>>       Receive flow control:              enabled
>>>>>       Transmit flow control:             enabled
>>>>>       VLAN mode:                         enabled
>>>>>       Auto speed detect:                 enabled
>>>>>       Speed select:                      not used
>>>>>       Force speed:                       yes
>>>>>       Force duplex:                      yes
>>>>> 0x00008: STATUS (Device status register) 0xFFFFFFFF
>>>>>       Duplex:                            full
>>>>>       Link up:                           link config
>>>>>       TBI mode:                          enabled
>>>>>       Link speed:                        not used
>>>>>       Bus type:                          PCI-X
>>>>>       Bus speed:                         133MHz
>>>>>       Bus width:                         64-bit
>>>>> 0x00100: RCTL (Receive control register) 0xFFFFFFFF
>>>>>       Receiver:                          enabled
>>>>>       Store bad packets:                 enabled
>>>>>       Unicast promiscuous:               enabled
>>>>>       Multicast promiscuous:             enabled
>>>>>       Long packet:                       enabled
>>>>>       Descriptor minimum threshold size: reserved
>>>>>       Broadcast accept mode:             accept
>>>>>       VLAN filter:                       enabled
>>>>>       Canonical form indicator:          enabled
>>>>>       Discard pause frames:              ignored
>>>>>       Pass MAC control frames:           pass
>>>>>       Receive buffer size:               4096
>>>>> 0x02808: RDLEN (Receive desc length)     0xFFFFFFFF
>>>>> 0x02810: RDH   (Receive desc head)       0xFFFFFFFF
>>>>> 0x02818: RDT   (Receive desc tail)       0xFFFFFFFF
>>>>> 0x02820: RDTR  (Receive delay timer)     0xFFFFFFFF
>>>>> 0x00400: TCTL (Transmit ctrl register)   0xFFFFFFFF
>>>>>       Transmitter:                       enabled
>>>>>       Pad short packets:                 enabled
>>>>>       Software XOFF Transmission:        enabled
>>>>>       Re-transmit on late collision:     enabled
>>>>> 0x03808: TDLEN (Transmit desc length)    0xFFFFFFFF
>>>>> 0x03810: TDH   (Transmit desc head)      0xFFFFFFFF
>>>>> 0x03818: TDT   (Transmit desc tail)      0xFFFFFFFF
>>>>> 0x03820: TIDV  (Transmit delay timer)    0xFFFFFFFF
>>>>> PHY type:                                unknown
>>>>>
>>>>>
>
> The device doesn't appear to be responding to MMIO reads based on the
> fact that all of the registers are returning all 1's.

Yes, this was the output of ethtool when the device became
non-functional. I wasn't sure whether it would be useful.
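
The "all 1's" condition described in the quoted reply can be checked
mechanically. A minimal sketch, assuming the "ethtool -d" register
header format shown in the dump above; parse_registers and
looks_unresponsive are hypothetical helper names:

```python
import re

# Matches register header lines such as
# "0x00000: CTRL (Device control register)  0xFFFFFFFF".
# The indented decoded-field lines below each register do not match.
REG_LINE = re.compile(r"^0x[0-9A-Fa-f]+:\s+(\w+)\s.*\s(0x[0-9A-Fa-f]{8})\s*$")


def parse_registers(dump):
    """Map register name to its 32-bit value from an 'ethtool -d' dump."""
    regs = {}
    for line in dump.splitlines():
        # Strip any leading email quote markers before matching.
        m = REG_LINE.match(line.lstrip("> "))
        if m:
            regs[m.group(1)] = int(m.group(2), 16)
    return regs


def looks_unresponsive(regs):
    """True if at least one register was parsed and every one reads all 1s."""
    return bool(regs) and all(v == 0xFFFFFFFF for v in regs.values())


sample = """\
0x00000: CTRL (Device control register)  0xFFFFFFFF
      Endian mode (buffers):             big
0x00008: STATUS (Device status register) 0xFFFFFFFF
0x02810: RDH   (Receive desc head)       0xFFFFFFFF
"""
print(looks_unresponsive(parse_registers(sample)))  # True
```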

> You should be able to recover from this error by issuing a PCIe device
> reset request via the sysfs interface (echo 1 >
> /sys/bus/pci/devices/0000\:01\:00.0/reset).  However that only resolves
> the issue after it has occurred.

Thanks.
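
For scripting the recovery step, the same sysfs write can be done from
Python. This is only a sketch of the "echo 1 > .../reset" command quoted
above: reset_pci_device and its sysfs_root parameter are hypothetical
names introduced here (the parameter exists so the function can be
exercised against a scratch directory), the write needs root, and the
reset file is assumed to be present only when the kernel has a reset
method for the device.

```python
from pathlib import Path


def reset_pci_device(address, sysfs_root="/sys"):
    """Request a reset of one PCI device by writing '1' to its sysfs file.

    address is the full domain:bus:device.function string,
    for example "0000:01:00.0".
    """
    reset_file = Path(sysfs_root) / "bus" / "pci" / "devices" / address / "reset"
    if not reset_file.exists():
        raise FileNotFoundError(f"no reset attribute for {address}: {reset_file}")
    reset_file.write_text("1\n")
    return reset_file


# Example call (run as root on the affected machine):
#   reset_pci_device("0000:01:00.0")
```

As the quoted reply says, this only recovers the device after the
failure has occurred; it does not address the cause.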

> One thing that would probably be useful would be to provide an "lspci
> -vvv" for the entire system.  That would at least give us an idea of the
> PCIe hierarchy and could help to tell us if the problem is something in
> the local PCIe hierarchy for the device, or if the problem is closer to
> the root complex.

I've attached the complete lspci output because it's too large to
include inline. I hope that's OK. It shows some interesting
differences between device 01:00.0 (the one that reported the errors)
and the other 82574L devices.
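
One way to surface per-device differences in a full "lspci -vvv" dump
is to split it into per-device blocks and diff them. A minimal sketch,
assuming the usual layout of one blank-line-separated block per device
with the address at the start of the first line; the helper names are
hypothetical, and the 02:00.0 values in the sample are invented for
illustration rather than taken from the attachment:

```python
import difflib


def split_devices(lspci_text):
    """Map each PCI address to the detail lines of its lspci block."""
    devices = {}
    for block in lspci_text.strip().split("\n\n"):
        lines = block.splitlines()
        address = lines[0].split(" ", 1)[0]
        devices[address] = [ln.strip() for ln in lines[1:]]
    return devices


def differing_lines(devices, a, b):
    """Return only the '-'/'+' lines of a unified diff between two devices."""
    diff = list(difflib.unified_diff(devices[a], devices[b],
                                     fromfile=a, tofile=b, lineterm="", n=0))
    # Skip the two file header lines and the "@@" hunk markers.
    return [ln for ln in diff[2:] if not ln.startswith("@@")]


sample = """\
01:00.0 Ethernet controller: Intel Corporation 82574L Gigabit Network Connection
\tControl: I/O- Mem- BusMaster- SpecCycle- MemWINV-
\tKernel driver in use: e1000e

02:00.0 Ethernet controller: Intel Corporation 82574L Gigabit Network Connection
\tControl: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV-
\tKernel driver in use: e1000e
"""
for line in differing_lines(split_devices(sample), "01:00.0", "02:00.0"):
    print(line)
```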

# lspci -tvvvnn
-[0000:00]-+-00.0  Advanced Micro Devices, Inc. [AMD] Family 14h Processor Root Complex [1022:1510]
           +-01.0  Advanced Micro Devices, Inc. [AMD/ATI] Wrestler [Radeon HD 6320] [1002:9806]
           +-01.1  Advanced Micro Devices, Inc. [AMD/ATI] Wrestler HDMI Audio [1002:1314]
           +-04.0-[01]----00.0  Intel Corporation 82574L Gigabit Network Connection [8086:10d3]
           +-05.0-[02]----00.0  Intel Corporation 82574L Gigabit Network Connection [8086:10d3]
           +-06.0-[03]----00.0  Intel Corporation 82574L Gigabit Network Connection [8086:10d3]
           +-07.0-[04]----00.0  Intel Corporation 82574L Gigabit Network Connection [8086:10d3]
           +-11.0  Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 SATA Controller [AHCI mode] [1002:4391]
           +-12.0  Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB OHCI0 Controller [1002:4397]
           +-12.2  Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB EHCI Controller [1002:4396]
           +-13.0  Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB OHCI0 Controller [1002:4397]
           +-13.2  Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB EHCI Controller [1002:4396]
           +-14.0  Advanced Micro Devices, Inc. [AMD/ATI] SBx00 SMBus Controller [1002:4385]
           +-14.3  Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 LPC host controller [1002:439d]
           +-14.4-[05]--
           +-14.5  Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB OHCI2 Controller [1002:4399]
           +-15.0-[06-07]----00.0  Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller [10ec:8168]
           +-15.1-[08]----00.0  Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller [10ec:8168]
           +-15.2-[09]----00.0  Realtek Semiconductor Co., Ltd. RTL8723AE PCIe Wireless Network Adapter [10ec:8723]
           +-16.0  Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB OHCI0 Controller [1002:4397]
           +-16.2  Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB EHCI Controller [1002:4396]
           +-18.0  Advanced Micro Devices, Inc. [AMD] Family 12h/14h Processor Function 0 [1022:1700]
           +-18.1  Advanced Micro Devices, Inc. [AMD] Family 12h/14h Processor Function 1 [1022:1701]
           +-18.2  Advanced Micro Devices, Inc. [AMD] Family 12h/14h Processor Function 2 [1022:1702]
           +-18.3  Advanced Micro Devices, Inc. [AMD] Family 12h/14h Processor Function 3 [1022:1703]
           +-18.4  Advanced Micro Devices, Inc. [AMD] Family 12h/14h Processor Function 4 [1022:1704]
           +-18.5  Advanced Micro Devices, Inc. [AMD] Family 12h/14h Processor Function 6 [1022:1718]
           +-18.6  Advanced Micro Devices, Inc. [AMD] Family 12h/14h Processor Function 5 [1022:1716]
           \-18.7  Advanced Micro Devices, Inc. [AMD] Family 12h/14h Processor Function 7 [1022:1719]


Thanks!

a.

View attachment "lspci.txt" of type "text/plain" (55772 bytes)
