Message-ID: <CADaLNDkOLiyPmVC3VwpnbrfAKFmNwSVJKdChjQFdQqs7XP-Ddg@mail.gmail.com>
Date:	Tue, 28 Jul 2015 14:50:39 -0700
From:	Duc Dang <dhdang@....com>
To:	Bjorn Helgaas <bhelgaas@...gle.com>
Cc:	Tanmay Inamdar <tinamdar@....com>,
	"linux-pci@...r.kernel.org" <linux-pci@...r.kernel.org>,
	linux-arm <linux-arm-kernel@...ts.infradead.org>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: Re: X-Gene: Unhandled fault: synchronous external abort in pci_generic_config_read32

On Tue, Jul 28, 2015 at 2:29 PM, Bjorn Helgaas <bhelgaas@...gle.com> wrote:
> On Tue, Jul 28, 2015 at 10:45:26AM -0700, Duc Dang wrote:
>> On Tue, Jul 28, 2015 at 9:43 AM, Bjorn Helgaas <bhelgaas@...gle.com> wrote:
>> > On Fri, Jul 24, 2015 at 7:05 PM, Duc Dang <dhdang@....com> wrote:
>> >> Hi Bjorn,
>> >>
>> >> On Fri, Jul 24, 2015 at 3:42 PM, Bjorn Helgaas <bhelgaas@...gle.com> wrote:
>> >>>
>> >>> I regularly see faults like this on an APM X-Gene:
>> >>>
>> >>>   U-Boot 2013.04-mustang_sw_1.14.14 (Dec 16 2014 - 15:59:33)
>> >>>   CPU0: APM ARM 64-bit Potenza Rev B0 2400MHz PCP 2400MHz
>> >>>        32 KB ICACHE, 32 KB DCACHE
>> >>>        SOC 2000MHz IOBAXI 400MHz AXI 250MHz AHB 200MHz GFC 125MHz
>> >>>   ...
>> >>>   Unhandled fault: synchronous external abort (0x96000010) at 0xffffff8000110034
>> >>>   Internal error: : 96000010 [#1] SMP
>> >>>   Modules linked in:
>> >>>   CPU: 0 PID: 3723 Comm: ... 4.1.0-smp-DEV #3
>> >>>   Hardware name: APM X-Gene Mustang board (DT)
>> >>>   task: ffffffc7dc1a4140 ti: ffffffc7dc118000 task.ti: ffffffc7dc118000
>> >>>   PC is at pci_generic_config_read32+0x4c/0xb8
>> >>>   LR is at pci_generic_config_read32+0x40/0xb8
>> >>>   pc : [<ffffffc00033b90c>] lr : [<ffffffc00033b900>] pstate: 600001c5
>> >>>   ...
>> >>>   Call trace:
>> >>>   [<ffffffc00033b90c>] pci_generic_config_read32+0x4c/0xb8
>> >>>   [<ffffffc00033bf58>] pci_user_read_config_byte+0x60/0xc4
>> >>>   [<ffffffc0003496a8>] pci_read_config+0x15c/0x238
>> >>>   [<ffffffc0002393b4>] sysfs_kf_bin_read+0x68/0xa0
>> >>>   [<ffffffc00023896c>] kernfs_fop_read+0x9c/0x1ac
>> >>>   [<ffffffc0001c361c>] __vfs_read+0x44/0x128
>> >>>   [<ffffffc0001c3e28>] vfs_read+0x84/0x144
>> >>>   [<ffffffc0001c4764>] SyS_read+0x50/0xb0
>> >>
>> >> The log shows the kernel takes an exception when trying to access the
>> >> Mellanox card's configuration space. This is usually caused by
>> >> suboptimal PCIe SerDes parameters on your board, which result in poor
>> >> link quality.
>> >> The PCIe SerDes programming is done in U-Boot, so I suggest you
>> >> upgrade to our latest X-Gene U-Boot release.
>> >
>> > I installed U-Boot 1.15.12, which I thought was the latest.  I'm still
>> > seeing this issue regularly, approx once/hour.
>>
>> Our latest U-Boot is 1.15.15, but U-Boot 1.15.12 is already a good
>> version to use. Are you running any PCIe traffic test when the error
>> happens?
>
> Nope, the machine was either idle or running a reboot test; no PCIe stress
> test or anything.
>
>> And it will be useful if you can share your "lspci -vvv" output when
>> the board is running, we can check to see if there is any error status
>> reported.
>
> Here's some lspci output and info about the firmware I'm running.
> Obviously this lspci output was collected before a crash.  I have also
> seen lspci output where "CESta: RxErr+" was set on the 00:00.0 Root Port.
>
> U-Boot 2013.04-mustang_sw_1.15.12 (May 20 2015 - 10:03:33)
>
> CPU0: APM ARM 64-bit Potenza Rev B0 2400MHz PCP 2400MHz
>      32 KB ICACHE, 32 KB DCACHE
>      SOC 2000MHz IOBAXI 400MHz AXI 250MHz AHB 200MHz GFC 125MHz
> Boot from SPI-NOR
> Slimpro FW:
>         Ver: 2.4 (build 01.15.12.00 2015/05/20)
>         PMD: 970 mV
>         SOC: 950 mV
> Board: Mustang - AppliedMicro APM883208-xNA24SPT Reference Board
> I2C:   ready
> DRAM:  ECC 32 GiB @ 1600MHz
> SF: Detected N25Q256 with page size 256 Bytes, total 32 MiB
> MMC:   X-Gene SD/SDIO/eMMC: 0
> PCIE0: (RC) X8 GEN-3 link up
>   00:00.0     - 10e8:e004 - Bridge device
>    01:00.0    - 15b3:1007 - Network controller
>
> # lspci -vvv
> 00:00.0 PCI bridge: Applied Micro Circuits Corp. Device e004 (rev 04) (prog-if 00 [Normal decode])
>         Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
>         Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
>         Latency: 0
>         Bus: primary=00, secondary=01, subordinate=01, sec-latency=0
>         I/O behind bridge: 0000f000-00000fff
>         Memory behind bridge: 80000000-82ffffff
>         Prefetchable memory behind bridge: 0000000083000000-00000000830fffff
>         Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- <SERR- <PERR-
>         BridgeCtl: Parity- SERR- NoISA- VGA- MAbort- >Reset- FastB2B-
>                 PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
>         Capabilities: [40] Express (v2) Root Port (Slot+), MSI 00
>                 DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s <1us, L1 unlimited
>                         ExtTag- RBE+ FLReset-
>                 DevCtl: Report errors: Correctable- Non-Fatal+ Fatal+ Unsupported-
>                         RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
>                         MaxPayload 256 bytes, MaxReadReq 512 bytes
>                 DevSta: CorrErr+ UncorrErr- FatalErr+ UnsuppReq- AuxPwr- TransPend+
>                 LnkCap: Port #0, Speed unknown, Width x8, ASPM L0s L1, Latency L0 unlimited, L1 unlimited
>                         ClockPM- Surprise+ LLActRep+ BwNot+
>                 LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk+
>                         ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
>                 LnkSta: Speed unknown, Width x8, TrErr- Train- SlotClk+ DLActive+ BWMgmt+ ABWMgmt-
>                 SltCap: AttnBtn- PwrCtrl- MRL- AttnInd- PwrInd- HotPlug- Surprise-
>                         Slot #1, PowerLimit 10.000W; Interlock- NoCompl-
>                 SltCtl: Enable: AttnBtn- PwrFlt- MRL- PresDet- CmdCplt- HPIrq- LinkChg-
>                         Control: AttnInd Off, PwrInd Off, Power- Interlock-
>                 SltSta: Status: AttnBtn- PowerFlt- MRL- CmdCplt- PresDet- Interlock-
>                         Changed: MRL- PresDet- LinkState+
>                 RootCtl: ErrCorrectable- ErrNon-Fatal- ErrFatal- PMEIntEna- CRSVisible-
>                 RootCap: CRSVisible-
>                 RootSta: PME ReqID 0000, PMEStatus- PMEPending-
>                 DevCap2: Completion Timeout: Not Supported, TimeoutDis+ ARIFwd-
>                 DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- ARIFwd-
>                 LnkCtl2: Target Link Speed: Unknown, EnterCompliance- SpeedDis-, Selectable De-emphasis: -3.5dB

A Target Link Speed of "Unknown" is really strange. I also see the same
"Speed unknown" for the Mellanox card below.
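
For reference, here is a minimal sketch of how the speed and width fields
of a PCIe Link Status (LnkSta) register value are decoded. The function
name, the speed table, and the sample register values are illustrative
assumptions and are not taken from the board. A speed code that is missing
from the table falls through to "unknown", which is one way such a string
can show up even when the link itself trained.

```python
# Sketch only: decode the Current Link Speed and Negotiated Link Width
# fields of a 16-bit PCIe Link Status register value.
# The table and sample values are illustrative, not read from hardware.

DEFAULT_SPEEDS = {1: "2.5GT/s", 2: "5GT/s", 3: "8GT/s"}


def decode_lnksta(lnksta, speeds=None):
    """Return (speed_string, width) for a Link Status register value."""
    if speeds is None:
        speeds = DEFAULT_SPEEDS
    speed_code = lnksta & 0xF        # bits 3:0, Current Link Speed
    width = (lnksta >> 4) & 0x3F     # bits 9:4, Negotiated Link Width
    return speeds.get(speed_code, "unknown"), width


if __name__ == "__main__":
    sample = 0x0083  # hypothetical value: speed code 3, width x8
    print(decode_lnksta(sample))
    # Same value, decoded with a table that only knows two speed codes.
    print(decode_lnksta(sample, {1: "2.5GT/s", 2: "5GT/s"}))
```

The second call shows the effect of a decoder whose table does not include
the reported speed code. Whether that is what happens here is not
established.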

>                          Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
>                          Compliance De-emphasis: -6dB
>                 LnkSta2: Current De-emphasis Level: -6dB
>         Capabilities: [80] Power Management version 3
>                 Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
>                 Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
>         Capabilities: [100 v1] Advanced Error Reporting
>                 UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
>                 UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
>                 UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
>                 CESta:  RxErr+ BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
>                 CEMsk:  RxErr+ BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
>                 AERCap: First Error Pointer: 00, GenCap+ CGenEn- ChkCap+ ChkEn-
>         Capabilities: [180 v1] #19
>         Capabilities: [150 v1] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
>         Kernel driver in use: pcieport
>
> 01:00.0 Ethernet controller: Mellanox Technologies MT27520 Family
>         Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-

Mem and BusMaster are both disabled. Does that mean the card is not functional?
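
As a rough illustration of where those flags come from: the "Control:"
line is a bit-by-bit rendering of the 16-bit PCI Command register at
offset 0x04 of configuration space. The sketch below decodes it from a raw
config-space dump. The function names and the all-zero sample buffer are
assumptions made for the example.

```python
# Sketch only: decode the PCI Command register (offset 0x04) from a raw
# configuration-space dump and render it in the "Name+ / Name-" style.
import struct

# (bit position, label) pairs for the Command register.
COMMAND_BITS = [
    (0, "I/O"), (1, "Mem"), (2, "BusMaster"), (3, "SpecCycle"),
    (4, "MemWINV"), (5, "VGASnoop"), (6, "ParErr"), (7, "Stepping"),
    (8, "SERR"), (9, "FastB2B"), (10, "DisINTx"),
]


def decode_command(config):
    """Map each Command register flag name to True (set) or False (clear)."""
    (command,) = struct.unpack_from("<H", config, 0x04)  # little-endian u16
    return {name: bool(command & (1 << bit)) for bit, name in COMMAND_BITS}


def format_command(flags):
    """Render the flags as a single 'Name+ Name- ...' line."""
    return " ".join(name + ("+" if on else "-") for name, on in flags.items())


if __name__ == "__main__":
    # Hypothetical 64-byte header with the Command register cleared.
    blank = bytes(64)
    print("Control:", format_command(decode_command(blank)))
```

On a live system the same bytes can be read from the device's sysfs
"config" file, for example /sys/bus/pci/devices/0000:01:00.0/config,
assuming the card sits in PCI domain 0000.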

>         Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
>         Interrupt: pin A routed to IRQ 226
>         Region 0: [virtual] Memory at e182000000 (32-bit, non-prefetchable) [size=1M]
>         Region 2: [virtual] Memory at e180000000 (32-bit, non-prefetchable) [size=32M]
>         [virtual] Expansion ROM at e183000000 [disabled] [size=1M]
>         Capabilities: [40] Power Management version 3
>                 Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
>                 Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
>         Capabilities: [9c] MSI-X: Enable- Count=64 Masked-

This may be unrelated, but MSI-X is not enabled here, so MSI allocation
seems to have failed for this card.

>                 Vector table: BAR=0 offset=0007c000
>                 PBA: BAR=0 offset=0007d000
>         Capabilities: [60] Express (v2) Endpoint, MSI 00
>                 DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <64ns, L1 unlimited
>                         ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
>                 DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
>                         RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
>                         MaxPayload 128 bytes, MaxReadReq 512 bytes
>                 DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
>                 LnkCap: Port #8, Speed unknown, Width x8, ASPM L0s, Latency L0 unlimited, L1 unlimited
>                         ClockPM- Surprise- LLActRep- BwNot-
>                 LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk-
>                         ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
>                 LnkSta: Speed unknown, Width x8, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
>                 DevCap2: Completion Timeout: Range ABCD, TimeoutDis+
>                 DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-
>                 LnkCtl2: Target Link Speed: Unknown, EnterCompliance- SpeedDis-, Selectable De-emphasis: -6dB
>                          Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
>                          Compliance De-emphasis: -6dB
>                 LnkSta2: Current De-emphasis Level: -6dB
>         Capabilities: [100 v1] Alternative Routing-ID Interpretation (ARI)
>                 ARICap: MFVC- ACS-, Next Function: 0
>                 ARICtl: MFVC- ACS-, Function Group: 0
>         Capabilities: [148 v1] Device Serial Number xx-xx-xx-xx-xx-xx-xx-xx

The serial number here looks invalid. I have a different Mellanox model
(ConnectX-3, 15b3:1003) that shows a meaningful serial number:
Capabilities: [148 v1] Device Serial Number f4-52-14-03-00-0b-c2-30

Do you have another PCIe card you could try with the same reboot test on this board?

>         Capabilities: [154 v2] Advanced Error Reporting
>                 UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
>                 UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
>                 UESvrt: DLP+ SDES- TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
>                 CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
>                 CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
>                 AERCap: First Error Pointer: 00, GenCap+ CGenEn- ChkCap+ ChkEn-
>         Capabilities: [18c v1] #19
>         Kernel modules: mlx4_core

-- 
Regards,
Duc Dang.