[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <2149597.8uJZFlvqrj@xrated>
Date: Wed, 15 Jul 2020 10:11:11 +0200
From: Hans-Peter Jansen <hpj@...la.net>
To: linux-kernel@...r.kernel.org
Subject: Re: AMD PCI Bridge: Hardware error from APEI
Am Samstag, 11. Juli 2020, 18:32:21 CEST schrieben Sie:
> Am Dienstag, 7. Juli 2020, 08:56:41 CEST schrieben Sie:
> > Am Samstag, 27. Juni 2020, 20:23:35 CEST schrieben Sie:
> > > Dear hacker from the order of the penguins,
> > >
> > > we're facing a disturbing issue here after swapping a motherboard of a
> > > mission critical system:
> > >
> > > Specs:
> > > ASUS KNPA-U16 with an AMD EPYC 7261, 2x32 GB Kingston KSM26RD4/32MEI
> > > (officially supported RAM modules)
> > >
> > > openSUSE 15.1, Kernel 5.7.5
> >
> > Not sure, how to proceed with this one?
> >
> > After 9½ days uptime, it cumulated about 34,000 incidents:
> >
> > [...]
> >
> > Needless so say, that this is no permanent solution.
> >
> > Any ideas anybody?
>
> After swapping the PCIe slot for the Digital Devices Max S8 4/8, the error
> has moved:
>
> 2020-07-11T18:25:34.380002+02:00 tyrex kernel: [ 889.223783] {20}[Hardware
> Error]: Hardware error from APEI Generic Hardware Error Source: 4
> 2020-07-11T18:25:34.380025+02:00 tyrex kernel: [ 889.223787] {20}[Hardware
> Error]: It has been corrected by h/w and requires no further action
> 2020-07-11T18:25:34.380028+02:00 tyrex kernel: [ 889.223789] {20}[Hardware
> Error]: event severity: corrected 2020-07-11T18:25:34.380031+02:00 tyrex
> kernel: [ 889.223791] {20}[Hardware Error]: Error 0, type: corrected
> 2020-07-11T18:25:34.380032+02:00 tyrex kernel: [ 889.223793] {20}[Hardware
> Error]: fru_text: PcieError 2020-07-11T18:25:34.380034+02:00 tyrex kernel:
> [ 889.223795] {20}[Hardware Error]: section_type: PCIe error
> 2020-07-11T18:25:34.380577+02:00 tyrex kernel: [ 889.223796] {20}[Hardware
> Error]: port_type: 4, root port 2020-07-11T18:25:34.380586+02:00 tyrex
> kernel: [ 889.223798] {20}[Hardware Error]: version: 0.2
> 2020-07-11T18:25:34.380588+02:00 tyrex kernel: [ 889.223800] {20}[Hardware
> Error]: command: 0x0407, status: 0x0010 2020-07-11T18:25:34.380590+02:00
> tyrex kernel: [ 889.223802] {20}[Hardware Error]: device_id:
> 0000:40:03.1 2020-07-11T18:25:34.380591+02:00 tyrex kernel: [ 889.223803]
> {20}[Hardware Error]: slot: 16 2020-07-11T18:25:34.380593+02:00 tyrex
> kernel: [ 889.223804] {20}[Hardware Error]: secondary_bus: 0x41
> 2020-07-11T18:25:34.380595+02:00 tyrex kernel: [ 889.223806] {20}[Hardware
> Error]: vendor_id: 0x1022, device_id: 0x1453
> 2020-07-11T18:25:34.380597+02:00 tyrex kernel: [ 889.223808] {20}[Hardware
> Error]: class_code: 060400 2020-07-11T18:25:34.380599+02:00 tyrex kernel:
> [ 889.223810] {20}[Hardware Error]: bridge: secondary_status: 0x2000,
> control: 0x0012 2020-07-11T18:25:34.380601+02:00 tyrex kernel: [
> 889.223908] pcieport 0000:40:03.1: AER: aer_status: 0x00001000, aer_mask:
> 0x00006000 2020-07-11T18:25:34.380603+02:00 tyrex kernel: [ 889.223912]
> pcieport 0000:40:03.1: AER: [12] Timeout
> 2020-07-11T18:25:34.380605+02:00 tyrex kernel: [ 889.223915] pcieport
> 0000:40:03.1: AER: aer_layer=Data Link Layer, aer_agent=Transmitter ID
>
> It looks like the system is creating such devices on demand:
>
> 40:03.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models
> 00h-0fh) PCIe GPP Bridge (prog-if 00 [Normal decode]) Flags: bus master,
> fast devsel, latency 0, IRQ 39, NUMA node 2 Bus: primary=40, secondary=41,
> subordinate=41, sec-latency=0 I/O behind bridge: None
> Memory behind bridge: e5d00000-e5dfffff [size=1M]
> Prefetchable memory behind bridge: None
> Capabilities: [50] Power Management version 3
> Capabilities: [58] Express Root Port (Slot+), MSI 00
> Capabilities: [a0] MSI: Enable+ Count=1/1 Maskable- 64bit+
> Capabilities: [c0] Subsystem: Advanced Micro Devices, Inc. [AMD]
> Family 17h (Models 00h-0fh) PCIe GPP Bridge Capabilities: [c8]
> HyperTransport: MSI Mapping Enable+ Fixed+ Capabilities: [100] Vendor
> Specific Information: ID=0001 Rev=1 Len=010 <?> Capabilities: [150]
> Advanced Error Reporting
> Capabilities: [270] #19
> Capabilities: [2a0] Access Control Services
> Capabilities: [370] L1 PM Substates
> Capabilities: [380] Downstream Port Containment
> Capabilities: [3c4] #23
> Kernel driver in use: pcieport
>
> in order to handle:
>
> 41:00.0 Multimedia controller: Digital Devices GmbH Max
> Subsystem: Digital Devices GmbH Max S8 4/8
> Flags: bus master, fast devsel, latency 0, IRQ 181, NUMA node 2
> Memory at e5d00000 (64-bit, non-prefetchable) [size=64K]
> Capabilities: [50] Power Management version 3
> Capabilities: [70] MSI: Enable- Count=1/2 Maskable- 64bit+
> Capabilities: [90] Express Endpoint, MSI 00
> Capabilities: [100] Vendor Specific Information: ID=0000 Rev=0
> Len=00c <?> Kernel driver in use: ddbridge
> Kernel modules: ddbridge
Here's the initialization sequence of these devices:
Jul 13 12:19:27 tyrex kernel: pci 0000:40:03.1: [1022:1453] type 01 class 0x060400
Jul 13 12:19:27 tyrex kernel: pci 0000:40:03.1: PME# supported from D0 D3hot D3cold
Jul 13 12:19:27 tyrex kernel: pci 0000:41:00.0: [dd01:0007] type 00 class 0x048000
Jul 13 12:19:27 tyrex kernel: pci 0000:41:00.0: reg 0x10: [mem 0xe5d00000-0xe5d0ffff 64bit]
Jul 13 12:19:27 tyrex kernel: pci 0000:40:03.1: PCI bridge to [bus 41]
Jul 13 12:19:27 tyrex kernel: pci 0000:40:03.1: bridge window [mem 0xe5d00000-0xe5dfffff]
Jul 13 12:19:27 tyrex kernel: pci 0000:40:03.1: PCI bridge to [bus 41]
Jul 13 12:19:27 tyrex kernel: pci 0000:40:03.1: bridge window [mem 0xe5d00000-0xe5dfffff]
Jul 13 12:19:27 tyrex kernel: pci 0000:40:03.1: Adding to iommu group 41
Jul 13 12:19:27 tyrex kernel: pci 0000:41:00.0: Adding to iommu group 47
Jul 13 12:19:27 tyrex kernel: pcieport 0000:40:03.1: PME: Signaling with IRQ 39
Jul 13 12:19:27 tyrex kernel: pcieport 0000:40:03.1: AER: enabled with IRQ 39
Jul 13 12:19:27 tyrex kernel: pcieport 0000:40:03.1: DPC: error containment capabilities: Int Msg #0, RPExt+ PoisonedTLP+ SwTrigger+ RP PIO Log 6, DL_ActiveErr+
The last line is somewhat suspicious, but hard to decipher:
DPC: error containment capabilities: Int Msg #0, RPExt+ PoisonedTLP+ SwTrigger+ RP PIO Log 6, DL_ActiveErr+
I'm pretty sure, this is related, but the deeper meaning is denied me.
Would be nice, if some enlightened person could shed some light
into this abyss.
Pete
Powered by blists - more mailing lists