Message-ID: <2149597.8uJZFlvqrj@xrated>
Date:   Wed, 15 Jul 2020 10:11:11 +0200
From:   Hans-Peter Jansen <hpj@...la.net>
To:     linux-kernel@...r.kernel.org
Subject: Re: AMD PCI Bridge: Hardware error from APEI

On Saturday, 11 July 2020, 18:32:21 CEST, you wrote:
> On Tuesday, 7 July 2020, 08:56:41 CEST, you wrote:
> > On Saturday, 27 June 2020, 20:23:35 CEST, you wrote:
> > > Dear hacker from the order of the penguins,
> > > 
> > > we're facing a disturbing issue here after swapping the motherboard of a
> > > mission-critical system:
> > > 
> > > Specs:
> > > ASUS KNPA-U16 with an AMD EPYC 7261, 2x32 GB Kingston KSM26RD4/32MEI
> > > (officially supported RAM modules)
> > > 
> > > openSUSE 15.1, Kernel 5.7.5
> > 
> > Not sure how to proceed with this one.
> > 
> > After 9½ days of uptime, it has accumulated about 34,000 incidents:
> > 
> > [...]
> > 
> > Needless to say, this is no permanent solution.
> > 
> > Any ideas anybody?
> 
> After moving the Digital Devices Max S8 4/8 to a different PCIe slot, the
> error has moved:
> 
> 2020-07-11T18:25:34.380002+02:00 tyrex kernel: [  889.223783] {20}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
> 2020-07-11T18:25:34.380025+02:00 tyrex kernel: [  889.223787] {20}[Hardware Error]: It has been corrected by h/w and requires no further action
> 2020-07-11T18:25:34.380028+02:00 tyrex kernel: [  889.223789] {20}[Hardware Error]: event severity: corrected
> 2020-07-11T18:25:34.380031+02:00 tyrex kernel: [  889.223791] {20}[Hardware Error]:  Error 0, type: corrected
> 2020-07-11T18:25:34.380032+02:00 tyrex kernel: [  889.223793] {20}[Hardware Error]:  fru_text: PcieError
> 2020-07-11T18:25:34.380034+02:00 tyrex kernel: [  889.223795] {20}[Hardware Error]:   section_type: PCIe error
> 2020-07-11T18:25:34.380577+02:00 tyrex kernel: [  889.223796] {20}[Hardware Error]:   port_type: 4, root port
> 2020-07-11T18:25:34.380586+02:00 tyrex kernel: [  889.223798] {20}[Hardware Error]:   version: 0.2
> 2020-07-11T18:25:34.380588+02:00 tyrex kernel: [  889.223800] {20}[Hardware Error]:   command: 0x0407, status: 0x0010
> 2020-07-11T18:25:34.380590+02:00 tyrex kernel: [  889.223802] {20}[Hardware Error]:   device_id: 0000:40:03.1
> 2020-07-11T18:25:34.380591+02:00 tyrex kernel: [  889.223803] {20}[Hardware Error]:   slot: 16
> 2020-07-11T18:25:34.380593+02:00 tyrex kernel: [  889.223804] {20}[Hardware Error]:   secondary_bus: 0x41
> 2020-07-11T18:25:34.380595+02:00 tyrex kernel: [  889.223806] {20}[Hardware Error]:   vendor_id: 0x1022, device_id: 0x1453
> 2020-07-11T18:25:34.380597+02:00 tyrex kernel: [  889.223808] {20}[Hardware Error]:   class_code: 060400
> 2020-07-11T18:25:34.380599+02:00 tyrex kernel: [  889.223810] {20}[Hardware Error]:   bridge: secondary_status: 0x2000, control: 0x0012
> 2020-07-11T18:25:34.380601+02:00 tyrex kernel: [  889.223908] pcieport 0000:40:03.1: AER: aer_status: 0x00001000, aer_mask: 0x00006000
> 2020-07-11T18:25:34.380603+02:00 tyrex kernel: [  889.223912] pcieport 0000:40:03.1: AER:    [12] Timeout
> 2020-07-11T18:25:34.380605+02:00 tyrex kernel: [  889.223915] pcieport 0000:40:03.1: AER: aer_layer=Data Link Layer, aer_agent=Transmitter ID
> 
> It looks like the system is creating such devices on demand:
> 
> 40:03.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe GPP Bridge (prog-if 00 [Normal decode])
>         Flags: bus master, fast devsel, latency 0, IRQ 39, NUMA node 2
>         Bus: primary=40, secondary=41, subordinate=41, sec-latency=0
>         I/O behind bridge: None
>         Memory behind bridge: e5d00000-e5dfffff [size=1M]
>         Prefetchable memory behind bridge: None
>         Capabilities: [50] Power Management version 3
>         Capabilities: [58] Express Root Port (Slot+), MSI 00
>         Capabilities: [a0] MSI: Enable+ Count=1/1 Maskable- 64bit+
>         Capabilities: [c0] Subsystem: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe GPP Bridge
>         Capabilities: [c8] HyperTransport: MSI Mapping Enable+ Fixed+
>         Capabilities: [100] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
>         Capabilities: [150] Advanced Error Reporting
>         Capabilities: [270] #19
>         Capabilities: [2a0] Access Control Services
>         Capabilities: [370] L1 PM Substates
>         Capabilities: [380] Downstream Port Containment
>         Capabilities: [3c4] #23
>         Kernel driver in use: pcieport
> 
> in order to handle:
> 
> 41:00.0 Multimedia controller: Digital Devices GmbH Max
>         Subsystem: Digital Devices GmbH Max S8 4/8
>         Flags: bus master, fast devsel, latency 0, IRQ 181, NUMA node 2
>         Memory at e5d00000 (64-bit, non-prefetchable) [size=64K]
>         Capabilities: [50] Power Management version 3
>         Capabilities: [70] MSI: Enable- Count=1/2 Maskable- 64bit+
>         Capabilities: [90] Express Endpoint, MSI 00
>         Capabilities: [100] Vendor Specific Information: ID=0000 Rev=0 Len=00c <?>
>         Kernel driver in use: ddbridge
>         Kernel modules: ddbridge
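
For reference, the aer_status and aer_mask values in the quoted log can be
decoded bit by bit. The following is only a minimal sketch in Python: the bit
names in the table are my reading of the AER Correctable Error Status
register layout in the PCIe specification, not something taken from this
machine, and the function name decode_correctable is made up for
illustration.

```python
# Bit positions of the PCIe AER Correctable Error Status / Mask registers.
# Assumption: names follow my reading of the PCIe specification.
CORRECTABLE_BITS = {
    0: "Receiver Error",
    6: "Bad TLP",
    7: "Bad DLLP",
    8: "REPLAY_NUM Rollover",
    12: "Replay Timer Timeout",
    13: "Advisory Non-Fatal Error",
    14: "Corrected Internal Error",
    15: "Header Log Overflow",
}


def decode_correctable(value):
    """Return the names of all bits set in a 32-bit status or mask value."""
    names = []
    for bit in range(32):
        if value & (1 << bit):
            # Bits missing from the table are reported as reserved.
            names.append(CORRECTABLE_BITS.get(bit, "reserved bit %d" % bit))
    return names


# Values copied from the "AER: aer_status: ..., aer_mask: ..." log line.
aer_status = 0x00001000
aer_mask = 0x00006000

print("status:  ", decode_correctable(aer_status))
print("masked:  ", decode_correctable(aer_mask))
print("unmasked:", decode_correctable(aer_status & ~aer_mask))
```

If that table is right, the only status bit set is bit 12, which matches the
"[12] Timeout" line, and the mask covers bits 13 and 14 only, so the timeout
is not masked.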

Here's the initialization sequence of these devices:

Jul 13 12:19:27 tyrex kernel: pci 0000:40:03.1: [1022:1453] type 01 class 0x060400
Jul 13 12:19:27 tyrex kernel: pci 0000:40:03.1: PME# supported from D0 D3hot D3cold
Jul 13 12:19:27 tyrex kernel: pci 0000:41:00.0: [dd01:0007] type 00 class 0x048000
Jul 13 12:19:27 tyrex kernel: pci 0000:41:00.0: reg 0x10: [mem 0xe5d00000-0xe5d0ffff 64bit]
Jul 13 12:19:27 tyrex kernel: pci 0000:40:03.1: PCI bridge to [bus 41]
Jul 13 12:19:27 tyrex kernel: pci 0000:40:03.1:   bridge window [mem 0xe5d00000-0xe5dfffff]
Jul 13 12:19:27 tyrex kernel: pci 0000:40:03.1: PCI bridge to [bus 41]
Jul 13 12:19:27 tyrex kernel: pci 0000:40:03.1:   bridge window [mem 0xe5d00000-0xe5dfffff]
Jul 13 12:19:27 tyrex kernel: pci 0000:40:03.1: Adding to iommu group 41
Jul 13 12:19:27 tyrex kernel: pci 0000:41:00.0: Adding to iommu group 47
Jul 13 12:19:27 tyrex kernel: pcieport 0000:40:03.1: PME: Signaling with IRQ 39
Jul 13 12:19:27 tyrex kernel: pcieport 0000:40:03.1: AER: enabled with IRQ 39
Jul 13 12:19:27 tyrex kernel: pcieport 0000:40:03.1: DPC: error containment capabilities: Int Msg #0, RPExt+ PoisonedTLP+ SwTrigger+ RP PIO Log 6, DL_ActiveErr+

The last line is somewhat suspicious, but hard to decipher:

DPC: error containment capabilities: Int Msg #0, RPExt+ PoisonedTLP+ SwTrigger+ RP PIO Log 6, DL_ActiveErr+
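
The line can at least be split mechanically into its fields. A minimal sketch
in Python follows; parse_dpc_caps is a made-up helper, and the field notes in
the comments are only my reading of the Downstream Port Containment
capability register in the PCIe specification, so they should be checked
against the spec before relying on them.

```python
import re

# Assumed meanings of the fields (my reading of the DPC capability register):
#   Int Msg       DPC interrupt message number
#   RPExt         Root Port extensions for DPC supported
#   PoisonedTLP   poisoned TLP egress blocking supported
#   SwTrigger     software triggering of DPC supported
#   RP PIO Log    RP PIO log size
#   DL_ActiveErr  DL_Active ERR_COR signaling supported

MARKER = "error containment capabilities:"


def parse_dpc_caps(line):
    """Split a 'DPC: error containment capabilities' line into a dict."""
    if MARKER not in line:
        raise ValueError("not a DPC capabilities line")
    text = line.split(MARKER, 1)[1]

    caps = {}
    # Numeric fields.
    msg = re.search(r"Int Msg #(\d+)", text)
    log = re.search(r"RP PIO Log (\d+)", text)
    if msg:
        caps["Int Msg"] = int(msg.group(1))
    if log:
        caps["RP PIO Log"] = int(log.group(1))
    # Flag fields: a trailing '+' means supported, '-' means not supported.
    for name, sign in re.findall(r"(\w+)([+-])", text):
        caps[name] = sign == "+"
    return caps


line = ("DPC: error containment capabilities: Int Msg #0, RPExt+ "
        "PoisonedTLP+ SwTrigger+ RP PIO Log 6, DL_ActiveErr+")
for key, value in parse_dpc_caps(line).items():
    print("%-13s %s" % (key, value))
```

As far as I can tell, every flag in this line carries a '+', so the line
would describe what the port supports rather than report a particular event.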

I'm pretty sure this is related, but its deeper meaning eludes me.

It would be nice if some enlightened person could shed some light
into this abyss.

Pete
