Message-ID: <0a44fd663e93ac5b36865b0080da52d94252791a.camel@xry111.site>
Date: Mon, 25 Mar 2024 01:19:40 +0800
From: Xi Ruoyao <xry111@...111.site>
To: Bjorn Helgaas <helgaas@...nel.org>
Cc: Grant Grundler <grundler@...omium.org>, bhelgaas@...gle.com, 
	linux-kernel@...r.kernel.org, linux-pci@...r.kernel.org, 
	linuxppc-dev@...ts.ozlabs.org, mahesh@...ux.ibm.com, oohall@...il.com, 
	rajat.khandelwal@...ux.intel.com, rajatja@...omium.org
Subject: Re: [PATCHv3 pci-next 1/2] PCI/AER: correctable error message as
 KERN_INFO

On Mon, 2023-09-18 at 14:39 -0500, Bjorn Helgaas wrote:
> On Mon, Sep 18, 2023 at 07:42:30PM +0800, Xi Ruoyao wrote:
> > ...
> 
> > My workstation suffers from too much correctable AER reporting as well
> > (related to Intel's errata "RPL013: Incorrectly Formed PCIe Packets May
> > Generate Correctable Errors" and/or the motherboard design, I guess).
> 
> We should rate-limit correctable error reporting so it's not
> overwhelming.
> 
> At the same time, I'm *also* interested in the cause of these errors,
> in case there's a Linux defect or a hardware erratum that we can work
> around.  Do you have a bug report with any more details, e.g., a dmesg
> log and "sudo lspci -vv" output?

Hi Bjorn,

Sorry for the *very* late reply (somehow I didn't see the reply at all
before it was removed by my cron job, and now I just salvaged it from
lore.kernel.org...)

The dmesg is like:

[  882.456994] pcieport 0000:00:1c.1: AER: Multiple Correctable error message received from 0000:00:1c.1
[  882.457002] pcieport 0000:00:1c.1: AER: found no error details for 0000:00:1c.1
[  882.457003] pcieport 0000:00:1c.1: AER: Multiple Correctable error message received from 0000:06:00.0
[  883.545763] pcieport 0000:00:1c.1: AER: Multiple Correctable error message received from 0000:00:1c.1
[  883.545789] pcieport 0000:00:1c.1: PCIe Bus Error: severity=Correctable, type=Physical Layer, (Receiver ID)
[  883.545790] pcieport 0000:00:1c.1:   device [8086:7a39] error status/mask=00000001/00002000
[  883.545792] pcieport 0000:00:1c.1:    [ 0] RxErr                  (First)
[  883.545794] pcieport 0000:00:1c.1: AER:   Error of this Agent is reported first
[  883.545798] r8169 0000:06:00.0: PCIe Bus Error: severity=Correctable, type=Physical Layer, (Transmitter ID)
[  883.545799] r8169 0000:06:00.0:   device [10ec:8125] error status/mask=00001101/0000e000
[  883.545800] r8169 0000:06:00.0:    [ 0] RxErr                  (First)
[  883.545801] r8169 0000:06:00.0:    [ 8] Rollover              
[  883.545802] r8169 0000:06:00.0:    [12] Timeout               
[  883.545815] pcieport 0000:00:1c.1: AER: Correctable error message received from 0000:00:1c.1
[  883.545823] pcieport 0000:00:1c.1: AER: found no error details for 0000:00:1c.1
[  883.545824] pcieport 0000:00:1c.1: AER: Multiple Correctable error message received from 0000:06:00.0
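
For reading the "error status/mask" words in the log above, here is a
minimal sketch that decodes them bit by bit. It assumes the bit positions
of the AER Correctable Error Status register and reuses the short names
that appear in the log (RxErr, Rollover, Timeout, ...). The helper names
decode() and decode_pair() are hypothetical and are not kernel functions.

```python
# Hypothetical helper, not kernel code: decode an AER correctable error
# status/mask pair as printed in the dmesg lines above.

# Assumed bit layout of the Correctable Error Status register, using the
# short names the kernel prints in its AER messages.
CORRECTABLE_BITS = {
    0: "RxErr",         # Receiver Error
    6: "BadTLP",        # Bad TLP
    7: "BadDLLP",       # Bad DLLP
    8: "Rollover",      # REPLAY_NUM Rollover
    12: "Timeout",      # Replay Timer Timeout
    13: "NonFatalErr",  # Advisory Non-Fatal Error
    14: "CorrIntErr",   # Corrected Internal Error
    15: "HeaderOF",     # Header Log Overflow
}


def decode(word):
    """Return the names of all bits set in a 32-bit status or mask word."""
    return [
        CORRECTABLE_BITS.get(bit, f"bit{bit}")
        for bit in range(32)
        if (word >> bit) & 1
    ]


def decode_pair(text):
    """Split a 'status/mask' string of hex words and decode both halves."""
    status, mask = (int(part, 16) for part in text.split("/"))
    return decode(status), decode(mask)


if __name__ == "__main__":
    # Values copied from the r8169 and pcieport lines in the log.
    for pair in ("00001101/0000e000", "00000001/00002000"):
        status, mask = decode_pair(pair)
        print(pair, "status:", status, "masked:", mask)
```

Under these assumptions, the r8169 status 00001101 decodes to RxErr,
Rollover and Timeout, matching the three lines the kernel printed for
that device.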

lspci output attached.

Intel has issued an erratum "RPL013" saying:

"Under complex microarchitectural conditions, the PCIe controller may
transmit an incorrectly formed Transaction Layer Packet (TLP), which
will fail CRC checks. When this erratum occurs, the PCIe end point may
record correctable errors resulting in either a NAK or link recovery.
Intel® has not observed any functional impact due to this erratum."

But I'm really unsure if it describes my issue.

Do you think I have some broken hardware and I should replace the CPU
and/or the motherboard (where the r8169 is soldered)?  I've noticed that
my 13900K is almost impossible to overclock (even though it's a K), but
I've not encountered any issue other than this AER reporting so far after I
gave up overclocking.

-- 
Xi Ruoyao <xry111@...111.site>
School of Aerospace Science and Technology, Xidian University

View attachment "lspci" of type "text/plain" (65554 bytes)