linux-kernel - Re: [PATCH v14 10/34] PCI/AER: Update is_internal_error() to be non-static is_aer_internal

Message-ID: <20260123122204.00003da3@huawei.com>
Date: Fri, 23 Jan 2026 12:22:04 +0000
From: Jonathan Cameron <jonathan.cameron@...wei.com>
To: <dan.j.williams@...el.com>
CC: Lukas Wunner <lukas@...ner.de>, Terry Bowman <terry.bowman@....com>,
	<dave@...olabs.net>, <dave.jiang@...el.com>, <alison.schofield@...el.com>,
	<bhelgaas@...gle.com>, <shiju.jose@...wei.com>, <ming.li@...omail.com>,
	<Smita.KoralahalliChannabasappa@....com>, <rrichter@....com>,
	<dan.carpenter@...aro.org>, <PradeepVineshReddy.Kodamati@....com>,
	<Benjamin.Cheatham@....com>, <sathyanarayanan.kuppuswamy@...ux.intel.com>,
	<linux-cxl@...r.kernel.org>, <vishal.l.verma@...el.com>, <alucerop@....com>,
	<ira.weiny@...el.com>, <linux-kernel@...r.kernel.org>,
	<linux-pci@...r.kernel.org>
Subject: Re: [PATCH v14 10/34] PCI/AER: Update is_internal_error() to be
 non-static is_aer_internal_error()

On Thu, 22 Jan 2026 13:32:08 -0800
dan.j.williams@...el.com wrote:

> Lukas Wunner wrote:
> > On Thu, Jan 22, 2026 at 11:09:48AM -0800, dan.j.williams@...el.com wrote:  
> > > Lukas Wunner wrote:  
> > > > a device possessing ECC RAM may raise
> > > > a Correctable Internal Error when ECC successfully recovers from flipped
> > > > bits because it allows alerting the user in advance that the device might
> > > > need to be replaced in the near future.  If ECC recovery fails, the device
> > > > might try to use a reserved spare portion of RAM in lieu of the failing one
> > > > and instruct the AER driver to recover through a bus reset.  Such errors
> > > > are not covered by the spec-defined types.  Using the Internal Error type
> > > > is the only possibility it seems.  
> > > 
> > > The Internal Error type is a poor fit for that. This ECC RAM scenario is
> > > simply an internal device event, not a PCIe visible error case. Consider
> > > that CXL Memory Expanders are nothing if not "devices possessing ECC RAM"
> > > that may encounter correctable errors in that RAM. Yes, the user has need
> > > for those correctable errors to be reported, and no, PCIe AER has no reason
> > > to care about conveying those reports.  
> > 
> > I'm not aware of a better PCIe spec-defined mechanism to report such
> > errors besides AER (Advanced Error *Reporting*), so I'm not sure why
> > you consider it a poor fit.  
> 
> PCIe spec has no role defining the internal error model of devices.
> Linux has reason to not endorse a blurring of the lines of where the
> PCIe error model ends and the device-specific error model begins. CXL
> respects those boundaries, Xe is pushing the boundary.

FWIW we have a bunch of older hardware where we could report this sort
of error either via AER or via an MSI. After some pushback years
ago, we flipped them all to the MSI path. That includes stuff that
triggers device resets.  I don't think it caused us too much trouble
to make that switch.

> 
> > However, reporting corrected ECC errors is only half of the equation.
> > As stated above, if the ECC error is not correctable, the device may
> > choose to replace the faulty memory region with reserved spare memory,
> > but then a reset is required to recover from the error.  Precisely what
> > the AER driver provides, so again I'm not sure why it's a poor fit.  
> 
> Again CXL has a model for this, those are the "post-package repair"
> events handled internally to the device / driver either transparently or
> user coordinated. No AER needed. In general devices have plenty of
> reasons that the driver determines they need to be reset, they do not
> need AER core help to reset themselves on error.
> 
> AER is there for link recovery.
> 
> > > So if CXL saw no need to architect internal ECC events into AER, why does Xe
> > > think it is special in this regard?  
> > 
> > The most charitable interpretation is that it's just the first mover
> > and others will follow.  Well actually CXL is the first mover. ;)  
> 
> ...first mover that helps clarify the role of AER that just happens to
> match the status quo that PCIe AER core ignores internal errors.
> 
> > > The CXL solution is simply a typical device interrupt that notifies
> > > new entries in the device event log. See trace_cxl_dram() and
> > > trace_cxl_general_media() for that event handling.  
> > 
> > This seems to be based on CPER, which is not part of the PCIe Base Spec.
> > I can only guess that xe devices are intended to be used on non-ACPI
> > platforms as well, which may have led to the decision to use a
> > PCIe spec-defined mechanism.  
> 
> CPER is a compatibility hack for operating systems that do not have native
> CXL drivers. The native support is just an interrupt fronting an event
> log retrieved with mailbox commands.

Just as a side note, CXL also has FW-specific interrupts, with a negotiation
process for whether they are used or MSI-X is used for event queues.
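
To make the "interrupt fronting an event log retrieved with mailbox
commands" model a bit more concrete, here is a rough, self-contained C
sketch of the general pattern. Everything in it is made up for
illustration: the sim_* names, the record layout, the log capacity and
the batch size are assumptions, not the CXL driver's code or the CXL
mailbox interface. It only shows the shape of the flow: the device
queues a record and raises an interrupt, and the handler drains the log
in batches without any involvement from the AER core.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical event record; real device records carry far more detail. */
enum sim_event_type { SIM_EVENT_GENERAL_MEDIA, SIM_EVENT_DRAM };

struct sim_event {
	enum sim_event_type type;
	uint64_t dpa;		/* device physical address of the error */
	bool correctable;
};

#define SIM_LOG_CAPACITY 8
#define SIM_BATCH 4

/* Simulated device: a ring buffer as the event log plus an IRQ flag. */
struct sim_device {
	struct sim_event log[SIM_LOG_CAPACITY];
	size_t head;		/* index of the oldest queued record */
	size_t count;		/* number of records currently queued */
	bool irq_pending;
};

struct sim_stats {
	unsigned int correctable;
	unsigned int uncorrectable;
};

/* Device side: queue a record and raise the (simulated) interrupt. */
bool sim_device_log_event(struct sim_device *dev, struct sim_event ev)
{
	if (dev->count == SIM_LOG_CAPACITY)
		return false;	/* log full, record dropped */

	dev->log[(dev->head + dev->count) % SIM_LOG_CAPACITY] = ev;
	dev->count++;
	dev->irq_pending = true;
	return true;
}

/* Stand-in for a "get event records" mailbox command. */
size_t sim_mbox_get_events(struct sim_device *dev, struct sim_event *out,
			   size_t max)
{
	size_t n = 0;

	assert(dev && out);
	while (n < max && dev->count > 0) {
		out[n++] = dev->log[dev->head];
		dev->head = (dev->head + 1) % SIM_LOG_CAPACITY;
		dev->count--;
	}
	return n;
}

/* Driver side: on interrupt, drain the log in batches until it is empty. */
size_t sim_irq_handler(struct sim_device *dev, struct sim_stats *stats)
{
	struct sim_event batch[SIM_BATCH];
	size_t total = 0, n;

	if (!dev->irq_pending)
		return 0;
	dev->irq_pending = false;

	while ((n = sim_mbox_get_events(dev, batch, SIM_BATCH)) > 0) {
		for (size_t i = 0; i < n; i++) {
			if (batch[i].correctable)
				stats->correctable++;
			else
				stats->uncorrectable++;
		}
		total += n;
	}
	return total;
}
```

In this sketch a caller would zero-initialise a struct sim_device and a
struct sim_stats, feed records in with sim_device_log_event(), and call
sim_irq_handler() whenever irq_pending is set. A real driver would emit
a trace event per record rather than just count them.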

Jonathan