linux-kernel - Re: [PATCH v14 10/34] PCI/AER: Update is_internal_error() to be non-static is_aer_internal


Message-ID: <aXJ7VKwM6xfH-42L@wunner.de>
Date: Thu, 22 Jan 2026 20:32:36 +0100
From: Lukas Wunner <lukas@...ner.de>
To: dan.j.williams@...el.com
Cc: Jonathan Cameron <jonathan.cameron@...wei.com>,
	Terry Bowman <terry.bowman@....com>, dave@...olabs.net,
	dave.jiang@...el.com, alison.schofield@...el.com,
	bhelgaas@...gle.com, shiju.jose@...wei.com, ming.li@...omail.com,
	Smita.KoralahalliChannabasappa@....com, rrichter@....com,
	dan.carpenter@...aro.org, PradeepVineshReddy.Kodamati@....com,
	Benjamin.Cheatham@....com,
	sathyanarayanan.kuppuswamy@...ux.intel.com,
	linux-cxl@...r.kernel.org, vishal.l.verma@...el.com,
	alucerop@....com, ira.weiny@...el.com, linux-kernel@...r.kernel.org,
	linux-pci@...r.kernel.org
Subject: Re: [PATCH v14 10/34] PCI/AER: Update is_internal_error() to be
 non-static is_aer_internal_error()

On Thu, Jan 22, 2026 at 11:09:48AM -0800, dan.j.williams@...el.com wrote:
> Lukas Wunner wrote:
> > a device possessing ECC RAM may raise
> > a Correctable Internal Error when ECC successfully recovers from flipped
> > bits because it allows alerting the user in advance that the device might
> > need to be replaced in the near future.  If ECC recovery fails, the device
> > might try to use a reserved spare portion of RAM in lieu of the failing one
> > and instruct the AER driver to recover through a bus reset.  Such errors
> > are not covered by the spec-defined types.  Using the Internal Error type
> > is the only possibility it seems.
> 
> The Internal Error type is a poor fit for that. This ECC RAM scenario is
> simply an internal device event, not a PCIe visible error case. Consider
> that CXL Memory Expanders are nothing if not "devices possessing ECC RAM"
> that may encounter correctable errors in that RAM. Yes, the user has need
> for those correctable errors to be reported, and no, PCIe AER has no reason
> to care about conveying those reports.

I'm not aware of a better PCIe spec-defined mechanism to report such
errors besides AER (Advanced Error *Reporting*), so I'm not sure why
you consider it a poor fit.

However, reporting corrected ECC errors is only half of the equation.
As stated above, if the ECC error is not correctable, the device may
choose to replace the faulty memory region with reserved spare memory,
but then a reset is required to recover from the error.  That is precisely
what the AER driver provides, so again I'm not sure why it's a poor fit.
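
To illustrate the two halves of that equation, here is a minimal
user-space sketch, not the kernel code itself.  It assumes the Internal
Error bit values from include/uapi/linux/pci_regs.h and selects the bit
by severity, which is the idea behind is_aer_internal_error().  All
sketch_* names are hypothetical stand-ins, and the "recover" action is a
simplification of what the AER driver actually does on uncorrectable
errors.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Internal Error bits, values as in include/uapi/linux/pci_regs.h */
#define PCI_ERR_COR_INTERNAL 0x00004000u /* Correctable Error Status */
#define PCI_ERR_UNC_INTN     0x00400000u /* Uncorrectable Error Status */

/* The bits sit in different registers, so severity must pick the mask. */
static_assert(PCI_ERR_COR_INTERNAL != PCI_ERR_UNC_INTN,
	      "internal error bits differ per status register");

/* Hypothetical stand-ins for the kernel's severity and error info types */
enum sketch_severity { SKETCH_CORRECTABLE, SKETCH_NONFATAL, SKETCH_FATAL };

struct sketch_err_info {
	enum sketch_severity severity;
	uint32_t status; /* status register matching the severity */
};

/* Test the Internal Error bit that belongs to the reported severity. */
static bool sketch_is_internal_error(const struct sketch_err_info *info)
{
	if (info->severity == SKETCH_CORRECTABLE)
		return (info->status & PCI_ERR_COR_INTERNAL) != 0;

	return (info->status & PCI_ERR_UNC_INTN) != 0;
}

/* Simplified outcomes: ignore, log only, or hand over to AER recovery */
enum sketch_action { SKETCH_IGNORE, SKETCH_LOG_ONLY, SKETCH_RECOVER };

/*
 * Corrected ECC event: log it so the user is alerted in advance.
 * Uncorrected ECC event: request recovery, which may involve a reset.
 */
static enum sketch_action sketch_handle(const struct sketch_err_info *info)
{
	if (!sketch_is_internal_error(info))
		return SKETCH_IGNORE;

	if (info->severity == SKETCH_CORRECTABLE)
		return SKETCH_LOG_ONLY;

	return SKETCH_RECOVER;
}
```

In this sketch a corrected ECC event maps to SKETCH_LOG_ONLY and an
uncorrected one to SKETCH_RECOVER; whether a real device signals the
latter as non-fatal or fatal is up to the device and not assumed here.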

> So if CXL saw no need to architect internal ECC events into AER, why does Xe
> think it is special in this regard?

The most charitable interpretation is that it's just the first mover
and others will follow.  Well, actually CXL is the first mover. ;)

> The CXL solution is simply a typical device interrupt that notifies
> new entries in the device event log. See trace_cxl_dram() and
> trace_cxl_general_media() for that event handling.

This seems to be based on CPER, which is not part of the PCIe Base Spec.
I can only guess that xe devices are intended to be used on non-ACPI
platforms as well, which may have led to the decision to use a
PCIe spec-defined mechanism.

Thanks,

Lukas