Message-ID: <697275fcc1686_309510085@dwillia2-mobl4.notmuch>
Date: Thu, 22 Jan 2026 11:09:48 -0800
From: <dan.j.williams@...el.com>
To: Lukas Wunner <lukas@...ner.de>, <dan.j.williams@...el.com>
CC: Jonathan Cameron <jonathan.cameron@...wei.com>, Terry Bowman
<terry.bowman@....com>, <dave@...olabs.net>, <dave.jiang@...el.com>,
<alison.schofield@...el.com>, <bhelgaas@...gle.com>, <shiju.jose@...wei.com>,
<ming.li@...omail.com>, <Smita.KoralahalliChannabasappa@....com>,
<rrichter@....com>, <dan.carpenter@...aro.org>,
<PradeepVineshReddy.Kodamati@....com>, <Benjamin.Cheatham@....com>,
<sathyanarayanan.kuppuswamy@...ux.intel.com>, <linux-cxl@...r.kernel.org>,
<vishal.l.verma@...el.com>, <alucerop@....com>, <ira.weiny@...el.com>,
<linux-kernel@...r.kernel.org>, <linux-pci@...r.kernel.org>
Subject: Re: [PATCH v14 10/34] PCI/AER: Update is_internal_error() to be
non-static is_aer_internal_error()

Lukas Wunner wrote:
> On Thu, Jan 15, 2026 at 12:42:36PM -0800, dan.j.williams@...el.com wrote:
> > I agree with the general sentiment, but not the conclusion, especially
> > because this is a private detail. Linux has long ignored internal
> > errors. The only reason to consider them now is because CXL decided to
> > multiplex its error model on top of this oft-ignored feature of PCIe
> > AER.
> >
> > Specifically, portdrv.h is not in the global include namespace, this is
> > a private detail of the only consumer of internal errors:
> > drivers/pci/pcie/aer_cxl_{rch,vh}.c
> >
> > At most we should have this as a comment to clarify:
> >
> > /*
> > * Note, internal errors are only considered for the CXL error model,
> > * not for other implementations.
> > */
> >
> > ...and the pci_aer_unmask_internal_errors() export should be:
> >
> > EXPORT_SYMBOL_FOR_MODULES(pci_aer_unmask_internal_errors, "cxl_core")
> >
> > ...for the same reason. Steer folks away from thinking that it is open
> > season for adding more internal error support.
>
> It's not like Internal Errors are a bad thing per se. They're a way
> to signal "other" errors besides the spec-defined ones.
>
> As an example, and I'm keeping this in general terms to avoid divulging
> information about future products, a device possessing ECC RAM may raise
> a Correctable Internal Error when ECC successfully recovers from flipped
> bits because it allows alerting the user in advance that the device might
> need to be replaced in the near future. If ECC recovery fails, the device
> might try to use a reserved spare portion of RAM in lieu of the failing one
> and instruct the AER driver to recover through a bus reset. Such errors
> are not covered by the spec-defined types. Using the Internal Error type
> is the only possibility it seems.

The Internal Error type is a poor fit for that. This ECC RAM scenario is simply
an internal device event, not a PCIe-visible error case. Consider that CXL
Memory Expanders are nothing if not "devices possessing ECC RAM" that may
encounter correctable errors in that RAM. Yes, the user has need for those
correctable errors to be reported, and no, PCIe AER has no reason to care about
conveying those reports. CXL bypasses AER for internal ECC RAM events.

PCIe AER only notices device-internal ECC RAM events in the case where a PCIe
transaction encounters an error. For example, a completer abort attempting to
pull from bad RAM.

So if CXL saw no need to architect internal ECC events into AER, why does Xe
think it is special in this regard?

The CXL solution is simply a typical device interrupt that notifies new entries
in the device event log. See trace_cxl_dram() and trace_cxl_general_media() for
that event handling.

> My point is, there are valid (upcoming, not theoretical) use cases for
> Internal Errors and creating infrastructure in the kernel to take advantage
> of them is a good thing. Hence my continued pushing back on hiding or
> discouraging their use.

It is fine to look ahead, but I would not go so far as to pull future
requirements into a present patch set, especially when those future
requirements are suspect.