Message-ID: <aXInXZiTCSN06si8@wunner.de>
Date: Thu, 22 Jan 2026 14:34:21 +0100
From: Lukas Wunner <lukas@...ner.de>
To: dan.j.williams@...el.com
Cc: Jonathan Cameron <jonathan.cameron@...wei.com>,
Terry Bowman <terry.bowman@....com>, dave@...olabs.net,
dave.jiang@...el.com, alison.schofield@...el.com,
bhelgaas@...gle.com, shiju.jose@...wei.com, ming.li@...omail.com,
Smita.KoralahalliChannabasappa@....com, rrichter@....com,
dan.carpenter@...aro.org, PradeepVineshReddy.Kodamati@....com,
Benjamin.Cheatham@....com,
sathyanarayanan.kuppuswamy@...ux.intel.com,
linux-cxl@...r.kernel.org, vishal.l.verma@...el.com,
alucerop@....com, ira.weiny@...el.com, linux-kernel@...r.kernel.org,
linux-pci@...r.kernel.org
Subject: Re: [PATCH v14 10/34] PCI/AER: Update is_internal_error() to be
non-static is_aer_internal_error()

On Thu, Jan 15, 2026 at 12:42:36PM -0800, dan.j.williams@...el.com wrote:
> I agree with the general sentiment, but not the conclusion, especially
> because this is a private detail. Linux has long ignored internal
> errors. The only reason to consider them now is because CXL decided to
> multiplex its error model on top of this oft-ignored feature of PCIe
> AER.
>
> Specifically, portdrv.h is not in the global include namespace, this is
> a private detail of the only consumer of internal errors:
> drivers/pci/pcie/aer_cxl_{rch,vh}.c
>
> At most we should have this as a comment to clarify:
>
> /*
> * Note, internal errors are only considered for the CXL error model,
> * not for other implementations.
> */
>
> ...and the pci_aer_unmask_internal_errors() export should be:
>
> EXPORT_SYMBOL_FOR_MODULES(pci_aer_unmask_internal_errors, "cxl_core")
>
> ...for the same reason. Steer folks away from thinking that it is open
> season for adding more internal error support.

It's not like Internal Errors are a bad thing per se. They're a way
to signal "other" errors besides the spec-defined ones.

As an example, and I'm keeping this in general terms to avoid divulging
information about future products, a device possessing ECC RAM may raise
a Correctable Internal Error when ECC successfully recovers from flipped
bits because it allows alerting the user in advance that the device might
need to be replaced in the near future. If ECC recovery fails, the device
might try to use a reserved spare portion of RAM in lieu of the failing one
and instruct the AER driver to recover through a bus reset. Such errors
are not covered by the spec-defined types. Using the Internal Error type
seems to be the only option.
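
To make the ECC example concrete, here is a rough userspace sketch of
that policy. It is not code from this series. The two masks mirror
PCI_ERR_COR_INTERNAL and PCI_ERR_UNC_INTN from
include/uapi/linux/pci_regs.h; the enums and both function names are
hypothetical and exist only for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Same values as PCI_ERR_COR_INTERNAL / PCI_ERR_UNC_INTN in pci_regs.h */
#define COR_INTERNAL 0x00004000u /* Corrected Internal Error status bit */
#define UNC_INTN     0x00400000u /* Uncorrectable Internal Error status bit */

/* Hypothetical stand-ins for the kernel's severity and recovery notions */
enum severity { SEV_CORRECTABLE, SEV_UNCORRECTABLE };
enum action { ACT_NONE, ACT_WARN_USER, ACT_BUS_RESET };

/* Pick the status bit that matches the register the status came from */
bool is_internal(enum severity sev, uint32_t status)
{
	if (sev == SEV_CORRECTABLE)
		return (status & COR_INTERNAL) != 0;
	return (status & UNC_INTN) != 0;
}

/*
 * Assumed policy from the ECC example: a corrected internal error only
 * warrants a heads-up to the user, an uncorrectable one asks for
 * recovery through a bus reset. Anything else is left alone here.
 */
enum action pick_action(enum severity sev, uint32_t status)
{
	if (!is_internal(sev, status))
		return ACT_NONE;
	return sev == SEV_CORRECTABLE ? ACT_WARN_USER : ACT_BUS_RESET;
}
```

The severity has to be checked first because the two bits live in
different status registers (correctable vs. uncorrectable), so the same
bit position means something else in the other register.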

My point is, there are valid (upcoming, not theoretical) use cases for
Internal Errors and creating infrastructure in the kernel to take advantage
of them is a good thing. Hence my continued pushing back on hiding or
discouraging their use.

Thanks,
Lukas