linux-kernel - Re: [PATCH v14 32/34] cxl: Update Endpoint uncorrectable protocol error handling

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <696813ae1d175_34d2a1001a@dwillia2-mobl4.notmuch>
Date: Wed, 14 Jan 2026 14:07:42 -0800
From: <dan.j.williams@...el.com>
To: Terry Bowman <terry.bowman@....com>, <dave@...olabs.net>,
	<jonathan.cameron@...wei.com>, <dave.jiang@...el.com>,
	<alison.schofield@...el.com>, <dan.j.williams@...el.com>,
	<bhelgaas@...gle.com>, <shiju.jose@...wei.com>, <ming.li@...omail.com>,
	<Smita.KoralahalliChannabasappa@....com>, <rrichter@....com>,
	<dan.carpenter@...aro.org>, <PradeepVineshReddy.Kodamati@....com>,
	<lukas@...ner.de>, <Benjamin.Cheatham@....com>,
	<sathyanarayanan.kuppuswamy@...ux.intel.com>, <linux-cxl@...r.kernel.org>,
	<vishal.l.verma@...el.com>, <alucerop@....com>, <ira.weiny@...el.com>
CC: <linux-kernel@...r.kernel.org>, <linux-pci@...r.kernel.org>,
	<terry.bowman@....com>
Subject: Re: [PATCH v14 32/34] cxl: Update Endpoint uncorrectable protocol
 error handling

Terry Bowman wrote:
> The CXL drivers must support handling Endpoint CXL and PCI uncorrectable
> (UCE) protocol errors. Update the drivers to support both.
> 
> Introduce cxl_pci_error_detected() to handle PCI correctable errors,
> replacing cxl_error_detected(). Implement this new function to call
> the existing CXL Port uncorrectable handler, cxl_port_error_detected().
> 
> Update cxl_port_error_detected() for Endpoint handling. Take the CXL
> memory device lock, check for a valid driver, and handle restricted
> CXL device (RCH) if needed. This is the same sequence initially in
> cxl_error_detected(). But, the UCE handler's logic for the returned
> result errors is simplified because recovery will not be tried and
> instead UCE's will result in the CXL driver invoking system panic.
> 
> Signed-off-by: Terry Bowman <terry.bowman@....com>
> 
> ---
> 
> Changes in v13->v14:
> - Update commit headline (Bjorn)
> - Rename pci_error_detected()/pci_cor_error_detected() ->
>   cxl_pci_error_detected/cxl_pci_cor_error_detected() (Jonathan)
> - Remove now-invalid comment in cxl_error_detected() (Jonathan)
> - Split into separate patches for UCE and CE (Terry)
> 
> Changes in v12->v13:
> - Update commit messaqge (Terry)
> - Updated all the implementation and commit message. (Terry)
> - Refactored cxl_cor_error_detected()/cxl_error_detected() to remove
>   pdev (Dave Jiang)
> 
> Changes in v11->v12:
> - None
> 
> Changes in v10->v11:
> - cxl_error_detected() - Change handlers' scoped_guard() to guard() (Jonathan)
> - cxl_error_detected() - Remove extra line (Shiju)
> - Changes moved to core/ras.c (Terry)
> - cxl_error_detected(), remove 'ue' and return with function call. (Jonathan)
> - Remove extra space in documentation for PCI_ERS_RESULT_PANIC definition
> - Move #include "pci.h from cxl.h to core.h (Terry)
> - Remove unnecessary includes of cxl.h and core.h in mem.c (Terry)
[..]
> diff --git a/drivers/cxl/core/ras.c b/drivers/cxl/core/ras.c
> index 96ce85cc0a46..dc6e02d64821 100644
> --- a/drivers/cxl/core/ras.c
> +++ b/drivers/cxl/core/ras.c
[..]
> @@ -373,55 +399,21 @@ void cxl_cor_error_detected(struct pci_dev *pdev)
>  }
>  EXPORT_SYMBOL_NS_GPL(cxl_cor_error_detected, "CXL");
>  
> -pci_ers_result_t cxl_error_detected(struct pci_dev *pdev,
> -				    pci_channel_state_t state)
> +pci_ers_result_t cxl_pci_error_detected(struct pci_dev *pdev,
> +					pci_channel_state_t error)
>  {
> -	struct cxl_dev_state *cxlds = pci_get_drvdata(pdev);
> -	struct cxl_memdev *cxlmd = cxlds->cxlmd;
> -	struct device *dev = &cxlmd->dev;
> -	bool ue;
> +	struct cxl_port *port __free(put_cxl_port) = get_cxl_port(pdev);
> +	pci_ers_result_t rc;
>  
> -	guard(device)(dev);
> +	guard(device)(&port->dev);
>  
> -	if (!dev->driver) {
> -		dev_warn(&pdev->dev,
> -			 "%s: memdev disabled, abort error handling\n",
> -			 dev_name(dev));
> -		return PCI_ERS_RESULT_DISCONNECT;
> -	}
> +	rc = cxl_port_error_detected(&pdev->dev);
> +	if (rc == PCI_ERS_RESULT_PANIC)
> +		panic("CXL cachemem error.");
[..]
> diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
> index acb0eb2a13c3..ff741adc7c7f 100644
> --- a/drivers/cxl/pci.c
> +++ b/drivers/cxl/pci.c
> @@ -1051,8 +1051,8 @@ static void cxl_reset_done(struct pci_dev *pdev)
>  	}
>  }
>  
> -static const struct pci_error_handlers cxl_error_handlers = {
> -	.error_detected	= cxl_error_detected,
> +static const struct pci_error_handlers pci_error_handlers = {
> +	.error_detected	= cxl_pci_error_detected,

I still feel like we are disconnected on the fundamental question of who
is responsible for invoking CXL protocol error handling.

To be clear, all of this:

  cxl/port: Remove "enumerate dports" helpers
  cxl/port: Fix devm resource leaks around with dport management
  cxl/port: Move dport operations to a driver event
  cxl/port: Move dport RAS reporting to a port resource
  cxl/port: Move endpoint component register management to cxl_port
  cxl/port: Unify endpoint and switch port lookup

Was with the intent that cxl_pci and any other driver that creates a
cxl_memdev never needs to worry about CXL protocol error handling. It
comes "for free" by registering a "struct cxl_memdev".

This is the rationale for "struct pci_dev" to grow an "is_cxl"
attribute, and for the PCI core to learn how to forward PCIE internal
errors on CXL devices to the CXL core.

The only errors that cxl_pci needs to worry about are non-internal /
native PCI errors. All CXL errors will have already been routed to the
CXL core for generic handling based on a port lookup.

So the end state I am looking for is no call to
cxl_port_error_detected() from any 'struct pci_error_handlers'
implementation. Untangle that ambiguity in the AER core and do not
inflict it on every CXL driver that comes after.

I think we are close to that outcome if not already there by simply
deleting this last cxl_pci_error_detected() -> cxl_port_error_detected()
"false dependency".

Now, if an endpoint driver ever thinks it can do anything sane with CXL
protocol error beyond what the core is already handling, then we can
think about complications like passing a cxl_port error handler
template. I struggle to think of a case like that.