[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20260203154039.000001ae@huawei.com>
Date: Tue, 3 Feb 2026 15:40:39 +0000
From: Jonathan Cameron <jonathan.cameron@...wei.com>
To: Terry Bowman <terry.bowman@....com>
CC: <dave@...olabs.net>, <dave.jiang@...el.com>, <alison.schofield@...el.com>,
<dan.j.williams@...el.com>, <bhelgaas@...gle.com>, <shiju.jose@...wei.com>,
<ming.li@...omail.com>, <Smita.KoralahalliChannabasappa@....com>,
<rrichter@....com>, <dan.carpenter@...aro.org>,
<PradeepVineshReddy.Kodamati@....com>, <lukas@...ner.de>,
<Benjamin.Cheatham@....com>, <sathyanarayanan.kuppuswamy@...ux.intel.com>,
<linux-cxl@...r.kernel.org>, <vishal.l.verma@...el.com>, <alucerop@....com>,
<ira.weiny@...el.com>, <linux-kernel@...r.kernel.org>,
<linux-pci@...r.kernel.org>
Subject: Re: [PATCH v15 5/9] PCI: Establish common CXL Port protocol error
flow
On Mon, 2 Feb 2026 20:52:40 -0600
Terry Bowman <terry.bowman@....com> wrote:
> Introduce CXL Port protocol error handling callbacks to unify detection,
> logging, and recovery across CXL Ports and Endpoints, including RCH
> downstream ports. Establish a consistent flow for correctable and
> uncorrectable CXL protocol errors.
>
> Provide the solution by adding cxl_port_cor_error_detected() and
> cxl_port_error_detected() to handle correctable and uncorrectable handling
> through CXL RAS helpers, coordinating uncorrectable recovery in
> cxl_do_recovery(), and panicking when the handler returns PCI_ERS_RESULT_PANIC
> to preserve fatal cachemem behavior. Gate endpoint handling on the endpoint
> driver being bound to avoid processing errors on disabled devices.
>
> Centralize the RAS base lookup in cxl_get_ras_base(), selecting the
> downstream-port dport->regs.ras for Root/Downstream Ports and port->regs.ras
> for Upstream Ports/Endpoints.
>
> Export pcie_clear_device_status() and pci_aer_clear_fatal_status() to enable
> cxl_core to clear PCIe/AER state in these flows.
>
> Signed-off-by: Terry Bowman <terry.bowman@....com>
> Acked-by: Bjorn Helgaas <bhelgaas@...gle.com>
> Reviewed-by: Dave Jiang dave.jiang@...el.com
Hi Terry,
A few comments inline.
Thanks,
Jonathan
>
> ---
>
> Changes in v14->v15:
> - Update commit message and title. Added Bjorn's ack.
> - Move CE and UCE handling logic here
>
> Changes in v13->v14:
> - Add Dave Jiang's review-by
> - Update commit message & headline (Bjorn)
> - Refactor cxl_port_error_detected()/cxl_port_cor_error_detected() to
> one line (Jonathan)
> - Remove cxl_walk_port() (Dan)
> - Remove cxl_pci_drv_bound(). Check for 'is_cxl' parent port is
> sufficient (Dan)
> - Remove device_lock_if()
> - Combined CE and UCE here (Terry)
>
> Changes in v12->v13:
> - Move get_pci_cxl_host_dev() and cxl_handle_proto_error() to Dequeue
> patch (Terry)
> - Remove EP case in cxl_get_ras_base(), not used. (Terry)
> - Remove check for dport->dport_dev (Dave)
> - Remove whitespace (Terry)
>
> Changes in v11->v12:
> - Add call to cxl_pci_drv_bound() in cxl_handle_proto_error() and
> pci_to_cxl_dev()
> - Change cxl_error_detected() -> cxl_cor_error_detected()
> - Remove NULL variable assignments
> - Replace bus_find_device() with find_cxl_port_by_uport() for upstream
> port searches.
>
> Changes in v10->v11:
> - None
> ---
> drivers/cxl/core/ras.c | 134 +++++++++++++++++++++++++++++++++++++++++
> drivers/pci/pci.c | 1 +
> drivers/pci/pci.h | 2 -
> drivers/pci/pcie/aer.c | 1 +
> include/linux/aer.h | 2 +
> include/linux/pci.h | 2 +
> 6 files changed, 140 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/cxl/core/ras.c b/drivers/cxl/core/ras.c
> index a6c0bc6d7203..0216dafa6118 100644
> --- a/drivers/cxl/core/ras.c
> +++ b/drivers/cxl/core/ras.c
> @@ -218,6 +218,68 @@ static struct cxl_port *get_cxl_port(struct pci_dev *pdev)
> return NULL;
> }
>
> +static void __iomem *cxl_get_ras_base(struct device *dev)
> +{
> + struct pci_dev *pdev = to_pci_dev(dev);
> +
> + switch (pci_pcie_type(pdev)) {
> + case PCI_EXP_TYPE_ROOT_PORT:
> + case PCI_EXP_TYPE_DOWNSTREAM:
> + {
> + struct cxl_dport *dport;
struct cxl_dport *dport = NULL;
> + struct cxl_port *port __free(put_cxl_port) = find_cxl_port(&pdev->dev, &dport);
as if this failed, dport is not written. Alternative is check port, not dport as port
will always be initialized whether or not failure occurs in find_cxl_port()
> +
> + if (!dport) {
> + pci_err(pdev, "Failed to find the CXL device");
> + return NULL;
> + }
> + return dport->regs.ras;
> + }
> + case PCI_EXP_TYPE_UPSTREAM:
> + case PCI_EXP_TYPE_ENDPOINT:
> + {
> + struct cxl_port *port __free(put_cxl_port) = find_cxl_port_by_uport(&pdev->dev);
> +
> + if (!port) {
> + pci_err(pdev, "Failed to find the CXL device");
> + return NULL;
> + }
> + return port->regs.ras;
> + }
> + }
> + dev_warn_once(dev, "Error: Unsupported device type (%#x)", pci_pcie_type(pdev));
> + return NULL;
> +}
> void cxl_handle_cor_ras(struct device *dev, u64 serial, void __iomem *ras_base)
> {
> void __iomem *addr;
> @@ -288,6 +350,60 @@ bool cxl_handle_ras(struct device *dev, u64 serial, void __iomem *ras_base)
> return true;
> }
>
> +static void cxl_port_cor_error_detected(struct device *dev)
> +{
> + struct pci_dev *pdev = to_pci_dev(dev);
> + struct cxl_port *port __free(put_cxl_port) = get_cxl_port(pdev);
> +
> + if (is_cxl_endpoint(port)) {
> + struct cxl_memdev *cxlmd = to_cxl_memdev(port->uport_dev);
> + struct cxl_dev_state *cxlds = cxlmd->cxlds;
> +
> + guard(device)(&cxlmd->dev);
Maybe add a comment on why this lock needs to be held and then why the dev->drvier
below needs to be true.
> +
> + if (!dev->driver) {
> + dev_warn(&pdev->dev,
> + "%s: memdev disabled, abort error handling\n",
> + dev_name(dev));
Same question as below on why pdev->dev / dev_name(dev) here.
Maybe pci_warn() is more appropriate.
> + return;
> + }
> +
> + if (cxlds->rcd)
> + cxl_handle_rdport_errors(cxlds);
> +
> + cxl_handle_cor_ras(dev, cxlds->serial, cxl_get_ras_base(dev));
> + } else {
> + cxl_handle_cor_ras(dev, 0, cxl_get_ras_base(dev));
> + }
> +}
> +
> +static pci_ers_result_t cxl_port_error_detected(struct device *dev)
> +{
> + struct pci_dev *pdev = to_pci_dev(dev);
> + struct cxl_port *port __free(put_cxl_port) = get_cxl_port(pdev);
> +
> + if (is_cxl_endpoint(port)) {
> + struct cxl_memdev *cxlmd = to_cxl_memdev(port->uport_dev);
> + struct cxl_dev_state *cxlds = cxlmd->cxlds;
> +
> + guard(device)(&cxlmd->dev);
> +
> + if (!dev->driver) {
> + dev_warn(&pdev->dev,
Somewhat circular. dev_warn() will print the device name anyway I think and
this pdev->dev == dev here so might as well use that.
Or was intent to use different devices?
> + "%s: memdev disabled, abort error handling\n",
> + dev_name(dev));
> + return PCI_ERS_RESULT_NONE;
> + }
> +
> + if (cxlds->rcd)
> + cxl_handle_rdport_errors(cxlds);
> +
> + return cxl_handle_ras(dev, cxlds->serial, cxl_get_ras_base(dev));
> + } else {
> + return cxl_handle_ras(dev, 0, cxl_get_ras_base(dev));
> + }
> +}
>
> static void cxl_proto_err_work_fn(struct work_struct *work)
Powered by blists - more mailing lists