[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <67869e8427ff9_186d9b29437@iweiny-mobl.notmuch>
Date: Tue, 14 Jan 2025 11:27:32 -0600
From: Ira Weiny <ira.weiny@...el.com>
To: Terry Bowman <terry.bowman@....com>, <linux-cxl@...r.kernel.org>,
<linux-kernel@...r.kernel.org>, <linux-pci@...r.kernel.org>,
<nifan.cxl@...il.com>, <dave@...olabs.net>, <jonathan.cameron@...wei.com>,
<dave.jiang@...el.com>, <alison.schofield@...el.com>,
<vishal.l.verma@...el.com>, <dan.j.williams@...el.com>,
<bhelgaas@...gle.com>, <mahesh@...ux.ibm.com>, <ira.weiny@...el.com>,
<oohall@...il.com>, <Benjamin.Cheatham@....com>, <rrichter@....com>,
<nathan.fontenot@....com>, <Smita.KoralahalliChannabasappa@....com>,
<lukas@...ner.de>, <ming.li@...omail.com>,
<PradeepVineshReddy.Kodamati@....com>, <alucerop@....com>
Subject: Re: [PATCH v5 07/16] PCI/AER: Add CXL PCIe Port uncorrectable error
recovery in AER service driver
Terry Bowman wrote:
> Existing recovery procedure for PCIe uncorrectable errors (UCE) does not
> apply to CXL devices. Recovery can not be used for CXL devices because of
> potential corruption on what can be system memory. Also, current PCIe UCE
> recovery, in the case of a Root Port (RP) or Downstream Switch Port (DSP),
> does not begin at the RP/DSP but begins at the first downstream device.
> This will miss handling CXL Protocol Errors in a CXL RP or DSP. A separate
> CXL recovery is needed because of the different handling requirements
>
> Add a new function, cxl_do_recovery() using the following.
>
> Add cxl_walk_bridge() to iterate the detected error's sub-topology.
> cxl_walk_bridge() is similar to pci_walk_bridge() but the CXL flavor
> will begin iteration at the RP or DSP rather than beginning at the
> first downstream device.
>
> Add cxl_report_error_detected() as an analog to report_error_detected().
> It will call pci_driver::cxl_err_handlers for each iterated downstream
> device. The pci_driver::cxl_err_handler's UCE handler returns a boolean
> indicating if there was a UCE error detected during handling.
>
> cxl_do_recovery() uses the status from cxl_report_error_detected() to
> determine how to proceed. Non-fatal CXL UCE errors will be treated as
> fatal. If a UCE was present during handling then cxl_do_recovery()
> will kernel panic.
>
> Signed-off-by: Terry Bowman <terry.bowman@....com>
> ---
[snip]
> +
> +static int cxl_report_error_detected(struct pci_dev *dev, void *data)
> +{
> + const struct cxl_error_handlers *cxl_err_handler;
> + struct pci_driver *pdrv = dev->driver;
> + bool *status = data;
> +
> + device_lock(&dev->dev);
> + if (pdrv && pdrv->cxl_err_handler &&
> + pdrv->cxl_err_handler->error_detected) {
> + cxl_err_handler = pdrv->cxl_err_handler;
> + *status = cxl_err_handler->error_detected(dev);
> + }
> + device_unlock(&dev->dev);
> + return *status;
This is probably just another nit on my part but returning bool here for
int may cause issues down the road.
Looking at this I wonder if it would be better to add *_PANIC to
pci_ers_result_t and return that similar to report_error_detected()?
> +}
> +
> +void cxl_do_recovery(struct pci_dev *dev)
> +{
> + struct pci_host_bridge *host = pci_find_host_bridge(dev->bus);
> + int type = pci_pcie_type(dev);
> + struct pci_dev *bridge;
> + int status;
> +
> + if (type == PCI_EXP_TYPE_ROOT_PORT ||
> + type == PCI_EXP_TYPE_DOWNSTREAM ||
> + type == PCI_EXP_TYPE_UPSTREAM ||
> + type == PCI_EXP_TYPE_ENDPOINT)
> + bridge = dev;
> + else
> + bridge = pci_upstream_bridge(dev);
> +
> + cxl_walk_bridge(bridge, cxl_report_error_detected, &status);
> + if (status)
> + panic("CXL cachemem error.");
> +
> + if (host->native_aer || pcie_ports_native) {
> + pcie_clear_device_status(dev);
> + pci_aer_clear_nonfatal_status(dev);
> + }
There is a nice informative comment in pcie_do_recovery() about this
block. I think we should combine this and that block into a new function
which preserves that for both paths.
Ira
> +
> + pci_info(bridge, "CXL uncorrectable error.\n");
> +}
> --
> 2.34.1
>
Powered by blists - more mailing lists