[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <109042a0-3c8c-4908-8efa-be07026cf9b9@amd.com>
Date: Tue, 18 Feb 2025 09:43:06 -0600
From: "Bowman, Terry" <terry.bowman@....com>
To: Jonathan Cameron <Jonathan.Cameron@...wei.com>
Cc: linux-cxl@...r.kernel.org, linux-kernel@...r.kernel.org,
linux-pci@...r.kernel.org, nifan.cxl@...il.com, dave@...olabs.net,
dave.jiang@...el.com, alison.schofield@...el.com, vishal.l.verma@...el.com,
dan.j.williams@...el.com, bhelgaas@...gle.com, mahesh@...ux.ibm.com,
ira.weiny@...el.com, oohall@...il.com, Benjamin.Cheatham@....com,
rrichter@....com, nathan.fontenot@....com,
Smita.KoralahalliChannabasappa@....com, lukas@...ner.de,
ming.li@...omail.com, PradeepVineshReddy.Kodamati@....com
Subject: Re: [PATCH v7 06/17] PCI/AER: Add CXL PCIe Port uncorrectable error
recovery in AER service driver
On 2/14/2025 9:11 AM, Jonathan Cameron wrote:
> On Tue, 11 Feb 2025 13:24:33 -0600
> Terry Bowman <terry.bowman@....com> wrote:
>
>> Existing recovery procedure for PCIe uncorrectable errors (UCE) does not
>> apply to CXL devices. Recovery can not be used for CXL devices because of
>> potential corruption on what can be system memory. Also, current PCIe UCE
>> recovery, in the case of a Root Port (RP) or Downstream Switch Port (DSP),
>> does not begin at the RP/DSP but begins at the first downstream device.
>> This will miss handling CXL Protocol Errors in a CXL RP or DSP. A separate
>> CXL recovery is needed because of the different handling requirements
>>
>> Add a new function, cxl_do_recovery() using the following.
>>
>> Add cxl_walk_bridge() to iterate the detected error's sub-topology.
>> cxl_walk_bridge() is similar to pci_walk_bridge() but the CXL flavor
>> will begin iteration at the RP or DSP rather than beginning at the
> Hi Terry,
>
> Trivial nitpick but you wrap point is shrinking wrt to the previous paragraph.
> Just looks odd rather than actually mattering :)
I'll take more notice of this in the next revision. Thanks for the feedback.
>> first downstream device.
>>
>> pci_walk_bridge() is candidate to possibly reuse cxl_walk_bridge() but
>> needs further investigation. This will be left for future improvement
>> to make the CXL and PCI handling paths more common.
>>
>> Add cxl_report_error_detected() as an analog to report_error_detected().
>> It will call pci_driver::cxl_err_handlers for each iterated downstream
>> device. The pci_driver::cxl_err_handler's UCE handler returns a boolean
>> indicating if there was a UCE error detected during handling.
>>
>> cxl_do_recovery() uses the status from cxl_report_error_detected() to
>> determine how to proceed. Non-fatal CXL UCE errors will be treated as
>> fatal. If a UCE was present during handling then cxl_do_recovery()
>> will kernel panic.
>>
>> Signed-off-by: Terry Bowman <terry.bowman@....com>
> One trivial suggestion inline. Probably something for another day!
>
> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@...wei.com>
Thanks Jonathan.
>> ---
>> drivers/pci/pci.h | 3 +++
>> drivers/pci/pcie/aer.c | 4 +++
>> drivers/pci/pcie/err.c | 58 ++++++++++++++++++++++++++++++++++++++++++
>> include/linux/pci.h | 3 +++
>> 4 files changed, 68 insertions(+)
>>
>> diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
>> index 01e51db8d285..deb193b387af 100644
>> --- a/drivers/pci/pci.h
>> +++ b/drivers/pci/pci.h
>> @@ -722,6 +722,9 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
>> pci_channel_state_t state,
>> pci_ers_result_t (*reset_subordinates)(struct pci_dev *pdev));
>>
>> +/* CXL error reporting and handling */
>> +void cxl_do_recovery(struct pci_dev *dev);
>> +
>> bool pcie_wait_for_link(struct pci_dev *pdev, bool active);
>> int pcie_retrain_link(struct pci_dev *pdev, bool use_lt);
>>
>> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
>> index 34ec0958afff..ee38db08d005 100644
>> --- a/drivers/pci/pcie/aer.c
>> +++ b/drivers/pci/pcie/aer.c
>> @@ -1012,6 +1012,8 @@ static int cxl_rch_handle_error_iter(struct pci_dev *dev, void *data)
>> err_handler->error_detected(dev, pci_channel_io_normal);
>> else if (info->severity == AER_FATAL)
>> err_handler->error_detected(dev, pci_channel_io_frozen);
>> +
>> + cxl_do_recovery(dev);
>> }
>> out:
>> device_unlock(&dev->dev);
>> @@ -1041,6 +1043,8 @@ static void cxl_handle_error(struct pci_dev *dev, struct aer_err_info *info)
>> pdrv->cxl_err_handler->cor_error_detected(dev);
>>
>> pcie_clear_device_status(dev);
>> + } else {
>> + cxl_do_recovery(dev);
>> }
>> }
>>
>> diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
>> index 31090770fffc..05f2d1ef4c36 100644
>> --- a/drivers/pci/pcie/err.c
>> +++ b/drivers/pci/pcie/err.c
>> @@ -24,6 +24,9 @@
>> static pci_ers_result_t merge_result(enum pci_ers_result orig,
>> enum pci_ers_result new)
>> {
>> + if (new == PCI_ERS_RESULT_PANIC)
>> + return PCI_ERS_RESULT_PANIC;
>> +
>> if (new == PCI_ERS_RESULT_NO_AER_DRIVER)
>> return PCI_ERS_RESULT_NO_AER_DRIVER;
>>
>> @@ -276,3 +279,58 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
>>
>> return status;
>> }
>> +
>> +static void cxl_walk_bridge(struct pci_dev *bridge,
>> + int (*cb)(struct pci_dev *, void *),
>> + void *userdata)
>> +{
>> + if (cb(bridge, userdata))
>> + return;
>> +
>> + if (bridge->subordinate)
>> + pci_walk_bus(bridge->subordinate, cb, userdata);
>> +}
>> +
>> +static int cxl_report_error_detected(struct pci_dev *dev, void *data)
>> +{
>> + const struct cxl_error_handlers *cxl_err_handler;
>> + pci_ers_result_t vote, *result = data;
>> + struct pci_driver *pdrv;
>> +
>> + device_lock(&dev->dev);
> Could use
> guard(device)(&dev->dev);
>
>> + pdrv = dev->driver;
>> + if (!pdrv || !pdrv->cxl_err_handler ||
>> + !pdrv->cxl_err_handler->error_detected)
>> + goto out;
> allowing you to return here.
>
> Same approach would simplify the rch code as well.
Yes, I'll change to use a guard() here. I'll have to use the CXL device (not the PCI
device) as Dan pointed out. Also, this will be moved out of AER driver and into CXL core.
Regards,
Terry
>> +
>> + cxl_err_handler = pdrv->cxl_err_handler;
>> + vote = cxl_err_handler->error_detected(dev);
>> + *result = merge_result(*result, vote);
>> +out:
>> + device_unlock(&dev->dev);
>> + return 0;
>> +}
>> +
>> +void cxl_do_recovery(struct pci_dev *dev)
>> +{
>> + struct pci_host_bridge *host = pci_find_host_bridge(dev->bus);
>> + pci_ers_result_t status = PCI_ERS_RESULT_CAN_RECOVER;
>> +
>> + cxl_walk_bridge(dev, cxl_report_error_detected, &status);
>> + if (status == PCI_ERS_RESULT_PANIC)
>> + panic("CXL cachemem error.");
>> +
>> + /*
>> + * If we have native control of AER, clear error status in the device
>> + * that detected the error. If the platform retained control of AER,
>> + * it is responsible for clearing this status. In that case, the
>> + * signaling device may not even be visible to the OS.
>> + */
>> + if (host->native_aer || pcie_ports_native) {
>> + pcie_clear_device_status(dev);
>> + pci_aer_clear_nonfatal_status(dev);
>> + pci_aer_clear_fatal_status(dev);
>> + }
>> +
>> + pci_info(dev, "CXL uncorrectable error.\n");
>> +}
>
Powered by blists - more mailing lists