[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20250612123650.0000321b@huawei.com>
Date: Thu, 12 Jun 2025 12:36:50 +0100
From: Jonathan Cameron <Jonathan.Cameron@...wei.com>
To: Terry Bowman <terry.bowman@....com>
CC: <PradeepVineshReddy.Kodamati@....com>, <dave@...olabs.net>,
<dave.jiang@...el.com>, <alison.schofield@...el.com>,
<vishal.l.verma@...el.com>, <ira.weiny@...el.com>,
<dan.j.williams@...el.com>, <bhelgaas@...gle.com>, <bp@...en8.de>,
<ming.li@...omail.com>, <shiju.jose@...wei.com>, <dan.carpenter@...aro.org>,
<Smita.KoralahalliChannabasappa@....com>, <kobayashi.da-06@...itsu.com>,
<yanfei.xu@...el.com>, <rrichter@....com>, <peterz@...radead.org>,
<colyli@...e.de>, <uaisheng.ye@...el.com>,
<fabio.m.de.francesco@...ux.intel.com>, <ilpo.jarvinen@...ux.intel.com>,
<yazen.ghannam@....com>, <linux-cxl@...r.kernel.org>,
<linux-kernel@...r.kernel.org>, <linux-pci@...r.kernel.org>
Subject: Re: [PATCH v9 04/16] PCI/AER: Dequeue forwarded CXL error
On Tue, 3 Jun 2025 12:22:27 -0500
Terry Bowman <terry.bowman@....com> wrote:
> The AER driver is now designed to forward CXL protocol errors to the CXL
> driver. Update the CXL driver with functionality to dequeue the forwarded
> CXL error from the kfifo. Also, update the CXL driver to begin the protocol
> error handling processing using the work received from the FIFO.
>
> Introduce function cxl_prot_err_work_fn() to dequeue work forwarded by the
> AER service driver. This will begin the CXL protocol error processing
> with a call to cxl_handle_prot_error().
>
> Update cxl/core/ras.c by adding cxl_rch_handle_error_iter() that was
> previously in the AER driver.
>
> Introduce sbdf_to_pci() to take the SBDF values from 'struct cxl_prot_error_info'
> and use in discovering the erring PCI device. Make scope based reference
> increments/decrements for the discovered PCI device and the associated
> CXL device.
>
> Implement cxl_handle_prot_error() to differentiate between Restricted CXL
> Host (RCH) protocol errors and CXL virtual host (VH) protocol errors.
> RCH errors will be processed with a call to walk the associated Root
> Complex Event Collector's (RCEC) secondary bus looking for the Root Complex
> Integrated Endpoint (RCiEP) to handle the RCH error. Export pcie_walk_rcec()
> so the CXL driver can walk the RCEC's downstream bus, searching for
> the RCiEP.
>
> VH correctable error (CE) processing will call the CXL CE handler. VH
> uncorrectable errors (UCE) will call cxl_do_recovery(), implemented as a
> stub for now and to be updated in future patch. Export pci_aer_clean_fatal_status()
> and pci_clean_device_status() used to clean up AER status after handling.
>
> Signed-off-by: Terry Bowman <terry.bowman@....com>
Hopefully I haven't duplicated existing feedback. A few more minor things
inline.
> +
> +static struct pci_dev *sbdf_to_pci(struct cxl_prot_error_info *err_info)
That's an odd function name as the sbdf is buried.
Unless it's going to get a lot of use, I'd just put the lookup
inline and not have a 'hard to name' function. With other suggested
changes you only need (at where this is currently called)
struct pci_dev *pdev __free(pci_dev_put) =
pci_get_domain_bus_and_slot(err_info->segment, err_info->bus,
err_info->devfn);
> +{
> + unsigned int devfn = PCI_DEVFN(err_info->device,
> + err_info->function);
This makes me wonder if you should have just use devfn inside the err_info.
Is there a good reason to split them up before storing them?
IIRC ARI makes a mess anyway of their meaning when separate.
> + struct pci_dev *pdev __free(pci_dev_put) =
> + pci_get_domain_bus_and_slot(err_info->segment,
> + err_info->bus,
> + devfn);
> + return pdev;
> +}
> +
> +static void cxl_handle_prot_error(struct cxl_prot_error_info *err_info)
> +{
> + struct pci_dev *pdev __free(pci_dev_put) = pci_dev_get(sbdf_to_pci(err_info));
> +
> + if (!pdev) {
> + pr_err("Failed to find the CXL device\n");
> + return;
> + }
> +
> + /*
> + * Internal errors of an RCEC indicate an AER error in an
> + * RCH's downstream port. Check and handle them in the CXL.mem
> + * device driver.
> + */
> + if (pci_pcie_type(pdev) == PCI_EXP_TYPE_RC_EC)
> + return pcie_walk_rcec(pdev, cxl_rch_handle_error_iter, err_info);
> +
> + if (err_info->severity == AER_CORRECTABLE) {
> + int aer = pdev->aer_cap;
> + struct cxl_dev_state *cxlds = pci_get_drvdata(pdev);
> + struct device *dev __free(put_device) = get_device(&cxlds->cxlmd->dev);
> +
> + if (aer)
> + pci_clear_and_set_config_dword(pdev,
> + aer + PCI_ERR_COR_STATUS,
> + 0, PCI_ERR_COR_INTERNAL);
> +
> + cxl_cor_error_detected(pdev);
> +
> + pcie_clear_device_status(pdev);
> + } else {
> + cxl_do_recovery(pdev);
> + }
> +}
> +
> static void cxl_prot_err_work_fn(struct work_struct *work)
> {
> + struct cxl_prot_err_work_data wd;
> +
> + while (cxl_prot_err_kfifo_get(&wd)) {
> + struct cxl_prot_error_info *err_info = &wd.err_info;
> +
> + cxl_handle_prot_error(err_info);
I'm not seeing value in the local variable.
Ignore if this code changes later and that makes more sense!
> + }
> }
>
> #else
Powered by blists - more mailing lists