[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <4db30968-42a4-449c-9269-4817e4c89a46@intel.com>
Date: Mon, 9 Jun 2025 13:17:23 -0700
From: Dave Jiang <dave.jiang@...el.com>
To: "Bowman, Terry" <terry.bowman@....com>,
PradeepVineshReddy.Kodamati@....com, dave@...olabs.net,
jonathan.cameron@...wei.com, alison.schofield@...el.com,
vishal.l.verma@...el.com, ira.weiny@...el.com, dan.j.williams@...el.com,
bhelgaas@...gle.com, bp@...en8.de, ming.li@...omail.com,
shiju.jose@...wei.com, dan.carpenter@...aro.org,
Smita.KoralahalliChannabasappa@....com, kobayashi.da-06@...itsu.com,
yanfei.xu@...el.com, rrichter@....com, peterz@...radead.org,
coly.li@...e.de, uaisheng.ye@...el.com,
fabio.m.de.francesco@...ux.intel.com, ilpo.jarvinen@...ux.intel.com,
yazen.ghannam@....com, linux-cxl@...r.kernel.org,
linux-kernel@...r.kernel.org, linux-pci@...r.kernel.org
Subject: Re: [PATCH v9 04/16] PCI/AER: Dequeue forwarded CXL error
On 6/6/25 4:15 PM, Bowman, Terry wrote:
>
>
> On 6/6/2025 10:57 AM, Dave Jiang wrote:
>>
>> On 6/3/25 10:22 AM, Terry Bowman wrote:
>>> The AER driver is now designed to forward CXL protocol errors to the CXL
>>> driver. Update the CXL driver with functionality to dequeue the forwarded
>>> CXL error from the kfifo. Also, update the CXL driver to begin the protocol
>>> error handling processing using the work received from the FIFO.
>>>
>>> Introduce function cxl_prot_err_work_fn() to dequeue work forwarded by the
>>> AER service driver. This will begin the CXL protocol error processing
>>> with a call to cxl_handle_prot_error().
>>>
>>> Update cxl/core/ras.c by adding cxl_rch_handle_error_iter() that was
>>> previously in the AER driver.
>>>
>>> Introduce sbdf_to_pci() to take the SBDF values from 'struct cxl_prot_error_info'
>>> and use in discovering the erring PCI device. Make scope based reference
>>> increments/decrements for the discovered PCI device and the associated
>>> CXL device.
>>>
>>> Implement cxl_handle_prot_error() to differentiate between Restricted CXL
>>> Host (RCH) protocol errors and CXL virtual host (VH) protocol errors.
>>> RCH errors will be processed with a call to walk the associated Root
>>> Complex Event Collector's (RCEC) secondary bus looking for the Root Complex
>>> Integrated Endpoint (RCiEP) to handle the RCH error. Export pcie_walk_rcec()
>>> so the CXL driver can walk the RCEC's downstream bus, searching for
>>> the RCiEP.
>>>
>>> VH correctable error (CE) processing will call the CXL CE handler. VH
>>> uncorrectable errors (UCE) will call cxl_do_recovery(), implemented as a
>>> stub for now and to be updated in future patch. Export pci_aer_clean_fatal_status()
>>> and pci_clean_device_status() used to clean up AER status after handling.
>>>
>>> Signed-off-by: Terry Bowman <terry.bowman@....com>
>>> ---
>>> drivers/cxl/core/ras.c | 92 +++++++++++++++++++++++++++++++++++++++++
>>> drivers/pci/pci.c | 1 +
>>> drivers/pci/pci.h | 8 ----
>>> drivers/pci/pcie/aer.c | 1 +
>>> drivers/pci/pcie/rcec.c | 1 +
>>> include/linux/aer.h | 2 +
>>> include/linux/pci.h | 10 +++++
>>> 7 files changed, 107 insertions(+), 8 deletions(-)
>>>
>>> diff --git a/drivers/cxl/core/ras.c b/drivers/cxl/core/ras.c
>>> index d35525e79e04..9ed5c682e128 100644
>>> --- a/drivers/cxl/core/ras.c
>>> +++ b/drivers/cxl/core/ras.c
>>> @@ -110,8 +110,100 @@ static DECLARE_WORK(cxl_cper_prot_err_work, cxl_cper_prot_err_work_fn);
>>>
>>> #ifdef CONFIG_PCIEAER_CXL
>>>
>>> +static void cxl_do_recovery(struct pci_dev *pdev)
>>> +{
>>> +}
>>> +
>>> +static int cxl_rch_handle_error_iter(struct pci_dev *pdev, void *data)
>>> +{
>>> + struct cxl_prot_error_info *err_info = data;
>>> + struct pci_dev *pdev_ref __free(pci_dev_put) = pci_dev_get(pdev);
>>> + struct cxl_dev_state *cxlds;
>>> +
>>> + /*
>>> + * The capability, status, and control fields in Device 0,
>>> + * Function 0 DVSEC control the CXL functionality of the
>>> + * entire device (CXL 3.0, 8.1.3).
>>> + */
>>> + if (pdev->devfn != PCI_DEVFN(0, 0))
>>> + return 0;
>>> +
>>> + /*
>>> + * CXL Memory Devices must have the 502h class code set (CXL
>>> + * 3.0, 8.1.12.1).
>>> + */
>>> + if ((pdev->class >> 8) != PCI_CLASS_MEMORY_CXL)
>> Should use FIELD_GET() to be consistent with the rest of CXL code base
>>
>>> + return 0;
>>> +
>>> + if (!is_cxl_memdev(&pdev->dev) || !pdev->dev.driver)
>> I think you need to hold the pdev->dev lock while checking if the driver exists.
> Hi Dave,
>
> Wouldn't a reference count increment prevent the driver from being unbound and thus
> make this access here to the driver safe (given the pci_dev_get() above)? And a lock
> would prevent concurrent access with a busy wait when the driver executes the next
> lock take?
Actually nothing prevents a driver from being unbound unless you are holding the device lock. Because device core needs the device lock in order to call driver removal [1]. So if you acquire the lock, either the driver is still bound and you are ok, or the driver is gone and there's nothing to do. The incremented refcount prevents ->release() of the device and the memory allocated for the device from being freed based on kref behavior [2].
[1]: https://elixir.bootlin.com/linux/v6.15.1/source/drivers/base/dd.c#L1292
[2]: https://elixir.bootlin.com/linux/v6.15.1/source/include/linux/kref.h#L48
DJ
>
> Terry
>
> [snip]
>
Powered by blists - more mailing lists