linux-kernel - Re: [PATCH v9 04/16] PCI/AER: Dequeue forwarded CXL error

Message-ID: <4db30968-42a4-449c-9269-4817e4c89a46@intel.com>
Date: Mon, 9 Jun 2025 13:17:23 -0700
From: Dave Jiang <dave.jiang@...el.com>
To: "Bowman, Terry" <terry.bowman@....com>,
 PradeepVineshReddy.Kodamati@....com, dave@...olabs.net,
 jonathan.cameron@...wei.com, alison.schofield@...el.com,
 vishal.l.verma@...el.com, ira.weiny@...el.com, dan.j.williams@...el.com,
 bhelgaas@...gle.com, bp@...en8.de, ming.li@...omail.com,
 shiju.jose@...wei.com, dan.carpenter@...aro.org,
 Smita.KoralahalliChannabasappa@....com, kobayashi.da-06@...itsu.com,
 yanfei.xu@...el.com, rrichter@....com, peterz@...radead.org,
 coly.li@...e.de, uaisheng.ye@...el.com,
 fabio.m.de.francesco@...ux.intel.com, ilpo.jarvinen@...ux.intel.com,
 yazen.ghannam@....com, linux-cxl@...r.kernel.org,
 linux-kernel@...r.kernel.org, linux-pci@...r.kernel.org
Subject: Re: [PATCH v9 04/16] PCI/AER: Dequeue forwarded CXL error



On 6/6/25 4:15 PM, Bowman, Terry wrote:
> 
> 
> On 6/6/2025 10:57 AM, Dave Jiang wrote:
>>
>> On 6/3/25 10:22 AM, Terry Bowman wrote:
>>> The AER driver is now designed to forward CXL protocol errors to the CXL
>>> driver. Update the CXL driver with functionality to dequeue the forwarded
>>> CXL error from the kfifo. Also, update the CXL driver to begin the protocol
>>> error handling processing using the work received from the FIFO.
>>>
>>> Introduce function cxl_prot_err_work_fn() to dequeue work forwarded by the
>>> AER service driver. This will begin the CXL protocol error processing
>>> with a call to cxl_handle_prot_error().
>>>
>>> Update cxl/core/ras.c by adding cxl_rch_handle_error_iter() that was
>>> previously in the AER driver.
>>>
>>> Introduce sbdf_to_pci() to take the SBDF values from 'struct cxl_prot_error_info'
>>> and use in discovering the erring PCI device. Make scope based reference
>>> increments/decrements for the discovered PCI device and the associated
>>> CXL device.
>>>
>>> Implement cxl_handle_prot_error() to differentiate between Restricted CXL
>>> Host (RCH) protocol errors and CXL virtual host (VH) protocol errors.
>>> RCH errors will be processed with a call to walk the associated Root
>>> Complex Event Collector's (RCEC) secondary bus looking for the Root Complex
>>> Integrated Endpoint (RCiEP) to handle the RCH error. Export pcie_walk_rcec()
>>> so the CXL driver can walk the RCEC's downstream bus, searching for
>>> the RCiEP.
>>>
>>> VH correctable error (CE) processing will call the CXL CE handler. VH
>>> uncorrectable errors (UCE) will call cxl_do_recovery(), implemented as a
>>> stub for now and to be updated in future patch. Export pci_aer_clean_fatal_status()
>>> and pci_clean_device_status() used to clean up AER status after handling.
>>>
>>> Signed-off-by: Terry Bowman <terry.bowman@....com>
>>> ---
>>>  drivers/cxl/core/ras.c  | 92 +++++++++++++++++++++++++++++++++++++++++
>>>  drivers/pci/pci.c       |  1 +
>>>  drivers/pci/pci.h       |  8 ----
>>>  drivers/pci/pcie/aer.c  |  1 +
>>>  drivers/pci/pcie/rcec.c |  1 +
>>>  include/linux/aer.h     |  2 +
>>>  include/linux/pci.h     | 10 +++++
>>>  7 files changed, 107 insertions(+), 8 deletions(-)
>>>
>>> diff --git a/drivers/cxl/core/ras.c b/drivers/cxl/core/ras.c
>>> index d35525e79e04..9ed5c682e128 100644
>>> --- a/drivers/cxl/core/ras.c
>>> +++ b/drivers/cxl/core/ras.c
>>> @@ -110,8 +110,100 @@ static DECLARE_WORK(cxl_cper_prot_err_work, cxl_cper_prot_err_work_fn);
>>>  
>>>  #ifdef CONFIG_PCIEAER_CXL
>>>  
>>> +static void cxl_do_recovery(struct pci_dev *pdev)
>>> +{
>>> +}
>>> +
>>> +static int cxl_rch_handle_error_iter(struct pci_dev *pdev, void *data)
>>> +{
>>> +	struct cxl_prot_error_info *err_info = data;
>>> +	struct pci_dev *pdev_ref __free(pci_dev_put) = pci_dev_get(pdev);
>>> +	struct cxl_dev_state *cxlds;
>>> +
>>> +	/*
>>> +	 * The capability, status, and control fields in Device 0,
>>> +	 * Function 0 DVSEC control the CXL functionality of the
>>> +	 * entire device (CXL 3.0, 8.1.3).
>>> +	 */
>>> +	if (pdev->devfn != PCI_DEVFN(0, 0))
>>> +		return 0;
>>> +
>>> +	/*
>>> +	 * CXL Memory Devices must have the 502h class code set (CXL
>>> +	 * 3.0, 8.1.12.1).
>>> +	 */
>>> +	if ((pdev->class >> 8) != PCI_CLASS_MEMORY_CXL)
>> Should use FIELD_GET() to be consistent with the rest of CXL code base
>>
>>> +		return 0;
>>> +
>>> +	if (!is_cxl_memdev(&pdev->dev) || !pdev->dev.driver)
>> I think you need to hold the pdev->dev lock while checking if the driver exists.
> Hi Dave,
> 
> Wouldn't a reference count increment prevent the driver from being unbound and thus
> make this access here to the driver safe (given the pci_dev_get() above)? And a lock
> would prevent concurrent access with a busy wait when the driver executes the next
> lock take?

Actually, nothing prevents a driver from being unbound unless you are holding the device lock, because the driver core needs the device lock in order to call driver removal [1]. So if you acquire the lock, either the driver is still bound and you are OK, or the driver is gone and there's nothing to do. The incremented refcount only prevents ->release() of the device from running, and the memory allocated for the device from being freed, per kref behavior [2].

[1]: https://elixir.bootlin.com/linux/v6.15.1/source/drivers/base/dd.c#L1292
[2]: https://elixir.bootlin.com/linux/v6.15.1/source/include/linux/kref.h#L48
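
To illustrate the distinction, here is a minimal userspace sketch. It is a model only, not kernel code: every model_* name is made up for illustration, a pthread mutex stands in for the device lock, and an atomic counter stands in for the kref. It shows that holding a reference does not stop an unbind, while checking the driver pointer under the lock gives a stable answer.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical userspace stand-in for struct device (illustration only). */
struct model_dev {
	pthread_mutex_t lock;	/* plays the role of the device lock */
	atomic_int refcount;	/* plays the role of the kref */
	const char *driver;	/* NULL when no driver is bound */
};

static struct model_dev *model_dev_new(const char *driver)
{
	struct model_dev *dev = calloc(1, sizeof(*dev));

	if (!dev)
		return NULL;
	pthread_mutex_init(&dev->lock, NULL);
	atomic_init(&dev->refcount, 1);
	dev->driver = driver;
	return dev;
}

/* Taking a reference only pins the memory; it says nothing about binding. */
static struct model_dev *model_dev_get(struct model_dev *dev)
{
	atomic_fetch_add(&dev->refcount, 1);
	return dev;
}

/* Dropping the last reference frees the object; returns true if it did. */
static bool model_dev_put(struct model_dev *dev)
{
	int old = atomic_fetch_sub(&dev->refcount, 1);

	assert(old > 0);
	if (old != 1)
		return false;
	pthread_mutex_destroy(&dev->lock);
	free(dev);
	return true;
}

/* Unbind takes the lock and never consults the refcount. */
static void model_unbind(struct model_dev *dev)
{
	pthread_mutex_lock(&dev->lock);
	dev->driver = NULL;
	pthread_mutex_unlock(&dev->lock);
}

/* The driver check is only stable while the lock is held. */
static bool model_handle_error(struct model_dev *dev)
{
	bool handled = false;

	pthread_mutex_lock(&dev->lock);
	if (dev->driver)
		handled = true;	/* driver cannot go away until we unlock */
	pthread_mutex_unlock(&dev->lock);
	return handled;
}
```

In this model, model_unbind() succeeds even while an extra reference from model_dev_get() is outstanding, and model_handle_error() then finds no driver. For the patch, I would expect the equivalent to be taking the device lock (for example device_lock(&pdev->dev), or a scope-based guard if that fits the rest of the function) around the pdev->dev.driver check, but treat that as a suggestion rather than a tested change.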

DJ

> 
> Terry
> 
> [snip]
>