linux-kernel - Re: [PATCH v9 04/16] PCI/AER: Dequeue forwarded CXL error

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20250612123650.0000321b@huawei.com>
Date: Thu, 12 Jun 2025 12:36:50 +0100
From: Jonathan Cameron <Jonathan.Cameron@...wei.com>
To: Terry Bowman <terry.bowman@....com>
CC: <PradeepVineshReddy.Kodamati@....com>, <dave@...olabs.net>,
	<dave.jiang@...el.com>, <alison.schofield@...el.com>,
	<vishal.l.verma@...el.com>, <ira.weiny@...el.com>,
	<dan.j.williams@...el.com>, <bhelgaas@...gle.com>, <bp@...en8.de>,
	<ming.li@...omail.com>, <shiju.jose@...wei.com>, <dan.carpenter@...aro.org>,
	<Smita.KoralahalliChannabasappa@....com>, <kobayashi.da-06@...itsu.com>,
	<yanfei.xu@...el.com>, <rrichter@....com>, <peterz@...radead.org>,
	<colyli@...e.de>, <uaisheng.ye@...el.com>,
	<fabio.m.de.francesco@...ux.intel.com>, <ilpo.jarvinen@...ux.intel.com>,
	<yazen.ghannam@....com>, <linux-cxl@...r.kernel.org>,
	<linux-kernel@...r.kernel.org>, <linux-pci@...r.kernel.org>
Subject: Re: [PATCH v9 04/16] PCI/AER: Dequeue forwarded CXL error

On Tue, 3 Jun 2025 12:22:27 -0500
Terry Bowman <terry.bowman@....com> wrote:

> The AER driver is now designed to forward CXL protocol errors to the CXL
> driver. Update the CXL driver with functionality to dequeue the forwarded
> CXL error from the kfifo. Also, update the CXL driver to begin the protocol
> error handling processing using the work received from the FIFO.
> 
> Introduce function cxl_prot_err_work_fn() to dequeue work forwarded by the
> AER service driver. This will begin the CXL protocol error processing
> with a call to cxl_handle_prot_error().
> 
> Update cxl/core/ras.c by adding cxl_rch_handle_error_iter() that was
> previously in the AER driver.
> 
> Introduce sbdf_to_pci() to take the SBDF values from 'struct cxl_prot_error_info'
> and use in discovering the erring PCI device. Make scope based reference
> increments/decrements for the discovered PCI device and the associated
> CXL device.
> 
> Implement cxl_handle_prot_error() to differentiate between Restricted CXL
> Host (RCH) protocol errors and CXL virtual host (VH) protocol errors.
> RCH errors will be processed with a call to walk the associated Root
> Complex Event Collector's (RCEC) secondary bus looking for the Root Complex
> Integrated Endpoint (RCiEP) to handle the RCH error. Export pcie_walk_rcec()
> so the CXL driver can walk the RCEC's downstream bus, searching for
> the RCiEP.
> 
> VH correctable error (CE) processing will call the CXL CE handler. VH
> uncorrectable errors (UCE) will call cxl_do_recovery(), implemented as a
> stub for now and to be updated in future patch. Export pci_aer_clean_fatal_status()
> and pci_clean_device_status() used to clean up AER status after handling.
> 
> Signed-off-by: Terry Bowman <terry.bowman@....com>

Hopefully I haven't duplicated existing feedback. A few more minor things
inline.

> +
> +static struct pci_dev *sbdf_to_pci(struct cxl_prot_error_info *err_info)

That's an odd function name as the sbdf is buried.
Unless it's going to get a lot of use, I'd just put the lookup
inline and not have a 'hard to name' function.  With other suggested
changes you only need (at where this is currently called)
	struct pci_dev *pdev __free(pci_dev_put) =
		pci_get_domain_bus_and_slot(err_info->segment, err_info->bus,
					    err_info->devfn);

> +{
> +	unsigned int devfn = PCI_DEVFN(err_info->device,
> +				       err_info->function);

This makes me wonder if you should have just use devfn inside the err_info.
Is there a good reason to split them up before storing them?

IIRC ARI makes a mess anyway of their meaning when separate.

> +	struct pci_dev *pdev __free(pci_dev_put) =
> +		pci_get_domain_bus_and_slot(err_info->segment,
> +					    err_info->bus,
> +					    devfn);
> +	return pdev;
> +}
> +
> +static void cxl_handle_prot_error(struct cxl_prot_error_info *err_info)
> +{
> +	struct pci_dev *pdev __free(pci_dev_put) = pci_dev_get(sbdf_to_pci(err_info));
> +
> +	if (!pdev) {
> +		pr_err("Failed to find the CXL device\n");
> +		return;
> +	}
> +
> +	/*
> +	 * Internal errors of an RCEC indicate an AER error in an
> +	 * RCH's downstream port. Check and handle them in the CXL.mem
> +	 * device driver.
> +	 */
> +	if (pci_pcie_type(pdev) == PCI_EXP_TYPE_RC_EC)
> +		return pcie_walk_rcec(pdev, cxl_rch_handle_error_iter, err_info);
> +
> +	if (err_info->severity == AER_CORRECTABLE) {
> +		int aer = pdev->aer_cap;
> +		struct cxl_dev_state *cxlds = pci_get_drvdata(pdev);
> +		struct device *dev __free(put_device) = get_device(&cxlds->cxlmd->dev);
> +
> +		if (aer)
> +			pci_clear_and_set_config_dword(pdev,
> +						       aer + PCI_ERR_COR_STATUS,
> +						       0, PCI_ERR_COR_INTERNAL);
> +
> +		cxl_cor_error_detected(pdev);
> +
> +		pcie_clear_device_status(pdev);
> +	} else {
> +		cxl_do_recovery(pdev);
> +	}
> +}
> +
>  static void cxl_prot_err_work_fn(struct work_struct *work)
>  {
> +	struct cxl_prot_err_work_data wd;
> +
> +	while (cxl_prot_err_kfifo_get(&wd)) {
> +		struct cxl_prot_error_info *err_info = &wd.err_info;
> +
> +		cxl_handle_prot_error(err_info);

I'm not seeing value in the local variable.
Ignore if this code changes later and that makes more sense!

> +	}
>  }
>  
>  #else