lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <ce191d03-c228-4f1e-b96a-0388220bc586@amd.com>
Date: Mon, 24 Jun 2024 13:21:59 -0500
From: Terry Bowman <Terry.Bowman@....com>
To: Dan Williams <dan.j.williams@...el.com>, ira.weiny@...el.com,
 dave@...olabs.net, dave.jiang@...el.com, alison.schofield@...el.com,
 ming4.li@...el.com, vishal.l.verma@...el.com, jim.harris@...sung.com,
 ilpo.jarvinen@...ux.intel.com, ardb@...nel.org,
 sathyanarayanan.kuppuswamy@...ux.intel.com, linux-cxl@...r.kernel.org,
 linux-kernel@...r.kernel.org, Yazen.Ghannam@....com, Robert.Richter@....com
Cc: Bjorn Helgaas <bhelgaas@...gle.com>, linux-pci@...r.kernel.org
Subject: Re: [RFC PATCH 3/9] PCI/portdrv: Update portdrv with an atomic
 notifier for reporting AER internal errors

Hi Dan,

I added responses inline below.

On 6/21/24 14:36, Dan Williams wrote:
> Terry Bowman wrote:
>> PCIe port devices are bound to portdrv, the PCIe port bus driver. portdrv
>> does not implement an AER correctable handler (CE) but does implement the
>> AER uncorrectable error (UCE). The UCE handler is fairly straightforward
>> in that it only checks for frozen error state and returns the next step
>> for recovery accordingly.
>>
>> As a result, port devices relying on AER correctable internal errors (CIE)
>> and AER uncorrectable internal errors (UIE) will not be handled. Note,
>> the PCIe spec indicates AER CIE/UIE can be used to report implementation
>> specific errors.[1]
>>
>> CXL root ports, CXL downstream switch ports, and CXL upstream switch ports
>> are examples of devices using the AER CIE/UIE for implementation specific
>> purposes. These CXL ports use the AER interrupt and AER CIE/UIE status to
>> report CXL RAS errors.[2]
>>
>> Add an atomic notifier to portdrv's CE/UCE handlers. Use the atomic
>> notifier to report CIE/UIE errors to the registered functions. This will
>> require adding a CE handler and updating the existing UCE handler.
>>
>> For the UCE handler, the CXL spec states UIE errors should return need
>> reset: "The only method of recovering from an Uncorrectable Internal Error
>> is reset or hardware replacement."[1]
>>
>> [1] PCI6.0 - 6.2.10 Internal Errors
>> [2] CXL3.1 - 12.2.2 CXL Root Ports, Downstream Switch Ports, and
>>              Upstream Switch Ports
>>
>> Signed-off-by: Terry Bowman <terry.bowman@....com>
>> Cc: Bjorn Helgaas <bhelgaas@...gle.com>
>> Cc: linux-pci@...r.kernel.org
>> ---
>>  drivers/pci/pcie/portdrv.c | 32 ++++++++++++++++++++++++++++++++
>>  drivers/pci/pcie/portdrv.h |  2 ++
>>  2 files changed, 34 insertions(+)
>>
>> diff --git a/drivers/pci/pcie/portdrv.c b/drivers/pci/pcie/portdrv.c
>> index 14a4b89a3b83..86d80e0e9606 100644
>> --- a/drivers/pci/pcie/portdrv.c
>> +++ b/drivers/pci/pcie/portdrv.c
>> @@ -37,6 +37,9 @@ struct portdrv_service_data {
>>  	u32 service;
>>  };
>>  
>> +ATOMIC_NOTIFIER_HEAD(portdrv_aer_internal_err_chain);
>> +EXPORT_SYMBOL_GPL(portdrv_aer_internal_err_chain);
>> +
>>  /**
>>   * release_pcie_device - free PCI Express port service device structure
>>   * @dev: Port service device to release
>> @@ -745,11 +748,39 @@ static void pcie_portdrv_shutdown(struct pci_dev *dev)
>>  static pci_ers_result_t pcie_portdrv_error_detected(struct pci_dev *dev,
>>  					pci_channel_state_t error)
>>  {
>> +	if (dev->aer_cap) {
>> +		u32 status;
>> +
>> +		pci_read_config_dword(dev, dev->aer_cap + PCI_ERR_UNCOR_STATUS,
>> +				      &status);
>> +
>> +		if (status & PCI_ERR_UNC_INTN) {
>> +			atomic_notifier_call_chain(&portdrv_aer_internal_err_chain,
>> +						   AER_FATAL, (void *)dev);
>> +			return PCI_ERS_RESULT_NEED_RESET;
>> +		}
>> +	}
>> +
> 
> Oh, this is a finer grained  / lower-level location than I was
> expecting. I was expecting that the notifier was just conveying the port
> interrupt notification to a driver that knew how to take the next step.
> This pcie_portdrv_error_detected() is a notification that is already
> "downstream" of the AER notification.
> 

My intent was to implement the UIE/CIE "implementation specific" behavior as 
mentioned in the PCI spec. This included allowing port devices to be notified if 
needed. This plan is not ideal but works within the PCI portdrv situation
and before we can introduce a CXL specific portdriver.

> If PCIe does not care about CIE and UIE then don't make it care, but
> redirect the notifications to the CXL side that may care.
> 
> Leave the portdrv handlers PCIe native as much as possible.
> 
> Now, I have not thought through the full implications of that
> suggestion, but for now am reacting to this AER -> PCIe err_handler ->
> CXL notfier as potentially more awkward than AER -> CXL notifier. It's a
> separate error handling domain that the PCIe side likely does not want
> to worry about. PCIe side is only responsible for allowing CXL to
> register for the notifications beacuse the AER interrupt is shared.

Hmmm, this sounds like either option#2 or introducing a CXL portdrv service 
driver. 

Thanks for the reviews and please let me know which option you 
would like me to purse.

Regards,
Terry


Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ