lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <8ab986f2-6aa5-401a-aa21-e8b21f68eaad@amd.com>
Date:   Mon, 22 May 2023 15:23:57 -0700
From:   Smita Koralahalli <Smita.KoralahalliChannabasappa@....com>
To:     Lukas Wunner <lukas@...ner.de>
Cc:     linux-pci@...r.kernel.org, linux-kernel@...r.kernel.org,
        Bjorn Helgaas <bhelgaas@...gle.com>, oohall@...il.com,
        Mahesh J Salgaonkar <mahesh@...ux.ibm.com>,
        Kuppuswamy Sathyanarayanan 
        <sathyanarayanan.kuppuswamy@...ux.intel.com>,
        Yazen Ghannam <yazen.ghannam@....com>,
        Fontenot Nathan <Nathan.Fontenot@....com>
Subject: Re: [PATCH v2 1/2] PCI: pciehp: Add support for async hotplug with
 native AER and DPC/EDR

Hi,

On 5/16/2023 3:10 AM, Lukas Wunner wrote:
> On Tue, Apr 18, 2023 at 09:05:25PM +0000, Smita Koralahalli wrote:
>> According to Section 6.7.6 of PCIe Base Specification [1], async removal
>> with DPC and EDR may be unexpected and may result in a Surprise Down error.
>> This error is just a side effect of hot remove. Most of the time, these
>> errors will be abstract to the OS as current systems rely on Firmware-First
>> model for AER and DPC, where the error handling (side effects of async
>> remove) and other necessary HW sequencing actions is taken care by the FW
>> and is abstract to the OS. However, FW-First model poses issues while
>> rolling out updates or fixing bugs as the servers need to be brought down
>> for firmware updates.
>>
>> Add support for async hot-plug with native AER and DPC/EDR. Here, OS is
>> responsible for handling async add and remove along with handling of AER
>> and DPC events which are generated as a side-effect of async remove.
> 
> I think you can probably leave out the details about Firmware-First.
> Pointing to 6.7.6 and the fact that Surprise Down Errors may occur as
> an expected side-effect of surprise removal is sufficient.  They should
> be treated as such.
> 
> You also want to point out what you're trying to achieve here, i.e. get
> rid of irritating log messages and a 1 sec delay while pciehp waits for
> DPC recovery.

Okay.

> 
> 
>> Please note that, I have provided explanation why I'm not setting the
>> Surprise Down bit in uncorrectable error mask register in AER.
>> https://lore.kernel.org/all/fba22d6b-c225-4b44-674b-2c62306135ed@amd.com/
> 
> Add an explanation to the commit message that masking Surprise Down Errors
> was explored as an alternative approach, but scrapped due to the odd
> behavior that masking only avoids the interrupt, but still records an
> error per PCIe r6.0.1 sec 6.2.3.2.2.  That stale error is going to be
> reported the next time some error other than Surprise Down is handled.
> 
> 

Okay.

>> Also, while testing I noticed PCI_STATUS and PCI_EXP_DEVSTA will be set
>> on an async remove and will not be cleared while the device is brought
>> down. I have included clearing them here in order to mask any kind of
>> appearance that there was an error and as well duplicating our BIOS
>> functionality. I can remove if its not necessary.
> 
> Which bits are set exactly?  Can you constrain the register write to
> those bits?
>
Hmm, I was mostly trying to follow the similar approach done for AER.
pci_aer_raw_clear_status(), where they clear status registers 
unconditionally. Also, thought might be better if we are dealing with 
legacy endpoints or so.

I see these bits set in status registers:
PCI_ERR_UNCOR_STATUS 0x20 (Surprise Down)
PCI_EXP_DPC_RP_PIO_STATUS 0x10000 (Memory Request received URCompletion)
PCI_EXP_DEVSTA 0x604 (fatal error detected)

> 
>> +static void dpc_handle_surprise_removal(struct pci_dev *pdev)
>> +{
>> +	if (pdev->dpc_rp_extensions && dpc_wait_rp_inactive(pdev))
>> +		return;
> 
> Emit an error message here?

Okay

> 
> 
>> +	/*
>> +	 * According to Section 6.7.6 of the PCIe Base Spec 6.0, since async
>> +	 * removal might be unexpected, errors might be reported as a side
>> +	 * effect of the event and software should handle them as an expected
>> +	 * part of this event.
>> +	 */
> 
> I'd move that code comment to into dpc_handler() above the
> "if (dpc_is_surprise_removal(pdev))" check.

Okay.

> 
> 
>> +	pci_aer_raw_clear_status(pdev);
>> +	pci_clear_surpdn_errors(pdev);
>> +
>> +	pci_write_config_word(pdev, pdev->dpc_cap + PCI_EXP_DPC_STATUS,
>> +			      PCI_EXP_DPC_STATUS_TRIGGER);
>> +}
> 
> Do you need a "wake_up_all(&dpc_completed_waitqueue);" at the end
> of the function to wake up a pciehp handler waiting for DPC recovery?

I don't think so. The pciehp handler is however getting invoked 
simultaneously due to PDSC or DLLSC state change right.. Let me know if 
I'm missing anything here.

> 
> 
>> +static bool dpc_is_surprise_removal(struct pci_dev *pdev)
>> +{
>> +	u16 status;
>> +
>> +	pci_read_config_word(pdev, pdev->aer_cap + PCI_ERR_UNCOR_STATUS, &status);
>> +
>> +	if (!(status & PCI_ERR_UNC_SURPDN))
>> +		return false;
>> +
> 
> You need an additional check for pdev->is_hotplug_bridge here.
> 
> And you need to read PCI_EXP_SLTCAP and check for PCI_EXP_SLTCAP_HPS.
> 
> Return false if either of them isn't set.

Return false, if PCI_EXP_SLTCAP isn't set only correct? 
PCI_EXP_SLTCAP_HPS should be disabled if DPC is enabled.

Implementation notes in 6.7.6 says that:
"The Hot-Plug Surprise (HPS) mechanism, as indicated by the Hot-Plug
Surprise bit in the Slot Capabilities Register being Set, is deprecated
for use with async hot-plug. DPC is the recommended mechanism for 
supporting async hot-plug."

Platform FW will disable the SLTCAP_HPS bit at boot time to enable async 
hotplug on AMD devices.

Probably check if SLTCAP_HPS bit is set and return false?

> 
> 
>> +	dpc_handle_surprise_removal(pdev);
>> +
>> +	return true;
>> +}
> 
> A function named "..._is_..." should not have any side effects.
> So move the dpc_handle_surprise_removal() call down into dpc_handler()
> before the "return IRQ_HANDLED;" statement.

Okay.

> 
> 
> What about ports which support AER but not DPC?  Don't you need to
> amend aer.c in that case?  I suppose you don't have hardware without
> DPC to test that?

Yeah right. Also, if DPC isn't supported we leave HPS bit set and we 
won't see the DPC event at that time.

Thanks,
Smita

> 
> Thanks,
> 
> Lukas

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ