[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <31064f33-2a34-1855-6729-a4efe32b10e8@amd.com>
Date: Tue, 14 Mar 2023 12:31:11 -0700
From: Smita Koralahalli <Smita.KoralahalliChannabasappa@....com>
To: Lukas Wunner <lukas@...ner.de>
Cc: linux-pci@...r.kernel.org, linux-kernel@...r.kernel.org,
Bjorn Helgaas <bhelgaas@...gle.com>, oohall@...il.com,
Mahesh J Salgaonkar <mahesh@...ux.ibm.com>,
Kuppuswamy Sathyanarayanan
<sathyanarayanan.kuppuswamy@...ux.intel.com>,
Yazen Ghannam <yazen.ghannam@....com>,
Fontenot Nathan <Nathan.Fontenot@....com>
Subject: Re: [PATCH 1/2] PCI: pciehp: Add support for OS-First Hotplug and
AER/DPC
Hi,
Please let me know if I should redo this on latest tree and discuss my
comments there.
Thanks,
Smita
On 2/14/2023 1:33 AM, Smita Koralahalli wrote:
> On 11/4/2022 3:15 AM, Lukas Wunner wrote:
>> On Tue, Nov 01, 2022 at 12:07:18AM +0000, Smita Koralahalli wrote:
>>> The implementation is as follows: On an async remove a DPC is
>>> triggered as
>>> a side-effect along with an MSI to the OS. Determine it's an async
>>> remove
>>> by checking for DPC Trigger Status in DPC Status Register and Surprise
>>> Down Error Status in AER Uncorrected Error Status to be non-zero. If
>>> true,
>>> treat the DPC event as a side-effect of async remove, clear the error
>>> status registers and continue with hot-plug tear down routines. If not,
>>> follow the existing routine to handle AER/DPC errors.
>> Instead of having the OS recognize and filter Surprise Down events,
>> it would also be possible to simply set the Surprise Down bit in the
>> Uncorrectable Error Mask Register. This could be constrained to
>> Downstream Ports capable of surprise removal, i.e. those where the
>> is_hotplug_bridge in struct pci_dev is set. And that check and the
>> register change could be performed in pci_dpc_init().
>>
>> Have you considered such an alternative approach? If you have, what
>> was the reason to prefer the more complex solution you're proposing?
>
> By setting the Surprise down bit in Uncorrectable Error Mask register
> we will
> not get a DPC event. What I know so far is, we cannot set this bit at
> run-time
> after we determine its a surprise down event or probably I don't know
> enough how to do it! (once an pciehp interrupt is triggered..).
>
> And setting this bit at initialization might not trigger true DPC
> events..
>
> Second thing, is masking Surprise Down bit has no impact on logging
> errors in
> AER registers.
>
> So, I think that approach probably will not resolve the issue of
> clearing the logs
> in AER registers and complicate things while differentiating true
> errors vs surprise
> down events. Please correct me if I'm wrong!!
>
> I did few testing after I read your comments. What I realized is
> that, these DPC
> events (side effects of hot remove) are actually benign on AMD
> systems. On a hot
> remove I see a Presence Detect change and a DPC event. This PD state
> change
> will trigger a pciehp isr and calls
> pciehp_handle_presence_or_link_change()
> and disables the slot normally. So essentially, this patch will boil
> down to the point
> to clearing the logs in AER registers and also handling those error
> messages ("device
> recovery failed"....) in dmesg which might confuse users on a hot remove.
> What do you think?
>
> Now, I'm not sure whether there will be a PD state change across other
> systems on a
> hot remove when AER/DPC is native (OS First) in which case we should
> call the
> pciehp_disable_slot() from dpc handler as well.. Any inputs would be
> appreciated here..
>>
>>
>>> +static void pci_clear_surpdn_errors(struct pci_dev *pdev)
>>> +{
>>> + u16 reg16;
>>> + u32 reg32;
>>> +
>>> + pci_read_config_dword(pdev, pdev->dpc_cap +
>>> PCI_EXP_DPC_RP_PIO_STATUS, ®32);
>>> + pci_write_config_dword(pdev, pdev->dpc_cap +
>>> PCI_EXP_DPC_RP_PIO_STATUS, reg32);
>>> +
>>> + pci_read_config_word(pdev, PCI_STATUS, ®16);
>>> + pci_write_config_word(pdev, PCI_STATUS, reg16);
>>> +
>>> + pcie_capability_read_word(pdev, PCI_EXP_DEVSTA, ®16);
>>> + pcie_capability_write_word(pdev, PCI_EXP_DEVSTA, reg16);
>>> +}
>> I don't understand why PCI_STATUS and PCI_EXP_DEVSTA need to be
>> touched here?
>
> This is just to mask any kind of appearance that there was an error since
> the errors would have been induced by the hot plug event (just
> duplicating
> our BIOS functionality here..). But, please let me know if OS is already
> handling the things here and if it is not required.
>>
>>
>>> +static void pciehp_handle_surprise_removal(struct pci_dev *pdev)
>> Since this function is located in dpc.c and is strictly called from
>> other functions in the same file, it should be prefixed dpc_, not
>> pciehp_.
>
> Sure, will fix in v2.
>>
>>
>>> + /*
>>> + * According to Section 6.13 and 6.15 of the PCIe Base Spec 6.0,
>>> + * following a hot-plug event, clear the ARI Forwarding Enable bit
>>> + * and AtomicOp Requester Enable as its not determined whether the
>>> + * next device inserted will support these capabilities. AtomicOp
>>> + * capabilities are not supported on PCI Express to PCI/PCI-X
>>> Bridges
>>> + * and any newly added component may not be an ARI device.
>>> + */
>>> + pcie_capability_clear_word(pdev, PCI_EXP_DEVCTL2,
>>> + (PCI_EXP_DEVCTL2_ARI |
>>> PCI_EXP_DEVCTL2_ATOMIC_REQ));
>> That looks like a reasonable change, but it belongs in a separate
>> patch. And I think it should be performed as part of (de-)enumeration,
>> not as part of DPC error handling. What about Downstream Ports which
>> are not DPC-capable, I guess the bits should be cleared as well, no?
>
> AFAIK, DPC will work on all AMD root ports. I'm not sure how could we
> handle
> on a per port basis if the bridges/ports downstream to root ports
> don't support
> DPC..
>>
>> How about clearing the bits in pciehp_unconfigure_device()?
>
> Okay.
>
> Thanks,
> Smita
>>
>> Thanks,
>>
>> Lukas
>
>
Powered by blists - more mailing lists