lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <579cb233-4827-2d03-56ad-1b807a189ba8@amd.com>
Date:   Mon, 15 May 2023 12:20:42 -0700
From:   Smita Koralahalli <Smita.KoralahalliChannabasappa@....com>
To:     Lukas Wunner <lukas@...ner.de>
Cc:     linux-pci@...r.kernel.org, linux-kernel@...r.kernel.org,
        Bjorn Helgaas <bhelgaas@...gle.com>, oohall@...il.com,
        Mahesh J Salgaonkar <mahesh@...ux.ibm.com>,
        Kuppuswamy Sathyanarayanan 
        <sathyanarayanan.kuppuswamy@...ux.intel.com>,
        Yazen Ghannam <yazen.ghannam@....com>,
        Fontenot Nathan <Nathan.Fontenot@....com>
Subject: Re: [PATCH 1/2] PCI: pciehp: Add support for OS-First Hotplug and
 AER/DPC

Hi Lukas,

On 5/11/2023 8:23 AM, Lukas Wunner wrote:
> On Wed, May 10, 2023 at 10:19:37PM +0200, Lukas Wunner wrote:
>> Below please find a patch which
>> sets the Surprise Down Error mask bit.  Could you test if this fixes
>> the issue for you?
> 
> Sorry, I failed to appreciate that pcie_capability_set_dword()
> can't be used to RMW the AER capability.  Replacement patch below.
> 
> -- >8 --
> 
> From: Lukas Wunner <lukas@...ner.de>
> Subject: [PATCH] PCI: pciehp: Disable Surprise Down Error reporting
> 
> On hotplug ports capable of surprise removal, Surprise Down Errors are
> expected and no reason for AER or DPC to spring into action.  Although
> a Surprise Down event might be caused by an error, software cannot
> discern that from regular surprise removal.
> 
> Any well-behaved BIOS should mask such errors, but Smita reports a case
> where hot-removing an Intel NVMe SSD [8086:0a54] from an AMD Root Port
> [1022:14ab] results in irritating AER log messages and a delay of more
> than 1 second caused by DPC handling:
> 
>    pcieport 0000:00:01.4: DPC: containment event, status:0x1f01 source:0x0000
>    pcieport 0000:00:01.4: DPC: unmasked uncorrectable error detected
>    pcieport 0000:00:01.4: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, (Receiver ID)
>    pcieport 0000:00:01.4:   device [1022:14ab] error status/mask=00000020/04004000
>    pcieport 0000:00:01.4:    [ 5] SDES (First)
>    nvme nvme2: frozen state error detected, reset controller
>    pcieport 0000:00:01.4: DPC: Data Link Layer Link Active not set in 1000 msec
>    pcieport 0000:00:01.4: AER: subordinate device reset failed
>    pcieport 0000:00:01.4: AER: device recovery failed
>    pcieport 0000:00:01.4: pciehp: Slot(16): Link Down
>    nvme2n1: detected capacity change from 1953525168 to 0
>    pci 0000:04:00.0: Removing from iommu group 49
> 
> Avoid by masking Surprise Down Errors on hotplug ports capable of
> surprise removal.
> 
> Mask them even if AER or DPC is handled by firmware because if hotplug
> control was granted to the operating system, it owns hotplug and thus
> Surprise Down events.  So firmware has no business reporting or reacting
> to them.
> 
> Reported-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@....com>
> Link: https://lore.kernel.org/all/20221101000719.36828-2-Smita.KoralahalliChannabasappa@amd.com/
> Signed-off-by: Lukas Wunner <lukas@...ner.de>

Thanks for the patch. I tested it and I notice that the AER status 
registers will still be set. I just don't see a DPC event with these 
settings.

I have logged in the status registers after the device is removed in
pciehp_handle_presence_or_link_change().

[  467.597119] PCI_ERR_COR_STATUS 0x0
[  467.597119] PCI_ERR_UNCOR_STATUS 0x20
[  467.597120] PCI_ERR_ROOT_STATUS 0x0
[  467.597121] PCI_EXP_DPC_RP_PIO_STATUS 0x10000
[  467.597122] PCI_STATUS 0x10
[  467.597123] PCI_EXP_DEVSTA 0x604

Section 6.2.3.2.2 in PCIe Spec v6.0 has also mentioned that:
"If an individual error is masked when it is detected, its error status 
bit is still affected, but no error reporting Message is sent to the 
Root Complex, and the error is not recorded in the Header Log, TLP 
Prefix Log, or First Error Pointer"..

So we think, masking will not help in not logging errors in status 
registers..

Let me know what you think..

Thanks,
Smita

> ---
>   drivers/pci/hotplug/pciehp_hpc.c | 14 +++++++++++++-
>   1 file changed, 13 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/pci/hotplug/pciehp_hpc.c b/drivers/pci/hotplug/pciehp_hpc.c
> index f8c70115b691..40a721f3b713 100644
> --- a/drivers/pci/hotplug/pciehp_hpc.c
> +++ b/drivers/pci/hotplug/pciehp_hpc.c
> @@ -984,8 +984,9 @@ static inline int pcie_hotplug_depth(struct pci_dev *dev)
>   struct controller *pcie_init(struct pcie_device *dev)
>   {
>   	struct controller *ctrl;
> -	u32 slot_cap, slot_cap2, link_cap;
> +	u32 slot_cap, slot_cap2, link_cap, aer_cap;
>   	u8 poweron;
> +	u16 aer;
>   	struct pci_dev *pdev = dev->port;
>   	struct pci_bus *subordinate = pdev->subordinate;
>   
> @@ -1030,6 +1031,17 @@ struct controller *pcie_init(struct pcie_device *dev)
>   	if (dmi_first_match(inband_presence_disabled_dmi_table))
>   		ctrl->inband_presence_disabled = 1;
>   
> +	/*
> +	 * Surprise Down Errors are par for the course on Hot-Plug Surprise
> +	 * capable ports, so disable reporting in case BIOS left it enabled.
> +	 */
> +	aer = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_ERR);
> +	if (aer && slot_cap & PCI_EXP_SLTCAP_HPS) {
> +		pci_read_config_dword(pdev, aer + PCI_ERR_UNCOR_MASK, &aer_cap);
> +		aer_cap |= PCI_ERR_UNC_SURPDN;
> +		pci_write_config_dword(pdev, aer + PCI_ERR_UNCOR_MASK, aer_cap);
> +	}
> +
>   	/* Check if Data Link Layer Link Active Reporting is implemented */
>   	pcie_capability_read_dword(pdev, PCI_EXP_LNKCAP, &link_cap);
>   

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ