linux-kernel - Re: [PATCH] AER: PCIE CTO recovery handle fix

Message-ID: <6gs5cvpqbbkudnlr7v57odgaxjyrare6nigrf2lkq22yljult2@z5jklzlmsdcq>
Date: Wed, 11 Jun 2025 22:30:38 +0530
From: Manivannan Sadhasivam <mani@...nel.org>
To: 孙利斌_Dio <dio.sun@...lame-tech.com>
Cc: "mahesh@...ux.ibm.com" <mahesh@...ux.ibm.com>, 
	"oohall@...il.com" <oohall@...il.com>, "bhelgaas@...gle.com" <bhelgaas@...gle.com>, 
	"linuxppc-dev@...ts.ozlabs.org" <linuxppc-dev@...ts.ozlabs.org>, "linux-pci@...r.kernel.org" <linux-pci@...r.kernel.org>, 
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>, 罗安_An <an.luo@...lame-tech.com>, 
	胡淮_Fernando <fernando.hu@...lame-tech.com>, 吴皓睿_Bill <bill.wu@...lame-tech.com>, 
	王鑫_Xin <xin.wang@...lame-tech.com>
Subject: Re: [PATCH] AER: PCIE CTO recovery handle fix

On Tue, Mar 04, 2025 at 07:07:05AM +0000, 孙利斌_Dio wrote:
> From 5fc7b1a9e0f0bcfa14068c6358019ed1e3ffc6c6 Mon Sep 17 00:00:00 2001
> From: "dio.sun" <dio.sun@...lame-tech.com>
> Date: Wed, 26 Feb 2025 08:54:49 +0000
> Subject: [PATCH] AER: PCIE CTO recovery handle fix
> 

It looks like you forwarded this patch instead of submitting it directly. Please
fix it.

>  - A non-fatal PCIe CTO is reported to the PCIe RC and it will be converted to
>    AdvNonFatalErr automatically.
>  - According to PCIe spec 6.2.3.2.4.4, "Requester with Completion Timeout":
>    "If the severity of the CTO is non-fatal, and the Requester elects to
>    attempt recovery by issuing a new request, the Requester must
>    first handle the current error case as an Advisory Non-Fatal Error."
>  - The current kernel code does nothing when it receives an AdvNonFatalErr
>    (a Correctable Error), so the device driver has no chance to handle this
>    error.
>  - In this situation, the system sometimes hangs when more
>    AdvNonFatalErr errors arrive.
> 
> Signed-off-by: dio.sun <dio.sun@...lame-tech.com>
> ---
> drivers/pci/pcie/aer.c | 16 +++++++++++++++-
>  1 file changed, 15 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index 508474e17183..5ddc990c6f42 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -1154,7 +1154,21 @@ static void aer_recover_work_func(struct work_struct *work)
>                 ghes_estatus_pool_region_free((unsigned long)entry.regs,
>                                             sizeof(struct aer_capability_regs));
> 
> -               if (entry.severity == AER_NONFATAL)
> +               if (entry.severity == AER_CORRECTABLE) {
> +                       if (entry.regs->cor_status & PCI_ERR_COR_ADV_NFAT) {
> +                               pci_err(pdev, "%04x:%02x:%02x:%x advisory non-fatal error\n",
> +                                               entry.domain, entry.bus, PCI_SLOT(entry.devfn),
> +                                               PCI_FUNC(entry.devfn));
> +                               if (entry.regs->uncor_status & PCI_ERR_UNC_COMP_TIME) {
> +                                       pci_err(pdev, "%04x:%02x:%02x:%x completion timeout\n",
> +                                                       entry.domain, entry.bus,
> +                                                       PCI_SLOT(entry.devfn),
> +                                                       PCI_FUNC(entry.devfn));
> +                                       pcie_do_recovery(pdev, pci_channel_io_frozen,
> +                                                                        aer_root_reset);
> +                               }
> +                       }

Why is the error handled in aer_recover_work_func()? It looks like this function
only gets triggered from ghes_handle_aer() in drivers/acpi/apei/ghes.c.

I think it should be handled in pci_aer_handle_error(). Also, the error prints
should be folded into aer_print_error().
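
To make the condition under discussion concrete, here is a minimal userspace
sketch of the check the patch performs: a Correctable Error with the Advisory
Non-Fatal bit set, paired with a Completion Timeout in the uncorrectable status.
The two bit masks match include/uapi/linux/pci_regs.h; the enum, the struct and
the helper name are hypothetical stand-ins for illustration, not kernel API, and
this is not a proposal for how pci_aer_handle_error() should look.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit masks as defined in include/uapi/linux/pci_regs.h. */
#define PCI_ERR_COR_ADV_NFAT  0x00002000u /* Advisory Non-Fatal (Correctable Status) */
#define PCI_ERR_UNC_COMP_TIME 0x00004000u /* Completion Timeout (Uncorrectable Status) */

/* Hypothetical stand-in for the kernel's AER severity values. */
enum sketch_severity {
	SKETCH_NONFATAL,
	SKETCH_FATAL,
	SKETCH_CORRECTABLE,
};

/* Hypothetical snapshot of the fields the patch looks at. */
struct sketch_aer_status {
	enum sketch_severity severity;
	uint32_t cor_status;   /* Correctable Error Status register */
	uint32_t uncor_status; /* Uncorrectable Error Status register */
};

/*
 * Return true when the error is a Completion Timeout that was reported
 * as an Advisory Non-Fatal Error, i.e. the case the patch wants to
 * pass on to recovery instead of ignoring.
 */
static bool sketch_is_advisory_cto(const struct sketch_aer_status *s)
{
	return s->severity == SKETCH_CORRECTABLE &&
	       (s->cor_status & PCI_ERR_COR_ADV_NFAT) != 0 &&
	       (s->uncor_status & PCI_ERR_UNC_COMP_TIME) != 0;
}
```

A caller would fill struct sketch_aer_status from the captured AER registers and
only consider recovery when sketch_is_advisory_cto() returns true; any other
correctable error would keep the existing log-only behaviour.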

- Mani

-- 
மணிவண்ணன் சதாசிவம்