Message-ID: <e2iu7w3sn7m4zwo6ork2mbfjcfixo5jn5ydshkefezsgtquvh6@kjdvxgiapbjj>
Date: Thu, 22 May 2025 17:17:26 +0530
From: Manivannan Sadhasivam <manivannan.sadhasivam@...aro.org>
To: Hans Zhang <18255117159@....com>
Cc: bhelgaas@...gle.com, tglx@...utronix.de, kw@...ux.com,
mahesh@...ux.ibm.com, oohall@...il.com, linux-pci@...r.kernel.org,
linux-kernel@...r.kernel.org, linuxppc-dev@...ts.ozlabs.org
Subject: Re: [PATCH 0/4] pci: implement "pci=aer_panic"
On Sat, May 17, 2025 at 12:55:14AM +0800, Hans Zhang wrote:
> The following series introduces a new kernel command-line option aer_panic
> to enhance error handling for PCIe Advanced Error Reporting (AER) in
> mission-critical environments. This feature ensures deterministic recovery
> from fatal PCIe errors by triggering a controlled kernel panic when device
> recovery fails, avoiding indefinite system hangs.
>
> Problem Statement
> In systems where unresolved PCIe errors (e.g., bus hangs) occur,
> traditional error recovery mechanisms may leave the system unresponsive
> indefinitely. This is unacceptable for high-availability environments
> requiring prompt recovery via reboot.
>
> Solution
> The aer_panic option forces a kernel panic on unrecoverable AER errors.
> This bypasses prolonged recovery attempts and ensures immediate reboot.
>
You should not panic the kernel when a PCI error occurs (even if it is a fatal
one). You should instead try to reset the root complex. For that you need this
series that got merged recently:
https://lore.kernel.org/all/20250508-pcie-reset-slot-v4-0-7050093e2b50@linaro.org
PS: You need to populate the slot_reset callback in your controller driver to
reset the controller in the event of a fatal AER error or link down.
- Mani
--
மணிவண்ணன் சதாசிவம்