lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <5759e70d-1098-4a6a-bc7a-bcbad394d739@163.com>
Date: Tue, 20 May 2025 23:11:01 +0800
From: Hans Zhang <18255117159@....com>
To: Bjorn Helgaas <helgaas@...nel.org>
Cc: bhelgaas@...gle.com, tglx@...utronix.de, kw@...ux.com,
 manivannan.sadhasivam@...aro.org, mahesh@...ux.ibm.com, oohall@...il.com,
 linux-pci@...r.kernel.org, linux-kernel@...r.kernel.org,
 linuxppc-dev@...ts.ozlabs.org
Subject: Re: [PATCH 0/4] pci: implement "pci=aer_panic"



On 2025/5/20 06:03, Bjorn Helgaas wrote:
> On Sat, May 17, 2025 at 12:55:14AM +0800, Hans Zhang wrote:
>> The following series introduces a new kernel command-line option aer_panic
>> to enhance error handling for PCIe Advanced Error Reporting (AER) in
>> mission-critical environments. This feature ensures deterministic recover
>> from fatal PCIe errors by triggering a controlled kernel panic when device
>> recovery fails, avoiding indefinite system hangs.
> 
> We try very hard not to add new kernel parameters.
> 
> It sounds like part of the problem is the use of SPI interrupts rather
> than the PCIe-architected INTx/MSI/MSI-X.  I'm not sure this warrants
> generic upstream code changes.  This might be something you need to
> maintain out-of-tree.
> 

Dear Bjorn,

This seems to have nothing to do with whether AER uses the 
INTx/MSI/MSI-X specified in the PCIe spec. Just like the example I gave 
earlier.

Our next-generation SOC has already converted AER interrupts into INTx 
and reported them to the GIC interrupt controller. But the following 
problems still cannot be solved.

```
Supplementary reasons:

drivers/pci/controller/cadence/pcie-cadence-host.c
cdns_pci_map_bus
      /* Clear AXI link-down status */
      cdns_pcie_writel(pcie, CDNS_PCIE_AT_LINKDOWN, 0x0);

https://elixir.bootlin.com/linux/v6.15-rc6/source/drivers/pci/controller/cadence/pcie-cadence-host.c#L52

If there has been a link down in this PCIe port, the register
CDNS_PCIE_AT_LINKDOWN must be set to 0 for the AXI transmission to
continue.  This is different from Synopsys.

If CPU Core0 runs to code L52 and CPU Core1 is executing NVMe SSD saving
files, since the CDNS_PCIE_AT_LINKDOWN register is still 1, it causes
CPU Core1 to be unable to send TLP transfers and hang.  This is a very
extreme situation.
(The current Cadence code is Legacy PCIe IP, and the HPA IP is still in
the upstream process at present.)

Radxa O6 uses Cadence's PCIe HPA IP.
http://radxa.com/products/orion/o6/
```


If we are in the out-of-tree maintenance corresponding driver, but in 
the file the arch/arm64 / configs/defconfig "CONFIG_PCIEAER=y", make we 
can't modify the AER common code. It also cannot be compiled to aer.ko

Because: CONFIG_PCIEAER can only be equal to y or n.
config PCIEAER
	bool "PCI Express Advanced Error Reporting support"
	depends on PCIEPORTBUS
	select RAS
	help
	  This enables PCI Express Root Port Advanced Error Reporting
	  (AER) driver support. Error reporting messages sent to Root
	  Port will be handled by PCI Express AER driver.

Furthermore, the API of AER common code cannot be used either, and many 
variables have not been exported either. If we write another set of AER 
drivers by ourselves, it will lead to a lot of repetitive processing 
logic code.


I believe that the Qualcomm platform and many other platforms also have 
similar problems.


So can we add a config? For example: CONFIG_PCIEAER_PANIC instead of 
command-line option aer_panic. Or the AER driver can be KO(tristate), so 
that our SOC manufacturer can modify the AER driver.


Best regards,
Hans

>> Problem Statement
>> In systems where unresolved PCIe errors (e.g., bus hangs) occur,
>> traditional error recovery mechanisms may leave the system unresponsive
>> indefinitely. This is unacceptable for high-availability environment
>> requiring prompt recovery via reboot.
>>
>> Solution
>> The aer_panic option forces a kernel panic on unrecoverable AER errors.
>> This bypasses prolonged recovery attempts and ensures immediate reboot.
>>
>> Patch Summary:
>> Documentation Update: Adds aer_panic to kernel-parameters.txt, explaining
>> its purpose and usage.
>>
>> Command-Line Handling: Implements pci=aer_panic parsing and state
>> management in PCI core.
>>
>> State Exposure: Introduces pci_aer_panic_enabled() to check if the panic
>> mode is active.
>>
>> Panic Trigger: Modifies recovery logic to panic the system when recovery
>> fails and aer_panic is enabled.
>>
>> Impact
>> Controlled Recovery: Reduces downtime by replacing hangs with immediate
>> reboots.
>>
>> Optional: Enabled via pci=aer_panic; no default behavior change.
>>
>> Dependency: Requires CONFIG_PCIEAER.
>>
>> For example, in mobile phones and tablets, when there is a problem with
>> the PCIe link and it cannot be restored, it is expected to provide an
>> alternative method to make the system panic without waiting for the
>> battery power to be completely exhausted before restarting the system.
>>
>> ---
>> For example, the sm8250 and sm8350 of qcom will panic and restart the
>> system when they are linked down.
>>
>> https://github.com/DOITfit/xiaomi_kernel_sm8250/blob/d42aa408e8cef14f4ec006554fac67ef80b86d0d/drivers/pci/controller/pci-msm.c#L5440
>>
>> https://github.com/OnePlusOSS/android_kernel_oneplus_sm8350/blob/13ca08fdf0979fdd61d5e8991661874bb2d19150/drivers/net/wireless/cnss2/pci.c#L950
>>
>>
>> Since the design schemes of each SOC manufacturer are different, the AXI
>> and other buses connected by PCIe do not have a design to prevent hanging.
>> Once a FATAL error occurs in the PCIe link and cannot be restored, the
>> system needs to be restarted.
>>
>>
>> Dear Mani,
>>
>> I wonder if you know how other SoCs of qcom handle FATAL errors that occur
>> in PCIe link.
>> ---
>>
>> Hans Zhang (4):
>>    pci: implement "pci=aer_panic"
>>    PCI/AER: Introduce aer_panic kernel command-line option
>>    PCI/AER: Expose AER panic state via pci_aer_panic_enabled()
>>    PCI/AER: Trigger kernel panic on recovery failure if aer_panic is set
>>
>>   .../admin-guide/kernel-parameters.txt          |  7 +++++++
>>   drivers/pci/pci.c                              |  2 ++
>>   drivers/pci/pci.h                              |  4 ++++
>>   drivers/pci/pcie/aer.c                         | 18 ++++++++++++++++++
>>   drivers/pci/pcie/err.c                         |  8 ++++++--
>>   5 files changed, 37 insertions(+), 2 deletions(-)
>>
>>
>> base-commit: fee3e843b309444f48157e2188efa6818bae85cf
>> prerequisite-patch-id: 299f33d3618e246cd7c04de10e591ace2d0116e6
>> prerequisite-patch-id: 482ad0609459a7654a4100cdc9f9aa4b671be50b
>> -- 
>> 2.25.1
>>


Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ