linux-kernel - Re: [PATCH 0/4] pci: implement "pci=aer

Message-ID: <1c21ec0b-ca89-4f7e-85f2-bdb48edb8055@163.com>
Date: Wed, 21 May 2025 22:54:33 +0800
From: Hans Zhang <18255117159@....com>
To: Sathyanarayanan Kuppuswamy <sathyanarayanan.kuppuswamy@...ux.intel.com>,
 bhelgaas@...gle.com, tglx@...utronix.de, kw@...ux.com,
 manivannan.sadhasivam@...aro.org, mahesh@...ux.ibm.com
Cc: oohall@...il.com, linux-pci@...r.kernel.org,
 linux-kernel@...r.kernel.org, linuxppc-dev@...ts.ozlabs.org
Subject: Re: [PATCH 0/4] pci: implement "pci=aer_panic"



On 2025/5/21 00:09, Sathyanarayanan Kuppuswamy wrote:
> 
> On 5/19/25 7:41 AM, Hans Zhang wrote:
>>
>>
>> On 2025/5/19 22:21, Hans Zhang wrote:
>>>
>>>
>>> On 2025/5/17 02:10, Sathyanarayanan Kuppuswamy wrote:
>>>>
>>>> On 5/16/25 9:55 AM, Hans Zhang wrote:
>>>>> The following series introduces a new kernel command-line option,
>>>>> aer_panic, to enhance error handling for PCIe Advanced Error
>>>>> Reporting (AER) in mission-critical environments. This feature
>>>>> ensures deterministic recovery from fatal PCIe errors by triggering
>>>>> a controlled kernel panic when device recovery fails, avoiding
>>>>> indefinite system hangs.
>>>>
>>>> Why would a device recovery failure lead to a system hang? Worst case
>>>> that device may not be accessible, right?  Any real use case?
>>>>
>>>
>>>
>>> Dear Sathyanarayanan,
>>>
>>> With Synopsys and Cadence PCIe IP, the AER interrupt is usually an 
>>> SPI interrupt, not an INTx/MSI/MSI-X interrupt.  (Some customers 
>>> design it as an MSI/MSI-X interrupt, e.g. RK3588, but not all 
>>> customers do.)  For example, when many Qualcomm mobile phone SoCs 
>>> handle an AER interrupt and the link is down, that is, a fatal 
>>> problem has occurred on the PCIe physical link, the system cannot 
>>> recover.  At this point, a system restart is needed to solve the 
>>> problem.
>>>
>>> Our company's SoC design, http://radxa.com/products/orion/o6/, has 
>>> 5 PCIe ports and has the same problem.  If one of the PCIe ports 
>>> has a problem, the entire system hangs.  So I hope Linux can offer 
>>> an option that lets SoC manufacturers choose to restart the system 
>>> when a fatal hardware error occurs in PCIe.
>>>
>>> There are also products such as mobile phones and tablets.  We don't 
>>> want to wait until the battery is completely used up before 
>>> restarting them.
>>>
>>> For the specific code of Qualcomm, please refer to the email I sent.
>>>
>>
>>
>> Dear Sathyanarayanan,
>>
>> Supplementary reasons:
>>
>> drivers/pci/controller/cadence/pcie-cadence-host.c
>> cdns_pci_map_bus
>>     /* Clear AXI link-down status */
>>     cdns_pcie_writel(pcie, CDNS_PCIE_AT_LINKDOWN, 0x0);
>>
>> https://elixir.bootlin.com/linux/v6.15-rc6/source/drivers/pci/controller/cadence/pcie-cadence-host.c#L52
>>
>> If a link down has occurred on this PCIe port, the register 
>> CDNS_PCIE_AT_LINKDOWN must be set to 0 before AXI transmission can 
>> continue.  This is different from Synopsys.
>>
>> If CPU core 0 runs to code L52 while CPU core 1 is saving files to an 
>> NVMe SSD, then because the CDNS_PCIE_AT_LINKDOWN register is still 1, 
>> CPU core 1 cannot send TLPs and hangs.  This is a very extreme 
>> situation.
>> (The current Cadence code is for the legacy PCIe IP; the HPA IP is 
>> still being upstreamed.)
>>
>> Radxa O6 uses Cadence's PCIe HPA IP.
>> http://radxa.com/products/orion/o6/
>>
> 
> It sounds like a system-level issue to me. Why not rely on a watchdog 
> to reboot in this case?

Dear Sathyanarayanan,

Thank you for your reply. Yes, personally, I also think it is a 
system-level problem. I ran a local test: when I unplugged the EP device 
directly from the slot, the system hung, and this reproduced over many 
attempts. Because we have no bus timeout response mechanism for PCIe, 
the system hangs easily.

> 
> Even if you want to add this support, I think it is more appropriate to 
> add it to your specific PCIe controller driver.  I don't see why you 
> want to add it as part of the generic AER driver.
> 
Because we want to reuse the recovery logic of the generic AER driver. 
If recovery succeeds, there is no problem. If recovery fails, my 
original intention was to restart the system.
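
The intended behaviour can be sketched as a small decision function. 
This is a minimal userspace model, not code from the series: the enum 
names and aer_after_recovery() are hypothetical, and panic_enabled only 
stands in for the proposed pci=aer_panic switch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the outcome of the generic AER recovery path. */
enum recovery_result {
	RECOVERY_OK,      /* device recovered, normal operation resumes */
	RECOVERY_FAILED,  /* recovery gave up, device is unusable */
};

/* What the platform does once recovery has finished. */
enum aer_action {
	AER_CONTINUE,        /* recovery succeeded: nothing more to do */
	AER_LEAVE_DISABLED,  /* no opt-in: keep running without the device */
	AER_PANIC,           /* opt-in: panic so that the system reboots */
};

/*
 * Decide the follow-up action. panic_enabled models the proposed
 * pci=aer_panic switch; it only matters when recovery has failed.
 */
enum aer_action aer_after_recovery(enum recovery_result res, bool panic_enabled)
{
	assert(res == RECOVERY_OK || res == RECOVERY_FAILED);

	if (res == RECOVERY_OK)
		return AER_CONTINUE;

	return panic_enabled ? AER_PANIC : AER_LEAVE_DISABLED;
}
```

The point of the sketch is that the option changes nothing on the 
success path; it only selects what happens after a failed recovery.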

If this were added to a specific PCIe controller driver, a lot of AER 
processing logic would have to be duplicated there. So I was wondering 
whether the AER driver could be changed so that it can be built as a 
loadable kernel module (.ko).


If this series is not reasonable, I'll drop it.


Best regards,
Hans

>>>
>>>>>
>>>>> Problem Statement
>>>>> In systems where unresolved PCIe errors (e.g., bus hangs) occur,
>>>>> traditional error recovery mechanisms may leave the system
>>>>> unresponsive indefinitely. This is unacceptable for
>>>>> high-availability environments requiring prompt recovery via reboot.
>>>>>
>>>>> Solution
>>>>> The aer_panic option forces a kernel panic on unrecoverable AER 
>>>>> errors.
>>>>> This bypasses prolonged recovery attempts and ensures immediate 
>>>>> reboot.
>>>>>
>>>>> Patch Summary:
>>>>> Documentation Update: Adds aer_panic to kernel-parameters.txt, 
>>>>> explaining
>>>>> its purpose and usage.
>>>>>
>>>>> Command-Line Handling: Implements pci=aer_panic parsing and state
>>>>> management in PCI core.
>>>>>
>>>>> State Exposure: Introduces pci_aer_panic_enabled() to check if the 
>>>>> panic
>>>>> mode is active.
>>>>>
>>>>> Panic Trigger: Modifies recovery logic to panic the system when 
>>>>> recovery
>>>>> fails and aer_panic is enabled.
>>>>>
>>>>> Impact
>>>>> Controlled Recovery: Reduces downtime by replacing hangs with 
>>>>> immediate
>>>>> reboots.
>>>>>
>>>>> Optional: Enabled via pci=aer_panic; no default behavior change.
>>>>>
>>>>> Dependency: Requires CONFIG_PCIEAER.
>>>>>
>>>>> For example, on mobile phones and tablets, when the PCIe link has a
>>>>> problem and cannot be restored, there should be a way to make the
>>>>> system panic instead of waiting for the battery to be completely
>>>>> exhausted before the system restarts.
>>>>>
>>>>> ---
>>>>> For example, the sm8250 and sm8350 from qcom panic and restart the
>>>>> system when the link goes down.
>>>>>
>>>>> https://github.com/DOITfit/xiaomi_kernel_sm8250/blob/d42aa408e8cef14f4ec006554fac67ef80b86d0d/drivers/pci/controller/pci-msm.c#L5440
>>>>>
>>>>> https://github.com/OnePlusOSS/android_kernel_oneplus_sm8350/blob/13ca08fdf0979fdd61d5e8991661874bb2d19150/drivers/net/wireless/cnss2/pci.c#L950
>>>>>
>>>>>
>>>>> Because each SoC manufacturer's design is different, the AXI and
>>>>> other buses that PCIe connects to may not be designed to prevent
>>>>> hanging. Once a FATAL error occurs on the PCIe link and cannot be
>>>>> recovered, the system needs to be restarted.
>>>>>
>>>>>
>>>>> Dear Mani,
>>>>>
>>>>> I wonder if you know how other SoCs of qcom handle FATAL errors 
>>>>> that occur
>>>>> in PCIe link.
>>>>> ---
>>>>>
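
For reference, the command-line handling and state exposure described in 
the patch summary quoted above could look roughly like the following. 
This is a userspace sketch under stated assumptions: only the name 
pci_aer_panic_enabled() and the option string aer_panic come from the 
cover letter; parse_pci_options(), the aer_panic_enabled flag, and the 
exact matching rules are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical state flag set while parsing the "pci=" option list. */
static bool aer_panic_enabled;

/* Name taken from the patch summary; the body here is only a guess. */
bool pci_aer_panic_enabled(void)
{
	return aer_panic_enabled;
}

/*
 * Userspace model of parsing the value of "pci=", which is a
 * comma-separated list of options. Unknown options are ignored here.
 */
void parse_pci_options(const char *opts)
{
	assert(opts != NULL);

	while (*opts) {
		/* Length of the current token, up to the next comma or the end. */
		size_t len = strcspn(opts, ",");

		/* Require an exact match so "aer_panicky" does not enable it. */
		if (len == strlen("aer_panic") &&
		    strncmp(opts, "aer_panic", len) == 0)
			aer_panic_enabled = true;

		opts += len;
		if (*opts == ',')
			opts++;
	}
}
```

With this model the flag stays false unless aer_panic appears as a whole 
token, which matches the "no default behavior change" point in the 
cover letter.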