linux-kernel - Re: [BUG] pci: nwl: Unhandled AER correctable error

Message-ID: <emkdycgknxqoovellr3b7ugud4nmqzj5h4o454asafcdvpaczq@x3st2smyvegg>
Date: Tue, 5 Aug 2025 23:00:11 +0530
From: Manivannan Sadhasivam <mani@...nel.org>
To: Sean Anderson <sean.anderson@...ux.dev>
Cc: Bjorn Helgaas <helgaas@...nel.org>, 
	Lorenzo Pieralisi <lpieralisi@...nel.org>, Krzysztof Wilczyński <kw@...ux.com>, 
	Manivannan Sadhasivam <manivannan.sadhasivam@...aro.org>, linux-pci@...r.kernel.org, Rob Herring <robh@...nel.org>, 
	Mahesh J Salgaonkar <mahesh@...ux.ibm.com>, Oliver O'Halloran <oohall@...il.com>, 
	Bjorn Helgaas <bhelgaas@...gle.com>, Michal Simek <michal.simek@....com>, 
	Brian Norris <briannorris@...omium.org>, Minghuan Lian <minghuan.Lian@....com>, 
	Mingkai Hu <mingkai.hu@....com>, Roy Zang <roy.zang@....com>, Frank Li <Frank.Li@....com>, 
	Hou Zhiqiang <Zhiqiang.Hou@....com>, linux-arm-kernel@...ts.infradead.org, 
	linux-kernel@...r.kernel.org
Subject: Re: [BUG] pci: nwl: Unhandled AER correctable error

On Tue, Aug 05, 2025 at 10:02:39AM GMT, Sean Anderson wrote:
> On 8/5/25 06:42, Manivannan Sadhasivam wrote:
> > On Mon, Aug 04, 2025 at 06:10:48PM GMT, Sean Anderson wrote:
> >> On 8/4/25 16:57, Bjorn Helgaas wrote:
> >> > [+cc more folks who might be interested in AER with non-standard
> >> > interrupts]
> >> > 
> >> > On Fri, Aug 01, 2025 at 01:43:19PM -0400, Sean Anderson wrote:
> >> >> Hi,
> >> >> 
> >> >> AER correctable errors are pretty rare. I only saw one once before and
> >> >> came up with commit 78457cae24cb ("PCI: xilinx-nwl: Rate-limit misc
> >> >> interrupt messages") in response. I saw another today and,
> >> >> unfortunately, clearing the correctable AER bit in MSGF_MISC_STATUS is
> >> >> not sufficient to handle the IRQ. It gets immediately re-raised,
> >> >> preventing the system from making any other progress. I suspect that it
> >> >> needs to be cleared in PCI_ERR_ROOT_STATUS. But since the AER IRQ never
> >> >> gets delivered to aer_irq, those registers never get tickled.
> >> >> 
> >> >> The underlying problem is that pcieport thinks that the IRQ is going to
> >> >> be one of the MSIs or a legacy interrupt, but it's actually a native
> >> >> interrupt:
> >> >> 
> >> >>            CPU0       CPU1       CPU2       CPU3       
> >> >>  42:          0          0          0          0     GICv2 150 Level     nwl_pcie:misc
> >> >>  45:          0          0          0          0  nwl_pcie:legacy   0 Level     PCIe PME, aerdrv
> >> >>  46:         25          0          0          0  nwl_pcie:msi 524288 Edge      nvme0q0
> >> >>  47:          0          0          0          0  nwl_pcie:msi 524289 Edge      nvme0q1
> >> >>  48:          0          0          0          0  nwl_pcie:msi 524290 Edge      nvme0q2
> >> >>  49:         46          0          0          0  nwl_pcie:msi 524291 Edge      nvme0q3
> >> >>  50:          0          0          0          0  nwl_pcie:msi 524292 Edge      nvme0q4
> >> >> 
> >> >> In the above example, AER errors will trigger interrupt 42, not 45.
> >> >> Actually, there are a bunch of different interrupts in MSGF_MISC_STATUS,
> >> >> so maybe nwl_pcie_misc_handler should be an interrupt controller
> >> >> instead? But even then pcie_port_enable_irq_vec() won't figure out the
> >> >> correct IRQ. Any ideas on how to fix this?
> >> >> 
> >> >> Additionally, any tips on actually triggering AER/PME stuff in a
> >> >> consistent way? Are there any off-the-shelf cards for sending weird PCIe
> >> >> stuff over a link for testing? Right now all I have 
> >> > 
> >> > This is definitely a problem.  We have had some discussion about this
> >> > in the past, but haven't quite achieved critical mass to solve this in
> >> > a generic way.  Here are some links:
> >> > 
> >> >   https://lore.kernel.org/linux-pci/20250702223841.GA1905230@bhelgaas/t/#u
> >> >   https://lore.kernel.org/linux-pci/1464242406-20203-1-git-send-email-po.liu@nxp.com/
> >> 
> >> Thanks for the links. Toggling PERST does seem to reliably cause
> >> correctable errors (however "correctable" they may actually be in
> >> practice). With the patch I posted on the other branch of this chain I
> >> now get
> >> 
> >> [   43.041610] pcieport 0000:00:00.0: AER: Multiple Corrected error message received from 0000:00:00.0
> >> [   43.050693] pcieport 0000:00:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
> >> [   43.061477] pcieport 0000:00:00.0:   device [10ee:d011] error status/mask=00000001/0000e000
> >> [   43.069842] pcieport 0000:00:00.0:    [ 0] RxErr                 
> >> 
> >> Whether or not that's the right fix, at least I can test things :)
> > 
> > Could you please check if INTX is working for AER? You can just pass the cmdline
> > parameter, "pcie_pme=nomsi" and observe if the IRQ is getting triggered.
> 
> I don't really understand what you want me to check. As shown above, pme
> and aer are already assigned to INTA, not an MSI. This of course never
> gets triggered.
> 

Sorry, my bad. I misread the MSI interrupts assigned to NVMe queues as AER.

> Figure 30-5 in UG1085 [1] shows the interrupt architecture, and I think
> it's clear from that diagram that there's no pathway for root port
> errors to trigger an MSI or a legacy interrupt.
> 

Then we really need to hook aer_irq up to the platform interrupt, with the help
of the controller driver. It is not at the top of my priority list, so someone
with the bandwidth and motivation should look into it.
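
As a rough illustration of the failure mode reported earlier in the thread, here
is a small user-space model, not kernel code. It assumes (as Sean suspects, but
nobody has confirmed) that the correctable-AER bit in MSGF_MISC_STATUS stays
asserted for as long as the corresponding flag in PCI_ERR_ROOT_STATUS is set.
All names, struct fields and bit positions below are hypothetical stand-ins and
do not come from the nwl driver or the hardware documentation.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit positions, chosen only for this model. */
#define MODEL_MISC_CORR_AER (1u << 0) /* stand-in for the correctable-AER bit in MSGF_MISC_STATUS */
#define MODEL_ROOT_COR_RCV  (1u << 0) /* stand-in for "correctable error received" in PCI_ERR_ROOT_STATUS */

struct model_bridge {
	uint32_t root_status; /* models PCI_ERR_ROOT_STATUS */
	uint32_t misc_status; /* models MSGF_MISC_STATUS */
};

/* Assumption: the misc bit is re-latched while the root status flag is set. */
void model_update(struct model_bridge *b)
{
	if (b->root_status & MODEL_ROOT_COR_RCV)
		b->misc_status |= MODEL_MISC_CORR_AER;
}

/* True while the modelled misc interrupt line is asserted. */
bool model_irq_pending(struct model_bridge *b)
{
	model_update(b);
	return (b->misc_status & MODEL_MISC_CORR_AER) != 0;
}

/* Simulate the root port receiving one correctable error message. */
void model_inject_correctable(struct model_bridge *b)
{
	b->root_status |= MODEL_ROOT_COR_RCV;
}

/* Acknowledge only the misc bit, as in the reported behaviour. */
void handler_misc_only(struct model_bridge *b)
{
	b->misc_status &= ~MODEL_MISC_CORR_AER;
}

/* Clear the root status flag first, then the misc bit. */
void handler_with_root_ack(struct model_bridge *b)
{
	b->root_status &= ~MODEL_ROOT_COR_RCV;
	b->misc_status &= ~MODEL_MISC_CORR_AER;
}

/*
 * Call the handler until the line goes quiet or 'limit' invocations have
 * been made. Returns the number of handler invocations.
 */
unsigned int model_service(struct model_bridge *b,
			   void (*handler)(struct model_bridge *),
			   unsigned int limit)
{
	unsigned int runs = 0;

	while (runs < limit && model_irq_pending(b)) {
		handler(b);
		runs++;
	}
	return runs;
}
```

Under that assumption, servicing one injected error with handler_misc_only()
never quiesces and always hits the invocation limit, which matches the
"immediately re-raised" symptom. handler_with_root_ack() quiesces after a single
invocation. In a real fix the root status would presumably be cleared by the AER
service itself once aer_irq is wired to the platform interrupt, rather than by
the controller driver duplicating that logic.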

- Mani

-- 
மணிவண்ணன் சதாசிவம்