Message-ID: <53f6d267-62af-4dad-8fa7-a2a497b22636@linux.dev>
Date: Fri, 1 Aug 2025 13:43:19 -0400
From: Sean Anderson <sean.anderson@...ux.dev>
To: Lorenzo Pieralisi <lpieralisi@...nel.org>,
Krzysztof Wilczyński <kw@...ux.com>,
Manivannan Sadhasivam <manivannan.sadhasivam@...aro.org>,
linux-pci@...r.kernel.org
Cc: Rob Herring <robh@...nel.org>, Mahesh J Salgaonkar
<mahesh@...ux.ibm.com>, Oliver O'Halloran <oohall@...il.com>,
Bjorn Helgaas <bhelgaas@...gle.com>, Michal Simek <michal.simek@....com>,
linux-arm-kernel@...ts.infradead.org, linux-kernel@...r.kernel.org
Subject: [BUG] pci: nwl: Unhandled AER correctable error
Hi,

AER correctable errors are pretty rare. I had seen only one before, and
came up with commit 78457cae24cb ("PCI: xilinx-nwl: Rate-limit misc
interrupt messages") in response. I saw another today and,
unfortunately, clearing the correctable AER bit in MSGF_MISC_STATUS is
not sufficient to handle the IRQ. It gets re-raised immediately,
preventing the system from making any other progress. I suspect that the
error also needs to be cleared in PCI_ERR_ROOT_STATUS, but since the AER
IRQ never gets delivered to aer_irq, those registers never get touched.
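To make the suspected failure mode concrete, here is a minimal userspace model of it. This is only a sketch of the hypothesis above, not of confirmed hardware behavior: the struct, the model_* helpers, and the bit positions are all hypothetical, and the re-latching rule in model_write_misc_status() is exactly the assumption being illustrated.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bit positions, chosen for illustration only. */
#define MODEL_MISC_CORR_AER (UINT32_C(1) << 0) /* stand-in for the correctable AER bit in MSGF_MISC_STATUS */
#define MODEL_ROOT_COR_RCV  (UINT32_C(1) << 0) /* stand-in for "ERR_COR received" in PCI_ERR_ROOT_STATUS */

struct model_bridge {
	uint32_t misc_status; /* write-1-to-clear summary register */
	uint32_t root_status; /* write-1-to-clear AER root status */
};

/* Hardware side: an ERR_COR message arrives and latches both registers. */
static void model_receive_err_cor(struct model_bridge *b)
{
	assert(b != NULL);
	b->root_status |= MODEL_ROOT_COR_RCV;
	b->misc_status |= MODEL_MISC_CORR_AER;
}

/*
 * W1C write to the misc status register.
 * ASSUMPTION: while the root status bit is still set, the summary bit
 * latches again right away, so the level interrupt stays asserted.
 */
static void model_write_misc_status(struct model_bridge *b, uint32_t val)
{
	b->misc_status &= ~val;
	if (b->root_status & MODEL_ROOT_COR_RCV)
		b->misc_status |= MODEL_MISC_CORR_AER;
}

/* W1C write to the root status register. */
static void model_write_root_status(struct model_bridge *b, uint32_t val)
{
	b->root_status &= ~val;
}

/* The misc line is level-triggered: asserted while any status bit is set. */
static bool model_irq_asserted(const struct model_bridge *b)
{
	return b->misc_status != 0;
}

/* Ack only the summary bit; returns true if the line is still asserted. */
static bool model_ack_misc_only(struct model_bridge *b)
{
	model_write_misc_status(b, MODEL_MISC_CORR_AER);
	return model_irq_asserted(b);
}

/* Clear the source first, then the summary bit. */
static bool model_ack_root_then_misc(struct model_bridge *b)
{
	model_write_root_status(b, MODEL_ROOT_COR_RCV);
	model_write_misc_status(b, MODEL_MISC_CORR_AER);
	return model_irq_asserted(b);
}
```

In this model, model_ack_misc_only() leaves the line asserted after an error arrives, which would look like the interrupt storm described above, while model_ack_root_then_misc() deasserts it. Whether the real hardware behaves this way is still the open question.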
The underlying problem is that pcieport thinks that the IRQ is going to
be one of the MSIs or a legacy interrupt, but it's actually a native
interrupt:
      CPU0  CPU1  CPU2  CPU3
 42:     0     0     0     0  GICv2            150     Level  nwl_pcie:misc
 45:     0     0     0     0  nwl_pcie:legacy  0       Level  PCIe PME, aerdrv
 46:    25     0     0     0  nwl_pcie:msi     524288  Edge   nvme0q0
 47:     0     0     0     0  nwl_pcie:msi     524289  Edge   nvme0q1
 48:     0     0     0     0  nwl_pcie:msi     524290  Edge   nvme0q2
 49:    46     0     0     0  nwl_pcie:msi     524291  Edge   nvme0q3
 50:     0     0     0     0  nwl_pcie:msi     524292  Edge   nvme0q4
In the above example, AER errors will trigger interrupt 42, not 45.
Actually, there are a bunch of different interrupts in MSGF_MISC_STATUS,
so maybe nwl_pcie_misc_handler should be an interrupt controller
instead? But even then pcie_port_enable_irq_vec() won't figure out the
correct IRQ. Any ideas on how to fix this?
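For reference, the "interrupt controller" idea amounts to demultiplexing the status word into per-bit handlers. Below is a rough userspace sketch of that dispatch. Everything named model_* is hypothetical, and a real implementation would presumably use an irq_domain with a chained handler rather than a plain function-pointer table. It also does nothing about the pcie_port_enable_irq_vec() problem.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_NR_MISC_IRQS 32 /* one hypothetical hwirq per status bit */

typedef void (*model_handler_t)(unsigned int hwirq, void *data);

/* Hypothetical stand-in for an interrupt domain: one slot per status bit. */
struct model_misc_domain {
	model_handler_t handler[MODEL_NR_MISC_IRQS];
	void *data[MODEL_NR_MISC_IRQS];
};

/* Register a handler for one status bit. */
static void model_request_irq(struct model_misc_domain *d, unsigned int hwirq,
			      model_handler_t fn, void *data)
{
	assert(hwirq < MODEL_NR_MISC_IRQS);
	d->handler[hwirq] = fn;
	d->data[hwirq] = data;
}

/*
 * Dispatch every set bit in 'status' to its handler.
 * Returns the mask of set bits that had no handler registered, so the
 * caller can still log (and rate-limit) them as is done today.
 */
static uint32_t model_demux(const struct model_misc_domain *d, uint32_t status)
{
	uint32_t unhandled = 0;

	for (unsigned int bit = 0; bit < MODEL_NR_MISC_IRQS; bit++) {
		uint32_t mask = UINT32_C(1) << bit;

		if (!(status & mask))
			continue;
		if (d->handler[bit])
			d->handler[bit](bit, d->data[bit]);
		else
			unhandled |= mask;
	}
	return unhandled;
}

/* Example handler: counts how many times it was invoked. */
static void model_count_handler(unsigned int hwirq, void *data)
{
	(void)hwirq;
	(*(unsigned int *)data)++;
}
```

With something like this, the AER service could in principle request the bit that carries the correctable error and run its usual handler, while unclaimed bits fall back to the existing logging path.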
Additionally, any tips on actually triggering AER/PME stuff in a
consistent way? Are there any off-the-shelf cards for sending weird PCIe
stuff over a link for testing? Right now all I have
--Sean
# lspci -vv
00:00.0 PCI bridge: Xilinx Corporation Device d011 (prog-if 00 [Normal decode])
	Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0
	Interrupt: pin A routed to IRQ 45
	Bus: primary=00, secondary=01, subordinate=0c, sec-latency=0
	I/O behind bridge: 00000000-00000fff [size=4K]
	Memory behind bridge: e0000000-e00fffff [size=1M]
	Prefetchable memory behind bridge: [disabled]
	Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- <SERR- <PERR-
	BridgeCtl: Parity- SERR+ NoISA- VGA- VGA16- MAbort- >Reset- FastB2B-
		PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
	Capabilities: [40] Power Management version 3
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1+,D2+,D3hot+,D3cold-)
		Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [60] Express (v2) Root Port (Slot-), MSI 00
		DevCap: MaxPayload 256 bytes, PhantFunc 0
			ExtTag- RBE+
		DevCtl: CorrErr+ NonFatalErr+ FatalErr+ UnsupReq+
			RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
			MaxPayload 128 bytes, MaxReadReq 512 bytes
		DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend+
		LnkCap: Port #0, Speed 5GT/s, Width x2, ASPM not supported
			ClockPM- Surprise- LLActRep+ BwNot+ ASPMOptComp+
		LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk-
			ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
		LnkSta: Speed 5GT/s (ok), Width x2 (ok)
			TrErr- Train- SlotClk+ DLActive+ BWMgmt- ABWMgmt+
		RootCap: CRSVisible-
		RootCtl: ErrCorrectable- ErrNon-Fatal- ErrFatal- PMEIntEna+ CRSVisible-
		RootSta: PME ReqID 0000, PMEStatus- PMEPending-
		DevCap2: Completion Timeout: Range B, TimeoutDis+ NROPrPrP- LTR-
			10BitTagComp- 10BitTagReq- OBFF Not Supported, ExtFmt- EETLPPrefix-
			EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
			FRS- LN System CLS Not Supported, TPHComp- ExtTPHComp- ARIFwd-
			AtomicOpsCap: Routing- 32bit- 64bit- 128bitCAS-
		DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR- OBFF Disabled, ARIFwd-
			AtomicOpsCtl: ReqEn- EgressBlck-
		LnkCtl2: Target Link Speed: 5GT/s, EnterCompliance- SpeedDis-
			Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
			Compliance De-emphasis: -6dB
		LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete- EqualizationPhase1-
			EqualizationPhase2- EqualizationPhase3- LinkEqualizationRequest-
			Retimer- 2Retimers- CrosslinkRes: unsupported
	Capabilities: [100 v1] Device Serial Number 00-00-00-00-00-00-00-00
	Capabilities: [10c v1] Virtual Channel
		Caps: LPEVC=0 RefClk=100ns PATEntryBits=1
		Arb: Fixed- WRR32- WRR64- WRR128-
		Ctrl: ArbSelect=Fixed
		Status: InProgress-
		VC0: Caps: PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
			Arb: Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256-
			Ctrl: Enable+ ID=0 ArbSelect=Fixed TC/VC=ff
			Status: NegoPending- InProgress-
	Capabilities: [128 v1] Vendor Specific Information: ID=1234 Rev=1 Len=018 <?>
	Capabilities: [140 v1] Advanced Error Reporting
		UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
		CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
		CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
		AERCap: First Error Pointer: 00, ECRCGenCap+ ECRCGenEn- ECRCChkCap+ ECRCChkEn-
			MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
		HeaderLog: 00000000 00000000 00000000 00000000
		RootCmd: CERptEn+ NFERptEn+ FERptEn+
		RootSta: CERcvd- MultCERcvd- UERcvd- MultUERcvd-
			FirstFatal- NonFatalMsg- FatalMsg- IntMsg 0
		ErrorSrc: ERR_COR: 0000 ERR_FATAL/NONFATAL: 0000
	Kernel driver in use: pcieport