lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [day] [month] [year] [list]
Message-ID: <53f6d267-62af-4dad-8fa7-a2a497b22636@linux.dev>
Date: Fri, 1 Aug 2025 13:43:19 -0400
From: Sean Anderson <sean.anderson@...ux.dev>
To: Lorenzo Pieralisi <lpieralisi@...nel.org>,
 Krzysztof WilczyƄski <kw@...ux.com>,
 Manivannan Sadhasivam <manivannan.sadhasivam@...aro.org>,
 linux-pci@...r.kernel.org
Cc: Rob Herring <robh@...nel.org>, Mahesh J Salgaonkar
 <mahesh@...ux.ibm.com>, Oliver O'Halloran <oohall@...il.com>,
 Bjorn Helgaas <bhelgaas@...gle.com>, Michal Simek <michal.simek@....com>,
 linux-arm-kernel@...ts.infradead.org, linux-kernel@...r.kernel.org
Subject: [BUG] pci: nwl: Unhandled AER correctable error

Hi,

AER correctable errors are pretty rare. I only saw one once before and
came up with commit 78457cae24cb ("PCI: xilinx-nwl: Rate-limit misc
interrupt messages") in response. I saw another today and,
unfortunately, clearing the correctable AER bit in MSGF_MISC_STATUS is
not sufficient to handle the IRQ. It gets immediately re-raised,
preventing the system from making any other progress. I suspect that it
needs to be cleared in PCI_ERR_ROOT_STATUS. But since the AER IRQ never
gets delivered to aer_irq, those registers never get tickled.

The underlying problem is that pcieport thinks that the IRQ is going to
be one of the MSIs or a legacy interrupt, but it's actually a native
interrupt:

           CPU0       CPU1       CPU2       CPU3       
 42:          0          0          0          0     GICv2 150 Level     nwl_pcie:misc
 45:          0          0          0          0  nwl_pcie:legacy   0 Level     PCIe PME, aerdrv
 46:         25          0          0          0  nwl_pcie:msi 524288 Edge      nvme0q0
 47:          0          0          0          0  nwl_pcie:msi 524289 Edge      nvme0q1
 48:          0          0          0          0  nwl_pcie:msi 524290 Edge      nvme0q2
 49:         46          0          0          0  nwl_pcie:msi 524291 Edge      nvme0q3
 50:          0          0          0          0  nwl_pcie:msi 524292 Edge      nvme0q4

In the above example, AER errors will trigger interrupt 42, not 45.
Actually, there are a bunch of different interrupts in MSGF_MISC_STATUS,
so maybe nwl_pcie_misc_handler should be an interrupt controller
instead? But even then pcie_port_enable_irq_vec() won't figure out the
correct IRQ. Any ideas on how to fix this?

Additionally, any tips on actually triggering AER/PME stuff in a
consistent way? Are there any off-the-shelf cards for sending weird PCIe
stuff over a link for testing? Right now all I have 

--Sean

# lspci -vv
00:00.0 PCI bridge: Xilinx Corporation Device d011 (prog-if 00 [Normal decode])
	Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0
	Interrupt: pin A routed to IRQ 45
	Bus: primary=00, secondary=01, subordinate=0c, sec-latency=0
	I/O behind bridge: 00000000-00000fff [size=4K]
	Memory behind bridge: e0000000-e00fffff [size=1M]
	Prefetchable memory behind bridge: [disabled]
	Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- <SERR- <PERR-
	BridgeCtl: Parity- SERR+ NoISA- VGA- VGA16- MAbort- >Reset- FastB2B-
		PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
	Capabilities: [40] Power Management version 3
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1+,D2+,D3hot+,D3cold-)
		Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [60] Express (v2) Root Port (Slot-), MSI 00
		DevCap:	MaxPayload 256 bytes, PhantFunc 0
			ExtTag- RBE+
		DevCtl:	CorrErr+ NonFatalErr+ FatalErr+ UnsupReq+
			RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
			MaxPayload 128 bytes, MaxReadReq 512 bytes
		DevSta:	CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend+
		LnkCap:	Port #0, Speed 5GT/s, Width x2, ASPM not supported
			ClockPM- Surprise- LLActRep+ BwNot+ ASPMOptComp+
		LnkCtl:	ASPM Disabled; RCB 64 bytes, Disabled- CommClk-
			ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
		LnkSta:	Speed 5GT/s (ok), Width x2 (ok)
			TrErr- Train- SlotClk+ DLActive+ BWMgmt- ABWMgmt+
		RootCap: CRSVisible-
		RootCtl: ErrCorrectable- ErrNon-Fatal- ErrFatal- PMEIntEna+ CRSVisible-
		RootSta: PME ReqID 0000, PMEStatus- PMEPending-
		DevCap2: Completion Timeout: Range B, TimeoutDis+ NROPrPrP- LTR-
			 10BitTagComp- 10BitTagReq- OBFF Not Supported, ExtFmt- EETLPPrefix-
			 EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
			 FRS- LN System CLS Not Supported, TPHComp- ExtTPHComp- ARIFwd-
			 AtomicOpsCap: Routing- 32bit- 64bit- 128bitCAS-
		DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR- OBFF Disabled, ARIFwd-
			 AtomicOpsCtl: ReqEn- EgressBlck-
		LnkCtl2: Target Link Speed: 5GT/s, EnterCompliance- SpeedDis-
			 Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
			 Compliance De-emphasis: -6dB
		LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete- EqualizationPhase1-
			 EqualizationPhase2- EqualizationPhase3- LinkEqualizationRequest-
			 Retimer- 2Retimers- CrosslinkRes: unsupported
	Capabilities: [100 v1] Device Serial Number 00-00-00-00-00-00-00-00
	Capabilities: [10c v1] Virtual Channel
		Caps:	LPEVC=0 RefClk=100ns PATEntryBits=1
		Arb:	Fixed- WRR32- WRR64- WRR128-
		Ctrl:	ArbSelect=Fixed
		Status:	InProgress-
		VC0:	Caps:	PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
			Arb:	Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256-
			Ctrl:	Enable+ ID=0 ArbSelect=Fixed TC/VC=ff
			Status:	NegoPending- InProgress-
	Capabilities: [128 v1] Vendor Specific Information: ID=1234 Rev=1 Len=018 <?>
	Capabilities: [140 v1] Advanced Error Reporting
		UESta:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UEMsk:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UESvrt:	DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
		CESta:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
		CEMsk:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
		AERCap:	First Error Pointer: 00, ECRCGenCap+ ECRCGenEn- ECRCChkCap+ ECRCChkEn-
			MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
		HeaderLog: 00000000 00000000 00000000 00000000
		RootCmd: CERptEn+ NFERptEn+ FERptEn+
		RootSta: CERcvd- MultCERcvd- UERcvd- MultUERcvd-
			 FirstFatal- NonFatalMsg- FatalMsg- IntMsg 0
		ErrorSrc: ERR_COR: 0000 ERR_FATAL/NONFATAL: 0000
	Kernel driver in use: pcieport

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ