Message-ID: <qvxxjffvet7lukrjina44y52bsim4grvjl3avtzjlh2rgncyl5@4z5k36wtad6u>
Date: Thu, 2 Oct 2025 02:08:09 -0700
From: Breno Leitao <leitao@...ian.org>
To: Bjorn Helgaas <helgaas@...nel.org>
Cc: pandoh@...gle.com, linux-pci@...r.kernel.org,
Karolina Stolarek <karolina.stolarek@...cle.com>, Weinan Liu <wnliu@...gle.com>,
Martin Petersen <martin.petersen@...cle.com>, Ben Fuller <ben.fuller@...cle.com>,
Drew Walton <drewwalton@...rosoft.com>, Anil Agrawal <anilagrawal@...a.com>,
Tony Luck <tony.luck@...el.com>, Ilpo Järvinen <ilpo.jarvinen@...ux.intel.com>,
Sathyanarayanan Kuppuswamy <sathyanarayanan.kuppuswamy@...ux.intel.com>, Lukas Wunner <lukas@...ner.de>,
Jonathan Cameron <Jonathan.Cameron@...wei.com>, Sargun Dhillon <sargun@...a.com>,
"Paul E . McKenney" <paulmck@...nel.org>, Mahesh J Salgaonkar <mahesh@...ux.ibm.com>,
Oliver O'Halloran <oohall@...il.com>, Kai-Heng Feng <kaihengf@...dia.com>,
Keith Busch <kbusch@...nel.org>, Robert Richter <rrichter@....com>,
Terry Bowman <terry.bowman@....com>, Shiju Jose <shiju.jose@...wei.com>,
Dave Jiang <dave.jiang@...el.com>, linux-kernel@...r.kernel.org, linuxppc-dev@...ts.ozlabs.org,
Bjorn Helgaas <bhelgaas@...gle.com>, kernel-team@...a.com, gustavold@...il.com
Subject: Re: [PATCH v8 18/20] PCI/AER: Ratelimit correctable and non-fatal
error logging
Hello Bjorn,
On Wed, Oct 01, 2025 at 04:38:36PM -0500, Bjorn Helgaas wrote:
> On Fri, Aug 01, 2025 at 06:16:29AM -0700, Breno Leitao wrote:
> > So, somehow the PCI was released at this point?
> For completeness, would you mind attaching the output of
> "sudo lspci -vvs 00:00.0" for one of these systems?
Sorry, I don't have this information anymore from the previous stack.
However, I've just hit a fresh instance of the issue, which seems related to the
one I gave you.
Please let me know if you need anything else from this instance and I will
capture it for you.
{1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 0
{1}[Hardware Error]: It has been corrected by h/w and requires no further action
{1}[Hardware Error]: event severity: corrected
{1}[Hardware Error]: Error 0, type: corrected
{1}[Hardware Error]: section_type: PCIe error
{1}[Hardware Error]: port_type: 4, root port
{1}[Hardware Error]: version: 3.0
{1}[Hardware Error]: command: 0x0540, status: 0x0010
{1}[Hardware Error]: device_id: 0000:00:00.0
{1}[Hardware Error]: slot: 255
{1}[Hardware Error]: secondary_bus: 0x00
{1}[Hardware Error]: vendor_id: 0x8086, device_id: 0x2020
{1}[Hardware Error]: class_code: 060000
{1}[Hardware Error]: aer_cor_status: 0x00000040, aer_cor_mask: 0x00000000
{1}[Hardware Error]: aer_uncor_status: 0x00000000, aer_uncor_mask: 0x00100000
{1}[Hardware Error]: aer_uncor_severity: 0x00062030
{1}[Hardware Error]: TLP Header: 00000000 00000000 00000000 00000000
Oops: general protection fault, probably for non-canonical address 0xdffffc0000000054: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN
KASAN: null-ptr-deref in range [0x00000000000002a0-0x00000000000002a7]
CPU: 0 UID: 0 PID: 11 Comm: kworker/0:1 Kdump: loaded Tainted: G S E N 6.16.1-0_debug_rc12_0_g1031909ae07a #1 PREEMPT(none)
Tainted: [S]=CPU_OUT_OF_SPEC, [E]=UNSIGNED_MODULE, [N]=TEST
Hardware name: Foo
Workqueue: events aer_recover_work_func
RIP: 0010:___ratelimit+0x36/0x750
Code: ec 38 49 89 f7 48 89 fb 65 48 8b 05 fc da 84 03 48 89 44 24 30 48 ba 00 00 00 00 00 fc ff df 48 83 c7 40 48 89 f8 48 c1 e8 03 <0f> b6 04 10 84 c0 0f 85 17 05 00 00 44 8b 63 40 4c 8d 6b 44 4c 89
RSP: 0018:ffff8882893c7a28 EFLAGS: 00010202
RAX: 0000000000000054 RBX: 0000000000000260 RCX: 0000000000000000
RDX: dffffc0000000000 RSI: ffffffff84b1cedb RDI: 00000000000002a0
RBP: 0000000000000002 R08: ffffffff87093aff R09: 1ffffffff0e1275f
R10: dffffc0000000000 R11: fffffbfff0e12760 R12: 0000000000000002
R13: dffffc0000000000 R14: ffff8882953ce000 R15: ffffffff84b1cedb
FS: 0000000000000000(0000) GS:ffff88905e498000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fd58a9f2000 CR3: 00000002bd08f002 CR4: 00000000007726f0
PKRU: 55555554
Call Trace:
<TASK>
? trace_aer_event+0x6a/0x1f0
pci_print_aer+0x30f/0x6c0
aer_recover_work_func+0x13b/0x200
? process_scheduled_works+0x959/0x1450
process_scheduled_works+0x9fe/0x1450
worker_thread+0x8fd/0xd10
? __kthread_parkme+0xd0/0x1c0
? _raw_spin_unlock_irqrestore+0x42/0xa0
? __kthread_parkme+0x92/0x1c0
kthread+0x50c/0x630
? pr_cont_work+0x4a0/0x4a0
? kthread_blkcg+0xa0/0xa0
ret_from_fork+0x1c6/0x390
? kthread_blkcg+0xa0/0xa0
ret_from_fork_asm+0x11/0x20
</TASK>
Modules linked in: sch_fq(E) tls(E) act_gact(E) tcp_diag(E) inet_diag(E) cls_bpf(E) intel_uncore_frequency(E) intel_uncore_frequency_common(E) skx_edac(E) skx_edac_common(E) nfit(E) libnvdimm(E) x86_pkg_temp_thermal(E) intel_powerclamp(E) coretemp(E) kvm_intel(E) mlx5_ib(E) iTCO_wdt(E) iTCO_vendor_support(E) xhci_pci(E) kvm(E) evdev(E) irqbypass(E) acpi_cpufreq(E) ib_uverbs(E) i2c_i801(E) xhci_hcd(E) ipmi_si(E) i2c_smbus(E) mlx5_fwctl(E) ipmi_devintf(E) wmi(E) ipmi_msghandler(E) button(E) sch_fq_codel(E) bpf_preload(E) ip6_tables(E) vhost_net(E) tun(E) vhost(E) vhost_iotlb(E) tap(E) mpls_gso(E) mpls_iptunnel(E) mpls_router(E) fou(E) loop(E) drm(E) backlight(E) drm_panel_orientation_quirks(E) autofs4(E) efivarfs(E)
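For anyone decoding the oops above, the faulting address can be mapped back to the range KASAN reports with a little arithmetic. What follows is only a minimal Python sketch, assuming the generic KASAN mapping shadow = (addr >> 3) + offset and the x86-64 offset 0xdffffc0000000000 (the value seen in RDX/R10/R13 above). The helper name shadow_to_range is hypothetical.

```python
# Assumption: generic KASAN shadow mapping on x86-64,
#   shadow_addr = (addr >> 3) + KASAN_SHADOW_OFFSET
KASAN_SHADOW_OFFSET = 0xdffffc0000000000  # matches RDX in the register dump
KASAN_SHADOW_SCALE_SHIFT = 3              # one shadow byte covers 8 bytes


def shadow_to_range(fault_addr):
    """Map a faulting shadow address back to the 8-byte range it covers."""
    base = (fault_addr - KASAN_SHADOW_OFFSET) << KASAN_SHADOW_SCALE_SHIFT
    return base, base + (1 << KASAN_SHADOW_SCALE_SHIFT) - 1


# Values copied from the oops above.
fault_addr = 0xdffffc0000000054
rdi = 0x00000000000002a0
rbx = 0x0000000000000260

lo, hi = shadow_to_range(fault_addr)
print(hex(lo), hex(hi))   # 0x2a0 0x2a7
print(hex(rdi - rbx))     # 0x40
```

The result, 0x2a0 to 0x2a7, matches the KASAN line, and RDI - RBX = 0x40, so the pointer handed to ___ratelimit() appears to have been 0x260 rather than a valid kernel address. Which structure that offset belongs to is not established by this trace alone.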
And here is the lspci output you requested:
# sudo lspci -vvs 00:00.0
00:00.0 Host bridge: Intel Corporation Sky Lake-E DMI3 Registers (rev 04)
Subsystem: Intel Corporation Device 0000
Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Interrupt: pin A routed to IRQ 0
Capabilities: [90] Express (v2) Root Port (Slot-), IntMsgNum 0
DevCap: MaxPayload 256 bytes, PhantFunc 0
ExtTag- RBE+ TEE-IO-
DevCtl: CorrErr- NonFatalErr- FatalErr- UnsupReq-
RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
MaxPayload 128 bytes, MaxReadReq 128 bytes
DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
LnkCap: Port #0, Speed 8GT/s, Width x4, ASPM not supported
ClockPM- Surprise+ LLActRep+ BwNot+ ASPMOptComp+
LnkCtl: ASPM Disabled; RCB 64 bytes, LnkDisable- CommClk-
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed unknown, Width x0
TrErr- Train- SlotClk- DLActive- BWMgmt- ABWMgmt-
RootCap: CRSVisible-
RootCtl: ErrCorrectable+ ErrNon-Fatal+ ErrFatal+ PMEIntEna- CRSVisible-
RootSta: PME ReqID 0000, PMEStatus- PMEPending-
DevCap2: Completion Timeout: Range BCD, TimeoutDis+ NROPrPrP- LTR-
10BitTagComp- 10BitTagReq- OBFF Not Supported, ExtFmt- EETLPPrefix-
EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
FRS- LN System CLS Not Supported, TPHComp+ ExtTPHComp- ARIFwd-
AtomicOpsCap: Routing- 32bit+ 64bit+ 128bitCAS+
DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- ARIFwd-
AtomicOpsCtl: ReqEn- EgressBlck-
IDOReq- IDOCompl- LTR- EmergencyPowerReductionReq-
10BitTagReq- OBFF Disabled, EETLPPrefixBlk-
LnkCap2: Supported Link Speeds: 2.5-8GT/s, Crosslink- Retimer- 2Retimers- DRS-
LnkCtl2: Target Link Speed: 2.5GT/s, EnterCompliance- SpeedDis-
Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
Compliance Preset/De-emphasis: -6dB de-emphasis, 0dB preshoot
LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete- EqualizationPhase1-
EqualizationPhase2- EqualizationPhase3- LinkEqualizationRequest-
Retimer- 2Retimers- CrosslinkRes: unsupported
Capabilities: [e0] Power Management version 3
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [100 v1] Vendor Specific Information: ID=0002 Rev=0 Len=00c <?>
Capabilities: [144 v1] Vendor Specific Information: ID=0004 Rev=1 Len=03c <?>
Capabilities: [1d0 v1] Vendor Specific Information: ID=0003 Rev=1 Len=00a <?>
Capabilities: [250 v1] Secondary PCI Express
LnkCtl3: LnkEquIntrruptEn- PerformEqu-
LaneErrStat: LaneErr at lane: 0 1 2
Capabilities: [280 v1] Vendor Specific Information: ID=0005 Rev=3 Len=018 <?>
Capabilities: [300 v1] Vendor Specific Information: ID=0008 Rev=0 Len=038 <?>
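To make the capability list above easier to scan, here is a small Python sketch that pulls the "Capabilities:" lines out of lspci -vv text and checks whether any of them is named "Advanced Error Reporting" (the string lspci normally prints for the AER extended capability). The helper name capability_names is hypothetical, and the sample text is just the capability header lines copied from the output above. Whether the result is related to the crash is not established in this message.

```python
import re

# Capability header lines copied from the lspci output above.
SAMPLE = """\
Capabilities: [90] Express (v2) Root Port (Slot-), IntMsgNum 0
Capabilities: [e0] Power Management version 3
Capabilities: [100 v1] Vendor Specific Information: ID=0002 Rev=0 Len=00c <?>
Capabilities: [144 v1] Vendor Specific Information: ID=0004 Rev=1 Len=03c <?>
Capabilities: [1d0 v1] Vendor Specific Information: ID=0003 Rev=1 Len=00a <?>
Capabilities: [250 v1] Secondary PCI Express
Capabilities: [280 v1] Vendor Specific Information: ID=0005 Rev=3 Len=018 <?>
Capabilities: [300 v1] Vendor Specific Information: ID=0008 Rev=0 Len=038 <?>
"""

CAP_RE = re.compile(r"\s*Capabilities: \[([0-9a-f]+)(?: v\d+)?\] (.+)")


def capability_names(lspci_text):
    """Return (offset, description) for each 'Capabilities:' line."""
    caps = []
    for line in lspci_text.splitlines():
        m = CAP_RE.match(line)
        if m:
            caps.append((int(m.group(1), 16), m.group(2).strip()))
    return caps


caps = capability_names(SAMPLE)
has_aer = any("Advanced Error Reporting" in name for _, name in caps)

for offset, name in caps:
    print(f"{offset:#05x}  {name}")
print("AER capability listed:", has_aer)
```

For the lines above this reports eight capabilities and no "Advanced Error Reporting" entry.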
thanks
--breno