Message-ID: <8edcc464-c467-4e83-a93b-19b92a2cf193@kernel.org>
Date: Tue, 7 Oct 2025 16:33:53 -0500
From: Mario Limonciello <superm1@...nel.org>
To: Bert Karwatzki <spasswolf@....de>, linux-kernel@...r.kernel.org
Cc: linux-next@...r.kernel.org, linux-stable@...r.kernel.org,
regressions@...ts.linux.dev, linux-pci@...r.kernel.org,
linux-acpi@...r.kernel.org, Christian König
<christian.koenig@....com>, "Rafael J . Wysocki" <rafael.j.wysocki@...el.com>
Subject: Re: [REGRESSION 00/04] Crash during resume of pcie bridge
On 10/6/25 7:09 AM, Bert Karwatzki wrote:
> Since Linux v6.15 I have been experiencing random crashes on my MSI Alpha 15 laptop
> running Debian trixie (amd64). The first such crash happened around mid-June,
> and as there were no useful log messages, even with netconsole, I suspected
> faulty hardware. So I ran memtest86+,
> found a faulty address line and replaced the memory (unfortunately going from 64G to 16G).
> But the crashes occurred again, so I did a thorough investigation.
>
> The crashes occur after 30min to 33h (yes, hours) of uptime and consist of a
> sudden reboot after which the PCI bridge at 00:02.4 and the nvme device
> connected to it are missing. If there's sound running during the crash then the
> first sign of the crash is the sound looping like a broken record for about 2s,
> after which the reboot happens. With the missing nvme device the reboot drops to
> a rescue shell. Using "shutdown -h now" from that shell and starting the laptop
> with the power button restores the missing PCI bridge and nvme device.
>
> The hardware is the following (it's a dual GPU laptop where the GUI
> runs on the built-in GPU):
>
> $ cat /proc/cpuinfo
> processor : 0
> vendor_id : AuthenticAMD
> cpu family : 25
> model : 80
> model name : AMD Ryzen 7 5800H with Radeon Graphics
> stepping : 0
> microcode : 0xa50000c
> cpu MHz : 3394.238
> cache size : 512 KB
> physical id : 0
> siblings : 16
> core id : 0
> cpu cores : 8
> apicid : 0
> initial apicid : 0
> fpu : yes
> fpu_exception : yes
> cpuid level : 16
> wp : yes
> flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr rdpru wbnoinvd cppc arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl umip pku ospke vaes vpclmulqdq rdpid overflow_recov succor smca fsrm debug_swap
> bugs : sysret_ss_attrs spectre_v1 spectre_v2 spec_store_bypass srso ibpb_no_ret
> bogomips : 6388.57
> TLB size : 2560 4K pages
> clflush size : 64
> cache_alignment : 64
> address sizes : 48 bits physical, 48 bits virtual
> power management: ts ttp tm hwpstate cpb eff_freq_ro [13] [14]
>
> $ lspci -nn
> 00:00.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Renoir/Cezanne Root Complex [1022:1630]
> 00:00.2 IOMMU [0806]: Advanced Micro Devices, Inc. [AMD] Renoir/Cezanne IOMMU [1022:1631]
> 00:01.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Renoir PCIe Dummy Host Bridge [1022:1632]
> 00:01.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Renoir PCIe GPP Bridge [1022:1633]
> 00:02.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Renoir PCIe Dummy Host Bridge [1022:1632]
> 00:02.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Renoir/Cezanne PCIe GPP Bridge [1022:1634]
> 00:02.2 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Renoir/Cezanne PCIe GPP Bridge [1022:1634]
> 00:02.3 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Renoir/Cezanne PCIe GPP Bridge [1022:1634]
> 00:02.4 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Renoir/Cezanne PCIe GPP Bridge [1022:1634]
> 00:08.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Renoir PCIe Dummy Host Bridge [1022:1632]
> 00:08.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Renoir Internal PCIe GPP Bridge to Bus [1022:1635]
> 00:14.0 SMBus [0c05]: Advanced Micro Devices, Inc. [AMD] FCH SMBus Controller [1022:790b] (rev 51)
> 00:14.3 ISA bridge [0601]: Advanced Micro Devices, Inc. [AMD] FCH LPC Bridge [1022:790e] (rev 51)
> 00:18.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Cezanne Data Fabric; Function 0 [1022:166a]
> 00:18.1 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Cezanne Data Fabric; Function 1 [1022:166b]
> 00:18.2 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Cezanne Data Fabric; Function 2 [1022:166c]
> 00:18.3 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Cezanne Data Fabric; Function 3 [1022:166d]
> 00:18.4 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Cezanne Data Fabric; Function 4 [1022:166e]
> 00:18.5 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Cezanne Data Fabric; Function 5 [1022:166f]
> 00:18.6 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Cezanne Data Fabric; Function 6 [1022:1670]
> 00:18.7 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Cezanne Data Fabric; Function 7 [1022:1671]
> 01:00.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 XL Upstream Port of PCI Express Switch [1002:1478] (rev c3)
> 02:00.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 XL Downstream Port of PCI Express Switch [1002:1479]
> 03:00.0 Display controller [0380]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 23 [Radeon RX 6600/6600 XT/6600M] [1002:73ff] (rev c3)
> 03:00.1 Audio device [0403]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 21/23 HDMI/DP Audio Controller [1002:ab28]
> 04:00.0 Network controller [0280]: MEDIATEK Corp. MT7921K (RZ608) Wi-Fi 6E 80MHz [14c3:0608]
> 05:00.0 Ethernet controller [0200]: Realtek Semiconductor Co., Ltd. RTL8111/8168/8211/8411 PCI Express Gigabit Ethernet Controller [10ec:8168] (rev 15)
> 06:00.0 Non-Volatile memory controller [0108]: Kingston Technology Company, Inc. KC3000/FURY Renegade NVMe SSD [E18] [2646:5013] (rev 01)
> 07:00.0 Non-Volatile memory controller [0108]: Micron/Crucial Technology P1 NVMe PCIe SSD[Frampton] [c0a9:2263] (rev 03)
> 08:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Cezanne [Radeon Vega Series / Radeon Vega Mobile Series] [1002:1638] (rev c5)
> 08:00.1 Audio device [0403]: Advanced Micro Devices, Inc. [AMD/ATI] Renoir Radeon High Definition Audio Controller [1002:1637]
> 08:00.2 Encryption controller [1080]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 10h-1fh) Platform Security Processor [1022:15df]
> 08:00.3 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD] Renoir/Cezanne USB 3.1 [1022:1639]
> 08:00.4 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD] Renoir/Cezanne USB 3.1 [1022:1639]
> 08:00.5 Multimedia controller [0480]: Advanced Micro Devices, Inc. [AMD] Audio Coprocessor [1022:15e2] (rev 01)
> 08:00.6 Audio device [0403]: Advanced Micro Devices, Inc. [AMD] Family 17h/19h/1ah HD Audio Controller [1022:15e3]
> 08:00.7 Signal processing controller [1180]: Advanced Micro Devices, Inc. [AMD] Sensor Fusion Hub [1022:15e4]
>
> These devices are attached to the PCI bus like this:
>
> $ lspci -t
> -[0000:00]-+-00.0
> +-00.2
> +-01.0
> +-01.1-[01-03]----00.0-[02-03]----00.0-[03]--+-00.0 // This is the bridge which causes the crash
> | \-00.1
> +-02.0
> +-02.1-[04]----00.0
> +-02.2-[05]----00.0
> +-02.3-[06]----00.0
> +-02.4-[07]----00.0 // These are the bridge and nvme device which disappear after the crash.
> +-08.0
> +-08.1-[08]--+-00.0
> | +-00.1
> | +-00.2
> | +-00.3
> | +-00.4
> | +-00.5
> | +-00.6
> | \-00.7
> +-14.0
> +-14.3
> +-18.0
> +-18.1
> +-18.2
> +-18.3
> +-18.4
> +-18.5
> +-18.6
> \-18.7
>
> I tried to bisect this between v6.14 and v6.15 but due to the wildly varying time
> it takes to trigger the bug the bisections were not successful. Nevertheless they
> gave lots of data about affected and non-affected versions of the Linux kernel,
> and it's quite likely that version v6.14 is indeed free of the bug.
>
> Here's an almost complete list of tested versions:
> (Somewhat) sorted (by kernel version, 6.14.0-rc* kernels are from attempted bisections
> between v6.14 and v6.15)
> v6.14.0 no crash after 16h
> v6.14.11 no crash after 7.5h
> 6.14.0-rc1-bisect-00003-g541ddf31e300 booted 12:24, 22.8.2025, no crash after {48h, 17h}
> 6.14.0-rc1-mystery-00134-gcc28c0e5e725 booted 11:42, 5.8.2025, no crash after 10.5h
> 6.14.0-rc1-mystery-00198-gd7f6f07ecec9 booted 22:27, 5.8.2025, no crash after 12h
> 6.14.0-rc4-mystery-01022-gab498828fad7 booted 21:04, 3.8.2025, no crash after {14h, 24h}
> 6.14.0-rc4-mystery-01427-g7547510d4a91 booted 11:11, 4.8.2025, no crash after {13h, 23h}
> 6.14.0-rc6-mystery-01641-g0f04462874e1 booted 00:26, 5.8.2025, no crash after {11h, 24h}
> 6.14.0-mystery-00826-g327ecdbc0fda no crash after {16h, 17h, 6.5h}
> ############## here the crashes start (time to each crash, crashes do not always occur) ########
> 6.14.0-bisect-01053-gebfb94d87b35 booted 10:15, 20.8.2025 crash after ~33h
> 6.14.0-mystery-09584-g7d06015d936c crash 20.44 3.8.2025 after 7h
> 6.14.0-mystery-11703-geb0ece16027f crash 13.22 3.8.2025 after 1.75h
> 6.15.0 crashed around 15-17.6.2025, unknown uptime (This is the first crash!)
> 6.15.0-nort crash after 6.75h
> 6.16-rc4 (next-20250627) crash after ~4h
> 6.16-rc4 (next-20250630) crash after ~5h
> 6.16-rc4 (next-20250703) crash after ~2.5h (sound buffer repeated for ~1s before restarting)
> 6.16-rc6 (next-20250718) crash after {2h, 2h}
> 6.16-rc7 (next-20250721) crash after {~30min, 2h, 5.5h}
> 6.16.0-nortlockdep crash after 4h
> 6.17.0-rc4-next-20250902-master booted 8:36, 3.9.2025, crash after ~3.5h
> 6.17.0-rc5-next-20250908-master booted 10:25, 9.9.2025, crash after {~6.5h, 14h}
> 6.17.0-rc6-next-20250917-acpidebug booted 12:41, 20.9.2025, crash 15:22 20.9.2025 (~3h, 647 GPP notifies)
> The versions below contain additional debugging printk()s and dev_info()s.
> The details of these debugging statements are explained below.
> 6.17.0-rc6-next-20250917-gpudebug-00018-g7a38b625a003 booted 12:58, 26.9.2025, crash 12:01, 27.9.2025 (~23h, 1500 GPP notifies)
> 6.17.0-rc6-next-20250917-gpudebug-00021-gab98d880e3c8 booted 23:52, 28.9.2025, crash 2:25, 30.9.2025 (26.5h, 1504GPP0, 889GPP2)
> 6.17.0-rc6-next-20250917-gpudebug-00024-g5c6b49b810db booted 9:10, 2.10.2025, 60h 3093 GPP0 notifies without crash (too many printk()s?)
> 6.17.0-rc6-next-20250917-gpudebug-00028-gf99cf81b1da7 booted 21:21, 4.10.2025 first try stopped after 77min due to hung tasks
> 6.17.0-rc6-next-20250917-gpudebug-00028-gf99cf81b1da7 booted 23:37, 4.10.2025 crash 4:52, 6.10.2025 (~27.5h)
> 6.17.0-rc6-next-20250917-gpudebug-00029-ge797f42363d1 booted 13:00, 6.10.2025 currently testing
>
> As the bisections were not successful I tried to monitor the crash using
> netconsole and CONFIG_ACPI_DEBUG and "acpi.debug_layer=0xf acpi.debug_level=0x107"
> as command line parameters. With this the last message on netconsole before
> the crash is usually:
>
> [21465.639279] [ T251] evmisc-0132 ev_queue_notify_reques: Dispatching Notify on [GPP0] (Device) Value 0x00 (Bus Check) Node 00000000f81f36b8
>
> GPP0 is the ACPI name of this PCI bridge (at least that's my best guess):
>
> 00:01.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Renoir PCIe GPP Bridge [1022:1633]
>
> to which the discrete GPU is connected
>
> 03:00.0 Display controller [0380]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 23 [Radeon RX 6600/6600 XT/6600M] [1002:73ff] (rev c3)
>
> via the pci express switch
>
> 01:00.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 XL Upstream Port of PCI Express Switch [1002:1478] (rev c3)
> 02:00.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 XL Downstream Port of PCI Express Switch [1002:1479]
>
> While the GUI (xfce on xorg) on my laptop runs on the built-in GPU the discrete
> GPU usually wakes up quite often, e.g. when a window is opened or when scrolling down on youtube.
>
> A somewhat reliable method to generate GPP0 notifies is putting on a YouTube
> video and then periodically starting evolution with this script:
>
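> #!/bin/bash
> # Start and kill evolution once a minute to make the discrete GPU wake up.
> for i in {0..1000}
> do
> echo $i            # iteration counter
> evolution &        # launch in the background
> sleep 5            # give it time to come up
> killall evolution
> sleep 55           # wait before the next cycle
> done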
>
> This is also the method I used to test the debug kernel in the following mails.
>
> Bert Karwatzki
Given that the perpetrator and victim here don't share a common upstream root
port (the only thing they have in common is the root complex), I wonder if this is actually
an issue with something non-obvious like the IOMMU.
Can you still reproduce with amd_iommu=off?
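In case it helps with that test, here is a minimal sketch for confirming that
the parameter actually reached the running kernel before starting a long run.
The helper name has_iommu_off is made up for illustration; amd_iommu=off is the
documented kernel parameter, and /proc/cmdline is where the booted command line
can be read back.

```shell
#!/bin/bash
# Hypothetical helper (name is an assumption): succeeds if the given kernel
# command line contains amd_iommu=off as a whole, space-separated parameter.
has_iommu_off() {
    local cmdline="$1" param
    for param in $cmdline; do
        [ "$param" = "amd_iommu=off" ] && return 0
    done
    return 1
}

# Check the command line of the currently running kernel.
if has_iommu_off "$(cat /proc/cmdline 2>/dev/null)"; then
    echo "amd_iommu=off is on the kernel command line"
else
    echo "amd_iommu=off is NOT on the kernel command line"
fi
```

Assuming the default GRUB setup on Debian, the parameter would normally be added
to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub, followed by update-grub and
a reboot.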