Message-ID: <20251006120944.7880-1-spasswolf@web.de>
Date: Mon,  6 Oct 2025 14:09:39 +0200
From: Bert Karwatzki <spasswolf@....de>
To: linux-kernel@...r.kernel.org
Cc: Bert Karwatzki <spasswolf@....de>,
	linux-next@...r.kernel.org,
	linux-stable@...r.kernel.org,
	regressions@...ts.linux.dev,
	linux-pci@...r.kernel.org,
	linux-acpi@...r.kernel.org,
	Mario Limonciello <superm1@...nel.org>,
	Christian König <christian.koenig@....com>,
	"Rafael J . Wysocki" <rafael.j.wysocki@...el.com>
Subject: [REGRESSION 00/04] Crash during resume of pcie bridge

Since Linux v6.15 I have been experiencing random crashes on my MSI Alpha 15 laptop
running Debian trixie (amd64). The first such crash happened around the middle
of June, and as there were no useful log messages, not even via netconsole,
I suspected faulty hardware. So I ran memtest86+, found a faulty address line
and replaced the memory (unfortunately going from 64G down to 16G).
But the crashes occurred again, so I did a thorough investigation.

The crashes occur after 30min to 33h (yes, hours) of uptime and consist of a
sudden reboot after which the PCI bridge at 00:02.4 and the nvme device 
connected to it are missing. If there's sound running during the crash then the
first sign of the crash is the sound looping like a broken record for about 2s,
after which the reboot happens. With the missing nvme device the reboot drops to
a rescue shell. Using "shutdown -h now" from that shell and starting the laptop
with the power button restores the missing PCI bridge and nvme device.
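
A minimal sketch of how the "missing bridge and nvme device" state could be
checked from the rescue shell, assuming sysfs is mounted there. The function
name missing_pci_devices is hypothetical; the PCI addresses are the ones from
the lspci output in this mail.

```bash
#!/bin/bash
# Report which of the given PCI functions are currently NOT enumerated.
# $1: sysfs mount point (normally /sys)
# remaining arguments: PCI addresses in domain:bus:dev.fn form
# Returns 0 if all devices are present, 1 if at least one is missing.
missing_pci_devices() {
	local sysfs="$1"; shift
	local bdf missing=0
	for bdf in "$@"; do
		if [ ! -e "${sysfs}/bus/pci/devices/${bdf}" ]; then
			echo "missing: ${bdf}"
			missing=1
		fi
	done
	return "$missing"
}
```

For the machine described here the call would be
"missing_pci_devices /sys 0000:00:02.4 0000:07:00.0".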

The hardware is the following (it's a dual GPU laptop where the GUI
runs on the built-in GPU):

$ cat /proc/cpuinfo 
processor	: 0
vendor_id	: AuthenticAMD
cpu family	: 25
model		: 80
model name	: AMD Ryzen 7 5800H with Radeon Graphics
stepping	: 0
microcode	: 0xa50000c
cpu MHz		: 3394.238
cache size	: 512 KB
physical id	: 0
siblings	: 16
core id		: 0
cpu cores	: 8
apicid		: 0
initial apicid	: 0
fpu		: yes
fpu_exception	: yes
cpuid level	: 16
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr rdpru wbnoinvd cppc arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl umip pku ospke vaes vpclmulqdq rdpid overflow_recov succor smca fsrm debug_swap
bugs		: sysret_ss_attrs spectre_v1 spectre_v2 spec_store_bypass srso ibpb_no_ret
bogomips	: 6388.57
TLB size	: 2560 4K pages
clflush size	: 64
cache_alignment	: 64
address sizes	: 48 bits physical, 48 bits virtual
power management: ts ttp tm hwpstate cpb eff_freq_ro [13] [14]

$ lspci -nn
00:00.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Renoir/Cezanne Root Complex [1022:1630]
00:00.2 IOMMU [0806]: Advanced Micro Devices, Inc. [AMD] Renoir/Cezanne IOMMU [1022:1631]
00:01.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Renoir PCIe Dummy Host Bridge [1022:1632]
00:01.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Renoir PCIe GPP Bridge [1022:1633]
00:02.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Renoir PCIe Dummy Host Bridge [1022:1632]
00:02.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Renoir/Cezanne PCIe GPP Bridge [1022:1634]
00:02.2 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Renoir/Cezanne PCIe GPP Bridge [1022:1634]
00:02.3 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Renoir/Cezanne PCIe GPP Bridge [1022:1634]
00:02.4 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Renoir/Cezanne PCIe GPP Bridge [1022:1634]
00:08.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Renoir PCIe Dummy Host Bridge [1022:1632]
00:08.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Renoir Internal PCIe GPP Bridge to Bus [1022:1635]
00:14.0 SMBus [0c05]: Advanced Micro Devices, Inc. [AMD] FCH SMBus Controller [1022:790b] (rev 51)
00:14.3 ISA bridge [0601]: Advanced Micro Devices, Inc. [AMD] FCH LPC Bridge [1022:790e] (rev 51)
00:18.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Cezanne Data Fabric; Function 0 [1022:166a]
00:18.1 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Cezanne Data Fabric; Function 1 [1022:166b]
00:18.2 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Cezanne Data Fabric; Function 2 [1022:166c]
00:18.3 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Cezanne Data Fabric; Function 3 [1022:166d]
00:18.4 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Cezanne Data Fabric; Function 4 [1022:166e]
00:18.5 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Cezanne Data Fabric; Function 5 [1022:166f]
00:18.6 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Cezanne Data Fabric; Function 6 [1022:1670]
00:18.7 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Cezanne Data Fabric; Function 7 [1022:1671]
01:00.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 XL Upstream Port of PCI Express Switch [1002:1478] (rev c3)
02:00.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 XL Downstream Port of PCI Express Switch [1002:1479]
03:00.0 Display controller [0380]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 23 [Radeon RX 6600/6600 XT/6600M] [1002:73ff] (rev c3)
03:00.1 Audio device [0403]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 21/23 HDMI/DP Audio Controller [1002:ab28]
04:00.0 Network controller [0280]: MEDIATEK Corp. MT7921K (RZ608) Wi-Fi 6E 80MHz [14c3:0608]
05:00.0 Ethernet controller [0200]: Realtek Semiconductor Co., Ltd. RTL8111/8168/8211/8411 PCI Express Gigabit Ethernet Controller [10ec:8168] (rev 15)
06:00.0 Non-Volatile memory controller [0108]: Kingston Technology Company, Inc. KC3000/FURY Renegade NVMe SSD [E18] [2646:5013] (rev 01)
07:00.0 Non-Volatile memory controller [0108]: Micron/Crucial Technology P1 NVMe PCIe SSD[Frampton] [c0a9:2263] (rev 03)
08:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Cezanne [Radeon Vega Series / Radeon Vega Mobile Series] [1002:1638] (rev c5)
08:00.1 Audio device [0403]: Advanced Micro Devices, Inc. [AMD/ATI] Renoir Radeon High Definition Audio Controller [1002:1637]
08:00.2 Encryption controller [1080]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 10h-1fh) Platform Security Processor [1022:15df]
08:00.3 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD] Renoir/Cezanne USB 3.1 [1022:1639]
08:00.4 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD] Renoir/Cezanne USB 3.1 [1022:1639]
08:00.5 Multimedia controller [0480]: Advanced Micro Devices, Inc. [AMD] Audio Coprocessor [1022:15e2] (rev 01)
08:00.6 Audio device [0403]: Advanced Micro Devices, Inc. [AMD] Family 17h/19h/1ah HD Audio Controller [1022:15e3]
08:00.7 Signal processing controller [1180]: Advanced Micro Devices, Inc. [AMD] Sensor Fusion Hub [1022:15e4]

These devices are attached to the PCI bus like this:

$ lspci -t
-[0000:00]-+-00.0
           +-00.2
           +-01.0
           +-01.1-[01-03]----00.0-[02-03]----00.0-[03]--+-00.0 // This is the bridge which causes the crash
           |                                            \-00.1
           +-02.0
           +-02.1-[04]----00.0
           +-02.2-[05]----00.0
           +-02.3-[06]----00.0
           +-02.4-[07]----00.0 // These are the bridge and nvme device which disappear after the crash.
           +-08.0
           +-08.1-[08]--+-00.0
           |            +-00.1
           |            +-00.2
           |            +-00.3
           |            +-00.4
           |            +-00.5
           |            +-00.6
           |            \-00.7
           +-14.0
           +-14.3
           +-18.0
           +-18.1
           +-18.2
           +-18.3
           +-18.4
           +-18.5
           +-18.6
           \-18.7

I tried to bisect this between v6.14 and v6.15, but due to the wildly varying time
it takes to trigger the bug the bisections were not successful. Nevertheless they
gave lots of data about affected and non-affected versions of the Linux kernel,
and it's quite likely that v6.14 is indeed free of the bug.

Here's an almost complete list of tested versions, (somewhat) sorted by kernel
version (the 6.14.0-rc* kernels are from attempted bisections between v6.14
and v6.15):
v6.14.0							no crash after 16h
v6.14.11						no crash after 7.5h
6.14.0-rc1-bisect-00003-g541ddf31e300			booted 12:24, 22.8.2025, no crash after {48h, 17h}
6.14.0-rc1-mystery-00134-gcc28c0e5e725			booted 11:42, 5.8.2025, no crash after 10.5h
6.14.0-rc1-mystery-00198-gd7f6f07ecec9			booted 22:27, 5.8.2025, no crash after 12h
6.14.0-rc4-mystery-01022-gab498828fad7			booted 21:04, 3.8.2025, no crash after {14h, 24h} 
6.14.0-rc4-mystery-01427-g7547510d4a91			booted 11:11, 4.8.2025, no crash after {13h, 23h}
6.14.0-rc6-mystery-01641-g0f04462874e1			booted 00:26, 5.8.2025, no crash after {11h, 24h}
6.14.0-mystery-00826-g327ecdbc0fda			no crash after {16h, 17h, 6.5h}
############## here the crashes start (time to each crash, crashes do not always occur) ########
6.14.0-bisect-01053-gebfb94d87b35			booted 10:15, 20.8.2025 crash after ~33h
6.14.0-mystery-09584-g7d06015d936c			crash 20.44 3.8.2025 after 7h
6.14.0-mystery-11703-geb0ece16027f      		crash 13.22 3.8.2025 after 1.75h
6.15.0							crashed around 15-17.6.2025, unknown uptime (This is the first crash!)
6.15.0-nort  						crash after 6.75h
6.16-rc4 (next-20250627)				crash after ~4h
6.16-rc4 (next-20250630)				crash after ~5h
6.16-rc4 (next-20250703) 				crash after ~2.5h (sound buffer repeated for ~1s before restarting) 	
6.16-rc6 (next-20250718)				crash after {2h, 2h}
6.16-rc7 (next-20250721)				crash after {~30min, 2h, 5.5h}
6.16.0-nortlockdep					crash after 4h
6.17.0-rc4-next-20250902-master				booted 8:36, 3.9.2025, crash after ~3.5h
6.17.0-rc5-next-20250908-master				booted 10:25, 9.9.2025, crash after {~6.5h, 14h}
6.17.0-rc6-next-20250917-acpidebug 			booted 12:41, 20.9.2025, crash 15:22 20.9.2025 (~3h, 647 GPP notifies)
The versions below contain additional debugging printk()s and dev_info()s.
The details of these debugging statements are explained below.
6.17.0-rc6-next-20250917-gpudebug-00018-g7a38b625a003	booted 12:58, 26.9.2025, crash 12:01, 27.9.2025 (~23h, 1500 GPP notifies)
6.17.0-rc6-next-20250917-gpudebug-00021-gab98d880e3c8	booted 23:52, 28.9.2025, crash 2:25, 30.9.2025 (26.5h, 1504GPP0, 889GPP2)
6.17.0-rc6-next-20250917-gpudebug-00024-g5c6b49b810db	booted 9:10, 2.10.2025, 60h 3093 GPP0 notifies without crash (too many printk()s?)
6.17.0-rc6-next-20250917-gpudebug-00028-gf99cf81b1da7	booted 21:21, 4.10.2025 first try stopped after 77min due to hung tasks
6.17.0-rc6-next-20250917-gpudebug-00028-gf99cf81b1da7	booted 23:37, 4.10.2025 crash 4:52, 6.10.2025 (~27.5h)
6.17.0-rc6-next-20250917-gpudebug-00029-ge797f42363d1	booted 13:00, 6.10.2025 currently testing

As the bisections were not successful I tried to monitor the crash using
netconsole with CONFIG_ACPI_DEBUG enabled and "acpi.debug_layer=0xf acpi.debug_level=0x107"
as command line parameters. With this the last message on netconsole before
the crash is usually:

[21465.639279] [    T251]    evmisc-0132 ev_queue_notify_reques: Dispatching Notify on [GPP0] (Device) Value 0x00 (Bus Check) Node 00000000f81f36b8
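
The notify counts given in the version list could be obtained from a saved
netconsole log roughly like this. This is only a sketch: it assumes the log
contains lines in exactly the format quoted above, and the function names
count_notifies and last_notify are hypothetical.

```bash
#!/bin/bash
# Count ACPI Notify dispatches for one ACPI device in a saved netconsole log.
# $1: ACPI device name as it appears in the brackets (e.g. GPP0)
# $2: path to the log file
count_notifies() {
	local name="$1" logfile="$2"
	# -F matches the string literally, so "[GPP0]" is not read as a regex class.
	# grep -c exits non-zero when the count is 0, hence "|| true".
	grep -cF "Dispatching Notify on [${name}]" "$logfile" || true
}

# Print the last Notify line logged for the device, i.e. the one closest
# to the end of the log (and so to the crash).
last_notify() {
	local name="$1" logfile="$2"
	grep -F "Dispatching Notify on [${name}]" "$logfile" | tail -n 1
}
```

With a log saved as netconsole.log (a placeholder name),
"count_notifies GPP0 netconsole.log" prints the number of GPP0 notifies seen.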

GPP0 is the ACPI name of this PCI bridge (at least that's my best guess):

00:01.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Renoir PCIe GPP Bridge [1022:1633]

to which the discrete GPU is connected

03:00.0 Display controller [0380]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 23 [Radeon RX 6600/6600 XT/6600M] [1002:73ff] (rev c3)

via the PCI Express switch

01:00.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 XL Upstream Port of PCI Express Switch [1002:1478] (rev c3)
02:00.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 XL Downstream Port of PCI Express Switch [1002:1479]
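
The guess that GPP0 is the bridge at 00:01.1 can be checked through sysfs:
a PCI device that has an ACPI companion gets a firmware_node link, and the
ACPI device behind it exposes its namespace path in the "path" attribute.
A minimal sketch (the function name acpi_path_of is hypothetical, and the
optional second argument exists only so the function can be pointed at a
different sysfs root):

```bash
#!/bin/bash
# Print the ACPI namespace path of a PCI device, if the firmware describes one.
# $1: PCI address in domain:bus:dev.fn form (e.g. 0000:00:01.1)
# $2: sysfs mount point (defaults to /sys)
acpi_path_of() {
	local bdf="$1" sysfs="${2:-/sys}"
	local attr="${sysfs}/bus/pci/devices/${bdf}/firmware_node/path"
	if [ -r "$attr" ]; then
		cat "$attr"
	else
		echo "no ACPI companion found for ${bdf}" >&2
		return 1
	fi
}
```

If the guess is right, "acpi_path_of 0000:00:01.1" should print a path
ending in GPP0.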

While the GUI (xfce on xorg) on my laptop runs on the built-in GPU, the discrete
GPU usually wakes up quite often, e.g. when a window is opened or when scrolling down on YouTube.
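
These wake-ups can be observed through the runtime PM state that the kernel
exports in power/runtime_status for each PCI device. The following is a
sketch only; the function names are hypothetical and 0000:03:00.0 is the
discrete GPU from the lspci output in this mail. Reading this attribute
should not itself wake the device, unlike tools that read its config space.

```bash
#!/bin/bash
# Print the runtime PM state of a PCI device ("active", "suspended", ...).
# $1: PCI address in domain:bus:dev.fn form
# $2: sysfs mount point (defaults to /sys)
runtime_status() {
	local bdf="$1" sysfs="${2:-/sys}"
	cat "${sysfs}/bus/pci/devices/${bdf}/power/runtime_status"
}

# Poll once per second and print a timestamped line on every state change.
# Runs until interrupted or until the attribute can no longer be read.
watch_runtime_status() {
	local bdf="$1" sysfs="${2:-/sys}" prev="" cur
	while true; do
		cur=$(runtime_status "$bdf" "$sysfs") || return 1
		if [ "$cur" != "$prev" ]; then
			echo "$(date +%T) ${bdf}: ${cur}"
			prev="$cur"
		fi
		sleep 1
	done
}
```

Running "watch_runtime_status 0000:03:00.0" in a terminal while opening
windows would show each suspended/active transition of the discrete GPU.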

A somewhat reliable method to generate GPP0 notifies is putting on a YouTube
video and then periodically starting evolution with this script:

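#!/bin/bash
# Start and kill evolution once a minute to provoke GPP0 notifies.
for i in {0..1000}
do
	echo "$i"
	evolution &
	sleep 5		# leave evolution running for a few seconds
	killall evolution
	sleep 55	# wait out the rest of the minute
done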

This is also the method I used to test the debug kernel in the following mails.

Bert Karwatzki
