Message-ID: <232324a9-e82d-40b3-b88b-538947411a24@amd.com>
Date: Mon, 6 Oct 2025 14:39:18 +0200
From: Christian König <christian.koenig@....com>
To: Bert Karwatzki <spasswolf@....de>, linux-kernel@...r.kernel.org
Cc: linux-next@...r.kernel.org, linux-stable@...r.kernel.org,
regressions@...ts.linux.dev, linux-pci@...r.kernel.org,
linux-acpi@...r.kernel.org, Mario Limonciello <superm1@...nel.org>,
"Rafael J . Wysocki" <rafael.j.wysocki@...el.com>
Subject: Re: [REGRESSION 00/04] Crash during resume of pcie bridge
On 06.10.25 14:09, Bert Karwatzki wrote:
> Since Linux v6.15 I have been experiencing random crashes on my MSI Alpha 15 laptop
> running Debian trixie (amd64). The first such crash happened around the middle
> of June, and as there were no useful log messages, not even via netconsole,
> I suspected faulty hardware. So I ran memtest86+,
> found a faulty address line and replaced the memory (unfortunately going from 64G to 16G).
> But the crashes occurred again, so I did a thorough investigation.
>
> The crashes occur after 30min to 33h (yes, hours) of uptime and consist of a
> sudden reboot after which the PCI bridge at 00:02.4 and the nvme device
> connected to it are missing. If there's sound running during the crash then the
> first sign of the crash is the sound looping like a broken record for about 2s,
> after which the reboot happens. With the missing nvme device the reboot drops to
> a rescue shell. Using "shutdown -h now" from that shell and starting the laptop
> with the power button restores the missing PCI bridge and nvme device.
Oh well, it sounds like some PCIe device is dropping off the bus and taking its upstream bridge with it.
> As the bisections were not successful I tried to monitor the crash using
> netconsole and CONFIG_ACPI_DEBUG and "acpi.debug_layer=0xf acpi.debug_level=0x107"
> as command line parameters. With this the last message on netconsole before
> the crash is usually:
>
> [21465.639279] [ T251] evmisc-0132 ev_queue_notify_reques: Dispatching Notify on [GPP0] (Device) Value 0x00 (Bus Check) Node 00000000f81f36b8
A full dump of that might be helpful. That sounds like the dGPU is powering up/down.
>
> GPP0 is the ACPI name of this PCI bridge (at least that's my best guess):
>
> 00:01.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Renoir PCIe GPP Bridge [1022:1633]
>
> to which the discrete GPU is connected
>
> 03:00.0 Display controller [0380]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 23 [Radeon RX 6600/6600 XT/6600M] [1002:73ff] (rev c3)
>
> via the pci express switch
>
> 01:00.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 XL Upstream Port of PCI Express Switch [1002:1478] (rev c3)
> 02:00.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 XL Downstream Port of PCI Express Switch [1002:1479]
>
> While the GUI (Xfce on Xorg) on my laptop runs on the built-in GPU, the discrete
> GPU wakes up quite often, e.g. when a window is opened or when scrolling down on YouTube.
Yeah, that is a known issue and we are working on it.
Basically an application enumerates the possible render or video decode devices in the system and that wakes up the dGPU even when it isn't actually used.
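A minimal sketch for watching those wake-ups, assuming the dGPU sits at 0000:03:00.0 as in the lspci output quoted above and that the standard sysfs power/runtime_status attribute is available for it. The function name and the polling interval are only illustrative:

```shell
#!/bin/bash
# Print a timestamped line each time the runtime PM state read from a
# sysfs status file changes. Nothing here is specific to amdgpu.
watch_runtime_status() {
    local status_file="$1"    # e.g. /sys/bus/pci/devices/0000:03:00.0/power/runtime_status
    local polls="$2"          # how many times to poll
    local interval="${3:-1}"  # seconds between polls
    local last="" cur i=0
    while [ "$i" -lt "$polls" ]; do
        cur=$(cat "$status_file" 2>/dev/null || echo "unavailable")
        if [ "$cur" != "$last" ]; then
            echo "$(date +%T) $cur"
            last="$cur"
        fi
        sleep "$interval"
        i=$((i + 1))
    done
}

# Example (the device address is an assumption taken from the lspci output):
# watch_runtime_status /sys/bus/pci/devices/0000:03:00.0/power/runtime_status 3600 1
```

Running this next to the netconsole log should make it possible to line up "suspended" to "active" transitions with the GPP0 notifies.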
> A somewhat reliable method to generate GPP0 notifies is putting on a YouTube
> video and then periodically starting evolution with this script:
>
> #!/bin/bash
> for i in {0..1000}
> do
>     echo "$i"
>     evolution &
>     sleep 5
>     killall evolution
>     sleep 55
> done
>
> This is also the method I used to test the debug kernel in the following mails.
To further narrow down the issue please run your laptop with amdgpu.runpm=0 on the kernel command line for a while and see if that is stable or not.
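A small sketch for confirming after reboot that the option really made it onto the command line. The helper name is made up for illustration; the module parameter should normally also be readable under /sys/module/amdgpu/parameters/runpm, but treat that path as an assumption:

```shell
#!/bin/bash
# Echo "yes" if the given kernel command line contains amdgpu.runpm=0
# as a separate word, "no" otherwise.
runpm_disabled_on_cmdline() {
    local cmdline="$1"
    case " $cmdline " in
        *" amdgpu.runpm=0 "*) echo "yes" ;;
        *) echo "no" ;;
    esac
}

# Example:
# runpm_disabled_on_cmdline "$(cat /proc/cmdline)"
```

If this prints "no", the parameter was not picked up by the bootloader configuration and the test run says nothing about runtime PM.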
Thanks,
Christian.
>
> Bert Karwatzki