Message-ID: <3f790ee59129e5e49dd875526cb308cc4d97b99d.camel@web.de>
Date: Sun, 16 Nov 2025 22:08:54 +0100
From: Bert Karwatzki <spasswolf@....de>
To: Christian König <christian.koenig@....com>,
 "Mario Limonciello (AMD) (kernel.org)" <superm1@...nel.org>,
 linux-kernel@...r.kernel.org
Cc: linux-next@...r.kernel.org, regressions@...ts.linux.dev,
linux-pci@...r.kernel.org, linux-acpi@...r.kernel.org, "Rafael J . Wysocki"
<rafael.j.wysocki@...el.com>, spasswolf@....de,
acpica-devel@...ts.linux.dev, Robert Moore <robert.moore@...el.com>
Subject: Re: Crash during resume of pcie bridge due to infinite loop in
ACPICA
Am Montag, dem 10.11.2025 um 14:33 +0100 schrieb Christian König:
> Hi Bert,
>
> well sorry to say that but from your dumps it looks more and more like you just have faulty HW.
>
> An SMU response of 0xFFFFFFFF means that the device has spontaneously fallen of the bus while trying to resume it.
>
> My educated guess is that this is caused by a faulty power management, but basically it could be anything.
>
> Regards,
> Christian.
I think there may be more than one error here. The loss of the GPU (with the SMU response log message) may be
caused by faulty hardware, but it does not cause "the" crash (i.e. the crash which showed no log messages and was
so hard that one of my nvme devices was temporarily missing afterwards, and which caused me to investigate this in
the first place ...).
As bisection of the crash is impossible, I went back to inserting printk()s into acpi_power_transition() and the
functions it calls. To reduce log spam I created _debug-suffixed copies of the original functions.
The code can be found in the branch amdgpu_suspend_resume here:
https://gitlab.freedesktop.org/spasswolf/linux-stable/-/commits/amdgpu_suspend_resume?ref_type=heads
(Should I post the debug patches to the list?)
The last two commits finally cleared up what happens (though I've yet to find out why it happens).
6.14.0-debug-00014-g2e933c56f3b6 booted 20:17, 15.11.2025 crashed 0:50, 16.11.2025
(~4.5h, 518 GPP0 events, 393 GPU resumes)
The interesting part of the instrumented code is this:
acpi_status acpi_ps_parse_aml_debug(struct acpi_walk_state *walk_state)
{
	[...]
	printk(KERN_INFO "%s: before walk loop\n", __func__);
	while (walk_state) {
		if (ACPI_SUCCESS(status)) {
			/*
			 * The parse_loop executes AML until the method terminates
			 * or calls another method.
			 */
			status = acpi_ps_parse_loop(walk_state);
		}
		[...]
	}
	printk(KERN_INFO "%s: after walk loop\n", __func__);
	[...]
}
This gives the following messages in netconsole:
1. No crash:
2025-11-16T00:50:35.634745+01:00 10.0.0.1 6,21514,16419759755,-,caller=T59901;acpi_ps_execute_method_debug 329
2025-11-16T00:50:35.634745+01:00 10.0.0.1 6,21515,16419759781,-,caller=T59901;acpi_ps_parse_aml_debug: before walk loop
2025-11-16T00:50:35.921210+01:00 10.0.0.1 6,21516,16420046219,-,caller=T59901;acpi_ps_parse_aml_debug: after walk loop
2025-11-16T00:50:35.921210+01:00 10.0.0.1 6,21517,16420046231,-,caller=T59901;acpi_ps_execute_method_debug 331
2025-11-16T00:50:35.921210+01:00 10.0.0.1 6,21518,16420046235,-,caller=T59901;acpi_ns_evaluate_debug 475 METHOD
2025-11-16T00:50:35.921210+01:00 10.0.0.1 6,21519,16420046240,-,caller=T59901;acpi_evaluate_object_debug 255
2025-11-16T00:50:35.921210+01:00 10.0.0.1 6,21520,16420046244,-,caller=T59901;__acpi_power_on_debug 369
2025-11-16T00:50:35.921210+01:00 10.0.0.1 6,21521,16420046248,-,caller=T59901;acpi_power_on_unlocked_debug 446
2025-11-16T00:50:35.921210+01:00 10.0.0.1 6,21522,16420046251,-,caller=T59901;acpi_power_on_debug 471
2025-11-16T00:50:35.921210+01:00 10.0.0.1 6,21523,16420046255,-,caller=T59901;acpi_power_on_list_debug 642: result = 0
Resume successful, normal messages from resuming GPU follow.
2. Crash:
2025-11-16T00:50:46.483555+01:00 10.0.0.1 6,21566,16430609060,-,caller=T59702;acpi_ps_execute_method_debug 329
2025-11-16T00:50:46.483555+01:00 10.0.0.1 6,21567,16430609083,-,caller=T59702;acpi_ps_parse_aml_debug: before walk loop
No more messages via netconsole due to crash.
So we can already say that the walk loop in acpi_ps_parse_aml_debug() does not finish properly.
The next step is to put monitoring inside the loop:
6.14.0-debug-00015-gc09fd8dd0492 booted 12:09, 16.11.2025 crashed 19:55, 16.11.2025
(~8h, 1539 GPP0 events, 587 GPU resumes) "infinite" walk loop
The interesting part of the instrumented code is this:
acpi_status acpi_ps_parse_aml_debug(struct acpi_walk_state *walk_state)
{
	[...]
	printk(KERN_INFO "%s: before walk loop\n", __func__);
	while (walk_state) {
		if (ACPI_SUCCESS(status)) {
			/*
			 * The parse_loop executes AML until the method terminates
			 * or calls another method.
			 */
			printk(KERN_INFO "%s: before parse loop\n", __func__);
			status = acpi_ps_parse_loop(walk_state);
			printk(KERN_INFO "%s: after parse loop\n", __func__);
		}
		[...]
	}
	printk(KERN_INFO "%s: after walk loop\n", __func__);
	[...]
}
This gives the following messages in netconsole:
1. No crash:
2025-11-16T19:55:54.203765+01:00 localhost 6,5479352,28054924877,-,caller=T5967;acpi_ps_execute_method_debug 329
2025-11-16T19:55:54.203765+01:00 localhost 6,5479353,28054924889,-,caller=T5967;acpi_ps_parse_aml_debug: before walk loop
The next two lines are repeated 1500-1700 times (the count varies a little ...):
2025-11-16T19:55:54.203765+01:00 localhost 6,5479354,28054924894,-,caller=T5967;acpi_ps_parse_aml_debug: before parse loop
2025-11-16T19:55:54.203765+01:00 localhost 6,5479355,28054924908,-,caller=T5967;acpi_ps_parse_aml_debug: after parse loop
2025-11-16T19:55:54.498216+01:00 localhost 6,5482288,28055219778,-,caller=T5967;acpi_ps_parse_aml_debug: after walk loop
2025-11-16T19:55:54.498216+01:00 localhost 6,5482289,28055219782,-,caller=T5967;acpi_ps_execute_method_debug 331
2025-11-16T19:55:54.498233+01:00 localhost 6,5482290,28055219786,-,caller=T5967;acpi_ns_evaluate_debug 475 METHOD
2025-11-16T19:55:54.498233+01:00 localhost 6,5482291,28055219791,-,caller=T5967;acpi_evaluate_object_debug 255
2025-11-16T19:55:54.498233+01:00 localhost 6,5482292,28055219795,-,caller=T5967;__acpi_power_on_debug 369
2025-11-16T19:55:54.498247+01:00 localhost 6,5482293,28055219799,-,caller=T5967;acpi_power_on_unlocked_debug 446
2025-11-16T19:55:54.498247+01:00 localhost 6,5482294,28055219802,-,caller=T5967;acpi_power_on_debug 471
2025-11-16T19:55:54.498247+01:00 localhost 6,5482295,28055219806,-,caller=T5967;acpi_power_on_list_debug 642: result = 0
Resume successful, normal messages from resuming GPU follow.
2. Crash:
2025-11-16T19:56:24.213495+01:00 localhost 6,5483042,28084932950,-,caller=T5967;acpi_ps_execute_method_debug 329
2025-11-16T19:56:24.213495+01:00 localhost 6,5483043,28084932965,-,caller=T5967;acpi_ps_parse_aml_debug: before walk loop
The next two lines are repeated more than 30000 times, then the transmission stops due to the crash:
2025-11-16T19:56:24.213495+01:00 localhost 6,5483044,28084932971,-,caller=T5967;acpi_ps_parse_aml_debug: before parse loop
2025-11-16T19:56:24.213495+01:00 localhost 6,5483045,28084932991,-,caller=T5967;acpi_ps_parse_aml_debug: after parse loop
No more messages via netconsole due to crash.
So the walk loop in acpi_ps_parse_aml() keeps repeating without ever terminating. Even if there is some kind
of hardware error, I don't think this should happen.
This bug is present in every kernel version I've tested so far: 6.12.x, 6.13.x, 6.14.x,
6.15.x, 6.16.x, and 6.17.x (for 6.17 I only tested the release candidates). 6.18 has not been tested yet.
Getting to this result took several months of 24/7 test runs; I hope resolving it will
be faster.
Bert Karwatzki