Message-ID: <CAJZ5v0iRaYBU+1S4rqYR7D6XC+rfQ2+0hgbodweV5JsFr8EEnQ@mail.gmail.com>
Date: Mon, 17 Nov 2025 17:40:42 +0100
From: "Rafael J. Wysocki" <rafael@...nel.org>
To: Bert Karwatzki <spasswolf@....de>
Cc: Christian König <christian.koenig@....com>,
"Mario Limonciello (AMD) (kernel.org)" <superm1@...nel.org>, linux-kernel@...r.kernel.org,
linux-next@...r.kernel.org, regressions@...ts.linux.dev,
linux-pci@...r.kernel.org, linux-acpi@...r.kernel.org,
"Rafael J . Wysocki" <rafael.j.wysocki@...el.com>, acpica-devel@...ts.linux.dev,
Robert Moore <robert.moore@...el.com>, Saket Dumbre <saket.dumbre@...el.com>
Subject: Re: Crash during resume of pcie bridge due to infinite loop in ACPICA
+Saket
On Sun, Nov 16, 2025 at 10:09 PM Bert Karwatzki <spasswolf@....de> wrote:
>
> Am Montag, dem 10.11.2025 um 14:33 +0100 schrieb Christian König:
> > Hi Bert,
> >
> > well, sorry to say that, but from your dumps it looks more and more like you just have faulty HW.
> >
> > An SMU response of 0xFFFFFFFF means that the device has spontaneously fallen off the bus while trying to resume it.
> >
> > My educated guess is that this is caused by faulty power management, but basically it could be anything.
> >
> > Regards,
> > Christian.
>
> I think there may be more than one error here. The loss of the GPU (with the SMU response log message) may be
> caused by faulty hardware, but it does not cause "the" crash (i.e. the crash which showed no log messages, which
> was so severe that one of my NVMe devices was temporarily missing afterwards, and which caused me to investigate
> this in the first place ...).
>
> As bisection of the crash is impossible, I went back to inserting printk()s into acpi_power_transition() and the
> functions called by it. To reduce log spam, I created _debug-suffixed copies of the original functions.
> The code is found here in branch amdgpu_suspend_resume:
> https://gitlab.freedesktop.org/spasswolf/linux-stable/-/commits/amdgpu_suspend_resume?ref_type=heads
> (Should I post the debug patches to the list?)
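>
> The basic pattern looks like this (a simplified wrapper-style sketch, not
> the actual patches; the real _debug functions are full copies of the
> originals, with their callees redirected to _debug versions as well):
>
> int acpi_power_transition_debug(struct acpi_device *device, int state)
> {
>         int result;
>
>         /* Trace entry and exit so that the last message seen before a
>          * silent crash pinpoints where execution stopped. */
>         printk(KERN_INFO "%s: %s -> D%d\n", __func__,
>                dev_name(&device->dev), state);
>         result = acpi_power_transition(device, state);
>         printk(KERN_INFO "%s: result = %d\n", __func__, result);
>         return result;
> }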
>
> The last two commits finally cleared up what happens (but I've yet to find out why this happens).
>
> 6.14.0-debug-00014-g2e933c56f3b6 booted 20:17, 15.11.2025 crashed 0:50, 16.11.2025
> (~4.5h, 518 GPP0 events, 393 GPU resumes)
>
> The interesting part of the instrumented code is this:
>
> acpi_status acpi_ps_parse_aml_debug(struct acpi_walk_state *walk_state)
> {
>         [...]
>         printk(KERN_INFO "%s: before walk loop\n", __func__);
>         while (walk_state) {
>                 if (ACPI_SUCCESS(status)) {
>                         /*
>                          * The parse_loop executes AML until the method terminates
>                          * or calls another method.
>                          */
>                         status = acpi_ps_parse_loop(walk_state);
>                 }
>                 [...]
>         }
>         printk(KERN_INFO "%s: after walk loop\n", __func__);
>         [...]
> }
>
> This gives the following messages in netconsole:
> 1. No crash:
> 2025-11-16T00:50:35.634745+01:00 10.0.0.1 6,21514,16419759755,-,caller=T59901;acpi_ps_execute_method_debug 329
> 2025-11-16T00:50:35.634745+01:00 10.0.0.1 6,21515,16419759781,-,caller=T59901;acpi_ps_parse_aml_debug: before walk loop
> 2025-11-16T00:50:35.921210+01:00 10.0.0.1 6,21516,16420046219,-,caller=T59901;acpi_ps_parse_aml_debug: after walk loop
> 2025-11-16T00:50:35.921210+01:00 10.0.0.1 6,21517,16420046231,-,caller=T59901;acpi_ps_execute_method_debug 331
> 2025-11-16T00:50:35.921210+01:00 10.0.0.1 6,21518,16420046235,-,caller=T59901;acpi_ns_evaluate_debug 475 METHOD
> 2025-11-16T00:50:35.921210+01:00 10.0.0.1 6,21519,16420046240,-,caller=T59901;acpi_evaluate_object_debug 255
> 2025-11-16T00:50:35.921210+01:00 10.0.0.1 6,21520,16420046244,-,caller=T59901;__acpi_power_on_debug 369
> 2025-11-16T00:50:35.921210+01:00 10.0.0.1 6,21521,16420046248,-,caller=T59901;acpi_power_on_unlocked_debug 446
> 2025-11-16T00:50:35.921210+01:00 10.0.0.1 6,21522,16420046251,-,caller=T59901;acpi_power_on_debug 471
> 2025-11-16T00:50:35.921210+01:00 10.0.0.1 6,21523,16420046255,-,caller=T59901;acpi_power_on_list_debug 642: result = 0
> Resume successful, normal messages from resuming GPU follow.
>
> 2. Crash:
> 2025-11-16T00:50:46.483555+01:00 10.0.0.1 6,21566,16430609060,-,caller=T59702;acpi_ps_execute_method_debug 329
> 2025-11-16T00:50:46.483555+01:00 10.0.0.1 6,21567,16430609083,-,caller=T59702;acpi_ps_parse_aml_debug: before walk loop
> No more messages via netconsole due to crash.
>
> So here we can already say that the main loop in acpi_ps_parse_aml_debug() is not finishing properly.
>
> The next step is to put monitoring inside the loop:
>
> 6.14.0-debug-00015-gc09fd8dd0492 booted 12:09, 16.11.2025 crashed 19:55, 16.11.2025
> (~8h, 1539 GPP0 events, 587 GPU resumes) "infinite" walk loop
>
> The interesting part of the instrumented code is this:
>
> acpi_status acpi_ps_parse_aml_debug(struct acpi_walk_state *walk_state)
> {
>         [...]
>         printk(KERN_INFO "%s: before walk loop\n", __func__);
>         while (walk_state) {
>                 if (ACPI_SUCCESS(status)) {
>                         /*
>                          * The parse_loop executes AML until the method terminates
>                          * or calls another method.
>                          */
>                         printk(KERN_INFO "%s: before parse loop\n", __func__);
>                         status = acpi_ps_parse_loop(walk_state);
>                         printk(KERN_INFO "%s: after parse loop\n", __func__);
>                 }
>                 [...]
>         }
>         printk(KERN_INFO "%s: after walk loop\n", __func__);
>         [...]
> }
>
> This gives the following messages in netconsole:
> 1. No crash:
> 2025-11-16T19:55:54.203765+01:00 localhost 6,5479352,28054924877,-,caller=T5967;acpi_ps_execute_method_debug 329
> 2025-11-16T19:55:54.203765+01:00 localhost 6,5479353,28054924889,-,caller=T5967;acpi_ps_parse_aml_debug: before walk loop
> The next two lines are repeated 1500-1700 times (it varies a little ...):
> 2025-11-16T19:55:54.203765+01:00 localhost 6,5479354,28054924894,-,caller=T5967;acpi_ps_parse_aml_debug: before parse loop
> 2025-11-16T19:55:54.203765+01:00 localhost 6,5479355,28054924908,-,caller=T5967;acpi_ps_parse_aml_debug: after parse loop
>
> 2025-11-16T19:55:54.498216+01:00 localhost 6,5482288,28055219778,-,caller=T5967;acpi_ps_parse_aml_debug: after walk loop
> 2025-11-16T19:55:54.498216+01:00 localhost 6,5482289,28055219782,-,caller=T5967;acpi_ps_execute_method_debug 331
> 2025-11-16T19:55:54.498233+01:00 localhost 6,5482290,28055219786,-,caller=T5967;acpi_ns_evaluate_debug 475 METHOD
> 2025-11-16T19:55:54.498233+01:00 localhost 6,5482291,28055219791,-,caller=T5967;acpi_evaluate_object_debug 255
> 2025-11-16T19:55:54.498233+01:00 localhost 6,5482292,28055219795,-,caller=T5967;__acpi_power_on_debug 369
> 2025-11-16T19:55:54.498247+01:00 localhost 6,5482293,28055219799,-,caller=T5967;acpi_power_on_unlocked_debug 446
> 2025-11-16T19:55:54.498247+01:00 localhost 6,5482294,28055219802,-,caller=T5967;acpi_power_on_debug 471
> 2025-11-16T19:55:54.498247+01:00 localhost 6,5482295,28055219806,-,caller=T5967;acpi_power_on_list_debug 642: result = 0
> Resume successful, normal messages from resuming GPU follow.
>
> 2. Crash:
> 2025-11-16T19:56:24.213495+01:00 localhost 6,5483042,28084932950,-,caller=T5967;acpi_ps_execute_method_debug 329
> 2025-11-16T19:56:24.213495+01:00 localhost 6,5483043,28084932965,-,caller=T5967;acpi_ps_parse_aml_debug: before walk loop
> The next two lines are repeated more than 30000 times, then the transmission stops due to the crash:
> 2025-11-16T19:56:24.213495+01:00 localhost 6,5483044,28084932971,-,caller=T5967;acpi_ps_parse_aml_debug: before parse loop
> 2025-11-16T19:56:24.213495+01:00 localhost 6,5483045,28084932991,-,caller=T5967;acpi_ps_parse_aml_debug: after parse loop
> No more messages via netconsole due to crash.
>
> So there is some kind of infinite recursion happening inside the loop in acpi_ps_parse_aml(). Even if there is
> some kind of hardware error, this shouldn't happen, I think.
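>
> A possible next step is to count the loop iterations and dump state once
> the count exceeds what a successful resume needs (a sketch; the cap of
> 10000 is arbitrary, chosen well above the ~1500-1700 iterations seen in
> the good case):
>
>         unsigned long iterations = 0;
>
>         while (walk_state) {
>                 /* Flag the runaway case once instead of printing a
>                  * message for every iteration. */
>                 if (++iterations == 10000) {
>                         printk(KERN_WARNING "%s: runaway walk loop (%lu iterations)\n",
>                                __func__, iterations);
>                         dump_stack();
>                 }
>                 [...]
>         }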
>
> This bug is present in every kernel version I've tested so far, that is 6.12.x, 6.13.x, 6.14.x,
> 6.15.x, 6.16.x and 6.17.x (here I only tested the release candidates). 6.18 has not been tested yet.
>
> Getting to this result took several months of 24/7 test runs; I hope resolving it will
> be faster.
Well, what you have found appears to be an issue in the AML bytecode
interpreter, which may be one of two things: (1) a bug in the
interpreter itself, or (2) a bytecode issue that causes the interpreter
to crash (eventually); the latter is quite a bit more likely.
I'd suggest opening a new issue at
https://github.com/acpica/acpica/issues and attaching the acpidump
output from the affected system, to start with.
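For completeness: the acpidump output can typically be generated by
running the acpidump utility as root (e.g. "acpidump -o acpi.dat") and
the resulting file attached to the issue.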