[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <ec99725df78fdd0fd9d4398d00fdebb84cb38ee6.camel@web.de>
Date: Fri, 05 Dec 2025 11:05:16 +0100
From: Bert Karwatzki <spasswolf@....de>
To: "Rafael J. Wysocki" <rafael@...nel.org>
Cc: Christian König <christian.koenig@....com>, "Mario
Limonciello (AMD) (kernel.org)" <superm1@...nel.org>,
linux-kernel@...r.kernel.org, linux-next@...r.kernel.org,
regressions@...ts.linux.dev, linux-pci@...r.kernel.org,
linux-acpi@...r.kernel.org, "Rafael J . Wysocki"
<rafael.j.wysocki@...el.com>, acpica-devel@...ts.linux.dev, Robert Moore
<robert.moore@...el.com>, Saket Dumbre <saket.dumbre@...el.com>,
spasswolf@....de, Bjorn Helgaas <bhelgaas@...gle.com>
Subject: Re: Crash during resume of pcie bridge due to incorrect error
handling
I've got good and bad news on this. The good news first:
I was able to get through a failed resume without a crash with the following
changes to the pci resume process:
1: pci_pm_bridge_power_up_actions() returns -ENODV when pci_bridge_wait_for_secondary()
returns an error (pci_walk_bus_debug() is just pci_walk_bus() with added printk()s),
and also sets state to PCI_D3cold on failure:
static int pci_pm_bridge_power_up_actions(struct pci_dev *pci_dev)
{
int ret;
dev_info(&pci_dev->dev, "%s %d\n", __func__, __LINE__);
ret = pci_bridge_wait_for_secondary_bus(pci_dev, "resume");
dev_info(&pci_dev->dev, "%s %d: ret = %d\n", __func__, __LINE__, ret);
if (ret) {
/*
* The downstream link failed to come up, so mark the
* devices below as disconnected to make sure we don't
* attempt to resume them.
*/
pci_walk_bus_debug(pci_dev->subordinate, pci_dev_set_disconnected,
NULL);
pci_update_current_state(pci_dev, PCI_D3cold);
dev_info(&pci_dev->dev, "%s: bridge failed to power up\n", __func__);
return -ENODEV;
}
/*
* When powering on a bridge from D3cold, the whole hierarchy may be
* powered on into D0uninitialized state, resume them to give them a
* chance to suspend again
*/
pci_resume_bus(pci_dev->subordinate);
return 0;
}
2. pci_pm_runtime_resume() has an early exit if pci_pm_bridge_power_up_actions() fails:
static int pci_pm_runtime_resume(struct device *dev)
{
struct pci_dev *pci_dev = to_pci_dev(dev);
const struct dev_pm_ops *pm = dev->driver ? dev->driver->pm : NULL;
pci_power_t prev_state = pci_dev->current_state;
int error = 0;
/*
* Restoring config space is necessary even if the device is not bound
* to a driver because although we left it in D0, it may have gone to
* D3cold when the bridge above it runtime suspended.
*/
pci_pm_default_resume_early(pci_dev);
pci_resume_ptm(pci_dev);
if (!pci_dev->driver)
return 0;
pci_fixup_device(pci_fixup_resume_early, pci_dev);
pci_pm_default_resume(pci_dev);
if (prev_state == PCI_D3cold) {
error = pci_pm_bridge_power_up_actions(pci_dev);
if (error)
return error;
}
[...]
}
3. acpiphp_check_bridge() has an early exit if pm_runtime_get_sync()
returns an error (pm_runtime_get_sync() basically returns the result of
pci_pm_runtime_resume() which is called as a callback in rpm_resume())
static void acpiphp_check_bridge(struct acpiphp_bridge *bridge)
{
struct acpiphp_slot *slot;
int ret;
/* Bail out if the bridge is going away. */
if (bridge->is_going_away)
return;
if (bridge->pci_dev) {
ret = pm_runtime_get_sync(&bridge->pci_dev->dev);
if (ret < 0) {
dev_info(&bridge->pci_dev->dev, "%s: pm_runtime_get_sync() failed with ret = %d\n", __func__, ret);
return;
}
}
With these changes I get the following messages in dmesg when acpi_power_on_list() fails
for the pci bridge:
2025-12-05T01:58:53.260991+01:00 lisa kernel: [ T1772] acpi_power_on_list_debug 649: result = -19
2025-12-05T01:58:53.260993+01:00 lisa kernel: [ T1772] pcieport 0000:00:01.1: pci_pm_default_resume_early 567
2025-12-05T01:58:53.260994+01:00 lisa kernel: [ T1772] pcieport 0000:00:01.1: pci_pm_bridge_power_up_actions 576
2025-12-05T01:58:54.282032+01:00 lisa kernel: [ T1772] pcieport 0000:00:01.1: broken device, retraining non-functional downstream link at 2.5GT/s
2025-12-05T01:58:55.282033+01:00 lisa kernel: [ T1772] pcieport 0000:00:01.1: retraining failed
2025-12-05T01:58:55.282053+01:00 lisa kernel: [ T1772] pcieport 0000:00:01.1: Data Link Layer Link Active not set in 1000 msec
2025-12-05T01:58:55.282055+01:00 lisa kernel: [ T1772] pcieport 0000:00:01.1: pci_pm_bridge_power_up_actions 578: ret = -25
2025-12-05T01:58:55.282057+01:00 lisa kernel: [ T1772] pcieport 0000:01:00.0: __pci_walk_bus_debug 0
2025-12-05T01:58:55.282058+01:00 lisa kernel: [ T1772] pcieport 0000:01:00.0: pci_dev_set_disconnected: 0
2025-12-05T01:58:55.282060+01:00 lisa kernel: [ T1772] pcieport 0000:01:00.0: pci_dev_set_disconnected: 1
2025-12-05T01:58:55.282063+01:00 lisa kernel: [ T1772] pcieport 0000:01:00.0: __pci_walk_bus_debug 1: ret = 0
2025-12-05T01:58:55.282065+01:00 lisa kernel: [ T1772] pcieport 0000:01:00.0: __pci_walk_bus_debug 1.0
2025-12-05T01:58:55.282067+01:00 lisa kernel: [ T1772] pcieport 0000:02:00.0: __pci_walk_bus_debug 0
2025-12-05T01:58:55.282070+01:00 lisa kernel: [ T1772] pcieport 0000:02:00.0: pci_dev_set_disconnected: 0
2025-12-05T01:58:55.282073+01:00 lisa kernel: [ T1772] pcieport 0000:02:00.0: pci_dev_set_disconnected: 1
2025-12-05T01:58:55.282101+01:00 lisa kernel: [ T1772] pcieport 0000:02:00.0: __pci_walk_bus_debug 1: ret = 0
2025-12-05T01:58:55.282103+01:00 lisa kernel: [ T1772] pcieport 0000:02:00.0: __pci_walk_bus_debug 1.0
2025-12-05T01:58:55.282105+01:00 lisa kernel: [ T1772] amdgpu 0000:03:00.0: __pci_walk_bus_debug 0
2025-12-05T01:58:55.282107+01:00 lisa kernel: [ T1772] amdgpu 0000:03:00.0: pci_dev_set_disconnected: 0
2025-12-05T01:58:55.282109+01:00 lisa kernel: [ T1772] amdgpu 0000:03:00.0: pci_dev_set_disconnected: 1
2025-12-05T01:58:55.282111+01:00 lisa kernel: [ T1772] amdgpu 0000:03:00.0: __pci_walk_bus_debug 1: ret = 0
2025-12-05T01:58:55.282113+01:00 lisa kernel: [ T1772] snd_hda_intel 0000:03:00.1: __pci_walk_bus_debug 0
2025-12-05T01:58:55.282115+01:00 lisa kernel: [ T1772] snd_hda_intel 0000:03:00.1: pci_dev_set_disconnected: 0
2025-12-05T01:58:55.282116+01:00 lisa kernel: [ T1772] snd_hda_intel 0000:03:00.1: pci_dev_set_disconnected: 1
2025-12-05T01:58:55.282118+01:00 lisa kernel: [ T1772] snd_hda_intel 0000:03:00.1: __pci_walk_bus_debug 1: ret = 0
2025-12-05T01:58:55.282120+01:00 lisa kernel: [ T1772] __pci_walk_bus_debug: ret = 0
2025-12-05T01:58:55.282122+01:00 lisa kernel: [ T1772] pcieport 0000:02:00.0: __pci_walk_bus_debug 1.1: ret = 0
2025-12-05T01:58:55.282124+01:00 lisa kernel: [ T1772] __pci_walk_bus_debug: ret = 0
2025-12-05T01:58:55.282126+01:00 lisa kernel: [ T1772] pcieport 0000:01:00.0: __pci_walk_bus_debug 1.1: ret = 0
2025-12-05T01:58:55.282128+01:00 lisa kernel: [ T1772] __pci_walk_bus_debug: ret = 0
2025-12-05T01:58:55.282130+01:00 lisa kernel: [ T1772] pcieport 0000:00:01.1: pci_pm_bridge_power_up_actions: bridge failed to power up
2025-12-05T01:58:55.282131+01:00 lisa kernel: [ T1772] pcieport 0000:00:01.1: rpm_resume 916 retval = -19
2025-12-05T01:58:55.282133+01:00 lisa kernel: [ T1772] pcieport 0000:00:01.1: rpm_resume 922
2025-12-05T01:58:55.282135+01:00 lisa kernel: [ T1772] pcieport 0000:00:01.1: rpm_resume 937
2025-12-05T01:58:55.282137+01:00 lisa kernel: [ T1772] pcieport 0000:00:01.1: rpm_resume 947
2025-12-05T01:58:55.282138+01:00 lisa kernel: [ T1772] pcieport 0000:00:01.1: rpm_resume 950
2025-12-05T01:58:55.282140+01:00 lisa kernel: [ T1772] pcieport 0000:00:01.1: rpm_resume 957: retval = -19
2025-12-05T01:58:55.282141+01:00 lisa kernel: [ T1772] pcieport 0000:00:01.1: rpm_resume 957: retval = -22
2025-12-05T01:58:55.523974+01:00 lisa kernel: [T192257] pcieport 0000:00:01.1: rpm_resume 957: retval = -22
2025-12-05T01:59:21.896980+01:00 lisa kernel: [ T1772] pcieport 0000:00:01.1: rpm_resume 957: retval = -22
2025-12-05T01:59:21.896986+01:00 lisa kernel: [ T1772] pcieport 0000:00:01.1: rpm_resume 957: retval = -22
2025-12-05T01:59:22.141965+01:00 lisa kernel: [T192400] pcieport 0000:00:01.1: rpm_resume 957: retval = -22
2025-12-05T01:59:51.903964+01:00 lisa kernel: [ T1772] pcieport 0000:00:01.1: rpm_resume 957: retval = -22
2025-12-05T01:59:51.903974+01:00 lisa kernel: [ T1772] pcieport 0000:00:01.1: rpm_resume 957: retval = -22
2025-12-05T01:59:52.153973+01:00 lisa kernel: [T192554] pcieport 0000:00:01.1: rpm_resume 957: retval = -22
2025-12-05T02:00:21.915965+01:00 lisa kernel: [ T1772] pcieport 0000:00:01.1: rpm_resume 957: retval = -22
2025-12-05T02:00:21.915971+01:00 lisa kernel: [ T1772] pcieport 0000:00:01.1: rpm_resume 957: retval = -22
2025-12-05T02:00:22.161971+01:00 lisa kernel: [T192706] pcieport 0000:00:01.1: rpm_resume 957: retval = -22
2025-12-05T02:00:51.922981+01:00 lisa kernel: [ T1772] pcieport 0000:00:01.1: rpm_resume 957: retval = -22
2025-12-05T02:00:51.922988+01:00 lisa kernel: [ T1772] pcieport 0000:00:01.1: rpm_resume 957: retval = -22
2025-12-05T02:00:52.175975+01:00 lisa kernel: [T192899] pcieport 0000:00:01.1: rpm_resume 957: retval = -22
2025-12-05T02:00:52.931997+01:00 lisa kernel: [T180476] pcieport 0000:00:01.1: rpm_resume 957: retval = -22
2025-12-05T02:00:52.932008+01:00 lisa kernel: [T180476] pcieport 0000:00:01.1: acpiphp_check_bridge: pm_runtime_get_sync() failed with ret = -22
2025-12-05T02:01:21.933961+01:00 lisa kernel: [ T1772] pcieport 0000:00:01.1: rpm_resume 957: retval = -22
2025-12-05T02:01:21.933965+01:00 lisa kernel: [ T1772] pcieport 0000:00:01.1: rpm_resume 957: retval = -22
2025-12-05T02:01:22.172962+01:00 lisa kernel: [T193054] pcieport 0000:00:01.1: rpm_resume 957: retval = -22
2025-12-05T02:01:23.001998+01:00 lisa kernel: [T192396] pcieport 0000:00:01.1: rpm_resume 957: retval = -22
2025-12-05T02:01:23.002009+01:00 lisa kernel: [T192396] pcieport 0000:00:01.1: acpiphp_check_bridge: pm_runtime_get_sync() failed with ret = -22
2025-12-05T02:01:51.942980+01:00 lisa kernel: [ T1772] pcieport 0000:00:01.1: rpm_resume 957: retval = -22
2025-12-05T02:01:51.942986+01:00 lisa kernel: [ T1772] pcieport 0000:00:01.1: rpm_resume 957: retval = -22
2025-12-05T02:01:52.188969+01:00 lisa kernel: [T193209] pcieport 0000:00:01.1: rpm_resume 957: retval = -22
2025-12-05T02:01:52.958995+01:00 lisa kernel: [T188262] pcieport 0000:00:01.1: rpm_resume 957: retval = -22
2025-12-05T02:01:52.959009+01:00 lisa kernel: [T188262] pcieport 0000:00:01.1: acpiphp_check_bridge: pm_runtime_get_sync() failed with ret = -22
2025-12-05T02:02:21.951971+01:00 lisa kernel: [ T1772] pcieport 0000:00:01.1: rpm_resume 957: retval = -22
2025-12-05T02:02:21.951980+01:00 lisa kernel: [ T1772] pcieport 0000:00:01.1: rpm_resume 957: retval = -22
2025-12-05T02:02:22.205974+01:00 lisa kernel: [T193359] pcieport 0000:00:01.1: rpm_resume 957: retval = -22
2025-12-05T02:02:23.117988+01:00 lisa kernel: [T192396] pcieport 0000:00:01.1: rpm_resume 957: retval = -22
2025-12-05T02:02:23.118000+01:00 lisa kernel: [T192396] pcieport 0000:00:01.1: acpiphp_check_bridge: pm_runtime_get_sync() failed with ret = -22
2025-12-05T02:02:51.963970+01:00 lisa kernel: [ T1772] pcieport 0000:00:01.1: rpm_resume 957: retval = -22
2025-12-05T02:02:51.963976+01:00 lisa kernel: [ T1772] pcieport 0000:00:01.1: rpm_resume 957: retval = -22
2025-12-05T02:02:52.209969+01:00 lisa kernel: [T193512] pcieport 0000:00:01.1: rpm_resume 957: retval = -22
2025-12-05T02:02:53.465055+01:00 lisa kernel: [T192396] pcieport 0000:00:01.1: rpm_resume 957: retval = -22
2025-12-05T02:02:53.465074+01:00 lisa kernel: [T192396] pcieport 0000:00:01.1: acpiphp_check_bridge: pm_runtime_get_sync() failed with ret = -22
2025-12-05T02:02:57.133996+01:00 lisa kernel: [T192396] pcieport 0000:00:01.1: rpm_resume 957: retval = -22
2025-12-05T02:02:57.134006+01:00 lisa kernel: [T192396] pcieport 0000:00:01.1: acpiphp_check_bridge: pm_runtime_get_sync() failed with ret = -22
These lines continue, all further resumes fail but no crash occurs.
The complete debug code used is here (it's rather messy though):
https://gitlab.freedesktop.org/spasswolf/linux-stable/-/commits/amdgpu_suspend_resume?ref_type=heads
I've not tested yet, If the fix above also works on a pure v6.14 without all the debug patches.
The bad news is I've encountered another version of the crash. (at least 90% of these crashes occur after
acpi_power_on() fails as above, but at least 2 crashes happened on suspend) Now a third version has appeared
which occured after an SMU resume failure in amdgpu (up to now SMU failure have appeared but they did not
result in a crash like this)
2025-12-04T12:52:19.589753+01:00 T39596;amdgpu 0000:03:00.0: amdgpu: SMU is resuming...#012 SUBSYSTEM=pci#012 DEVICE=+pci:0000:03:00.0
2025-12-04T12:52:19.589753+01:00 T39596;amdgpu 0000:03:00.0: amdgpu: smu driver if version = 0x0000000f, smu fw if version = 0x00000013, smu fw program = 0,
version = 0x003b3100 (59.49.0)#012 SUBSYSTEM=pci#012 DEVICE=+pci:0000:03:00.0
2025-12-04T12:52:19.589753+01:00 T39596;amdgpu 0000:03:00.0: amdgpu: SMU driver if version not matched#012 SUBSYSTEM=pci#012 DEVICE=+pci:0000:03:00.0
2025-12-04T12:52:19.861952+01:00 T39596;amdgpu 0000:03:00.0: amdgpu: SMU: response:0xFFFFFFFF for index:6 param:0x00000000 message:EnableAllSmuFeatures?#012
SUBSYSTEM=pci#012 DEVICE=+pci:0000:03:00.0
2025-12-04T12:52:19.861952+01:00 T39596;amdgpu 0000:03:00.0: amdgpu: Failed to enable requested dpm features!#012 SUBSYSTEM=pci#012 DEVICE=+pci:0000:03:00.0
2025-12-04T12:52:19.861952+01:00 T39596;amdgpu 0000:03:00.0: amdgpu: Failed to setup smc hw!#012 SUBSYSTEM=pci#012 DEVICE=+pci:0000:03:00.0
2025-12-04T12:52:19.861952+01:00 T39596;amdgpu 0000:03:00.0: amdgpu: resume of IP block <smu> failed -121#012 SUBSYSTEM=pci#012 DEVICE=+pci:0000:03:00.0
2025-12-04T12:52:19.861952+01:00 T39596;amdgpu 0000:03:00.0: amdgpu: amdgpu_device_ip_resume failed (-121).#012 SUBSYSTEM=pci#012 DEVICE=+pci:0000:03:00.0
2025-12-04T12:52:19.861952+01:00 T39596;amdgpu 0000:03:00.0: pci_pm_runtime_resume 1380 error = -121 state = 0x0#012 SUBSYSTEM=pci#012 DEVICE=+pci:0000:03:00.0
From here on the system hangs for quite a long time (9s!) while these error messages appear on netconsole:
2025-12-04T12:52:20.870639+01:00 C0;INFO: NMI handler (perf_event_nmi_handler) took too long to run: 122.609 msecs
2025-12-04T12:52:21.072400+01:00 C0;perf: interrupt took too long (958843 > 2500), lowering kernel.perf_event_max_sample_rate to 1000
2025-12-04T12:52:21.879308+01:00 C5;INFO: NMI handler (perf_event_nmi_handler) took too long to run: 138.016 msecs
2025-12-04T12:52:23.493066+01:00 C13;INFO: NMI handler (perf_event_nmi_handler) took too long to run: 162.351 msecs
2025-12-04T12:52:23.694900+01:00 C13;perf: interrupt took too long (1268455 > 1198553), lowering kernel.perf_event_max_sample_rate to 1000
2025-12-04T12:52:26.318173+01:00 C11;INFO: NMI handler (perf_event_nmi_handler) took too long to run: 182.852 msecs
2025-12-04T12:52:28.940844+01:00 C0;perf: interrupt took too long (2079549 > 1585568), lowering kernel.perf_event_max_sample_rate to 1000
[crash, no further messages]
So perhaps acpi_php_check_bridge() is not the only place where a returned error
is incorrectly ignored.
Bert Karwatzki
Powered by blists - more mailing lists