lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <ec99725df78fdd0fd9d4398d00fdebb84cb38ee6.camel@web.de>
Date: Fri, 05 Dec 2025 11:05:16 +0100
From: Bert Karwatzki <spasswolf@....de>
To: "Rafael J. Wysocki" <rafael@...nel.org>
Cc: Christian König <christian.koenig@....com>, "Mario
 Limonciello (AMD) (kernel.org)"	 <superm1@...nel.org>,
 linux-kernel@...r.kernel.org, linux-next@...r.kernel.org, 
	regressions@...ts.linux.dev, linux-pci@...r.kernel.org, 
	linux-acpi@...r.kernel.org, "Rafael J . Wysocki"
 <rafael.j.wysocki@...el.com>, 	acpica-devel@...ts.linux.dev, Robert Moore
 <robert.moore@...el.com>, Saket Dumbre <saket.dumbre@...el.com>,
 spasswolf@....de, Bjorn Helgaas <bhelgaas@...gle.com>
Subject: Re: Crash during resume of pcie bridge due to incorrect error
 handling

I've got good and bad news on this. The good news first:

I was able to get through a failed resume without a crash with the following
changes to the pci resume process:

1: pci_pm_bridge_power_up_actions() returns -ENODV when pci_bridge_wait_for_secondary()
returns an error (pci_walk_bus_debug() is just pci_walk_bus() with added printk()s),
and also sets state to PCI_D3cold on failure:

static int pci_pm_bridge_power_up_actions(struct pci_dev *pci_dev)
{
	int ret;

	dev_info(&pci_dev->dev, "%s %d\n", __func__, __LINE__);
	ret = pci_bridge_wait_for_secondary_bus(pci_dev, "resume");
	dev_info(&pci_dev->dev, "%s %d: ret = %d\n", __func__, __LINE__, ret);
	if (ret) {
		/*
		 * The downstream link failed to come up, so mark the
		 * devices below as disconnected to make sure we don't
		 * attempt to resume them.
		 */
		pci_walk_bus_debug(pci_dev->subordinate, pci_dev_set_disconnected,
			     NULL);
		pci_update_current_state(pci_dev, PCI_D3cold);
		dev_info(&pci_dev->dev, "%s: bridge failed to power up\n", __func__);
		return -ENODEV;
	}

	/*
	 * When powering on a bridge from D3cold, the whole hierarchy may be
	 * powered on into D0uninitialized state, resume them to give them a
	 * chance to suspend again
	 */
	pci_resume_bus(pci_dev->subordinate);
	return 0;
}

2. pci_pm_runtime_resume() has an early exit if pci_pm_bridge_power_up_actions() fails:


static int pci_pm_runtime_resume(struct device *dev)
{
	struct pci_dev *pci_dev = to_pci_dev(dev);
	const struct dev_pm_ops *pm = dev->driver ? dev->driver->pm : NULL;
	pci_power_t prev_state = pci_dev->current_state;
	int error = 0;

	/*
	 * Restoring config space is necessary even if the device is not bound
	 * to a driver because although we left it in D0, it may have gone to
	 * D3cold when the bridge above it runtime suspended.
	 */
	pci_pm_default_resume_early(pci_dev);
	pci_resume_ptm(pci_dev);

	if (!pci_dev->driver)
		return 0;

	pci_fixup_device(pci_fixup_resume_early, pci_dev);
	pci_pm_default_resume(pci_dev);

	if (prev_state == PCI_D3cold) {
		error = pci_pm_bridge_power_up_actions(pci_dev);
		if (error)
			return error;
	}
	[...]
}

3. acpiphp_check_bridge() has an early exit if pm_runtime_get_sync()
returns an error (pm_runtime_get_sync() basically returns the result of
pci_pm_runtime_resume() which is called as a callback in rpm_resume())

static void acpiphp_check_bridge(struct acpiphp_bridge *bridge)
{
	struct acpiphp_slot *slot;
	int ret;

	/* Bail out if the bridge is going away. */
	if (bridge->is_going_away)
		return;

	if (bridge->pci_dev) {
		ret = pm_runtime_get_sync(&bridge->pci_dev->dev);
		if (ret < 0) {
			dev_info(&bridge->pci_dev->dev, "%s: pm_runtime_get_sync() failed with ret = %d\n", __func__, ret);
			return;
		}
	}

With these changes I get the following messages in dmesg when acpi_power_on_list() fails
for the pci bridge:

2025-12-05T01:58:53.260991+01:00 lisa kernel: [ T1772] acpi_power_on_list_debug 649: result = -19
2025-12-05T01:58:53.260993+01:00 lisa kernel: [ T1772] pcieport 0000:00:01.1: pci_pm_default_resume_early 567
2025-12-05T01:58:53.260994+01:00 lisa kernel: [ T1772] pcieport 0000:00:01.1: pci_pm_bridge_power_up_actions 576
2025-12-05T01:58:54.282032+01:00 lisa kernel: [ T1772] pcieport 0000:00:01.1: broken device, retraining non-functional downstream link at 2.5GT/s
2025-12-05T01:58:55.282033+01:00 lisa kernel: [ T1772] pcieport 0000:00:01.1: retraining failed
2025-12-05T01:58:55.282053+01:00 lisa kernel: [ T1772] pcieport 0000:00:01.1: Data Link Layer Link Active not set in 1000 msec
2025-12-05T01:58:55.282055+01:00 lisa kernel: [ T1772] pcieport 0000:00:01.1: pci_pm_bridge_power_up_actions 578: ret = -25
2025-12-05T01:58:55.282057+01:00 lisa kernel: [ T1772] pcieport 0000:01:00.0: __pci_walk_bus_debug 0
2025-12-05T01:58:55.282058+01:00 lisa kernel: [ T1772] pcieport 0000:01:00.0: pci_dev_set_disconnected: 0
2025-12-05T01:58:55.282060+01:00 lisa kernel: [ T1772] pcieport 0000:01:00.0: pci_dev_set_disconnected: 1
2025-12-05T01:58:55.282063+01:00 lisa kernel: [ T1772] pcieport 0000:01:00.0: __pci_walk_bus_debug 1: ret = 0
2025-12-05T01:58:55.282065+01:00 lisa kernel: [ T1772] pcieport 0000:01:00.0: __pci_walk_bus_debug 1.0
2025-12-05T01:58:55.282067+01:00 lisa kernel: [ T1772] pcieport 0000:02:00.0: __pci_walk_bus_debug 0
2025-12-05T01:58:55.282070+01:00 lisa kernel: [ T1772] pcieport 0000:02:00.0: pci_dev_set_disconnected: 0
2025-12-05T01:58:55.282073+01:00 lisa kernel: [ T1772] pcieport 0000:02:00.0: pci_dev_set_disconnected: 1
2025-12-05T01:58:55.282101+01:00 lisa kernel: [ T1772] pcieport 0000:02:00.0: __pci_walk_bus_debug 1: ret = 0
2025-12-05T01:58:55.282103+01:00 lisa kernel: [ T1772] pcieport 0000:02:00.0: __pci_walk_bus_debug 1.0
2025-12-05T01:58:55.282105+01:00 lisa kernel: [ T1772] amdgpu 0000:03:00.0: __pci_walk_bus_debug 0
2025-12-05T01:58:55.282107+01:00 lisa kernel: [ T1772] amdgpu 0000:03:00.0: pci_dev_set_disconnected: 0
2025-12-05T01:58:55.282109+01:00 lisa kernel: [ T1772] amdgpu 0000:03:00.0: pci_dev_set_disconnected: 1
2025-12-05T01:58:55.282111+01:00 lisa kernel: [ T1772] amdgpu 0000:03:00.0: __pci_walk_bus_debug 1: ret = 0
2025-12-05T01:58:55.282113+01:00 lisa kernel: [ T1772] snd_hda_intel 0000:03:00.1: __pci_walk_bus_debug 0
2025-12-05T01:58:55.282115+01:00 lisa kernel: [ T1772] snd_hda_intel 0000:03:00.1: pci_dev_set_disconnected: 0
2025-12-05T01:58:55.282116+01:00 lisa kernel: [ T1772] snd_hda_intel 0000:03:00.1: pci_dev_set_disconnected: 1
2025-12-05T01:58:55.282118+01:00 lisa kernel: [ T1772] snd_hda_intel 0000:03:00.1: __pci_walk_bus_debug 1: ret = 0
2025-12-05T01:58:55.282120+01:00 lisa kernel: [ T1772] __pci_walk_bus_debug: ret = 0
2025-12-05T01:58:55.282122+01:00 lisa kernel: [ T1772] pcieport 0000:02:00.0: __pci_walk_bus_debug 1.1: ret = 0
2025-12-05T01:58:55.282124+01:00 lisa kernel: [ T1772] __pci_walk_bus_debug: ret = 0
2025-12-05T01:58:55.282126+01:00 lisa kernel: [ T1772] pcieport 0000:01:00.0: __pci_walk_bus_debug 1.1: ret = 0
2025-12-05T01:58:55.282128+01:00 lisa kernel: [ T1772] __pci_walk_bus_debug: ret = 0
2025-12-05T01:58:55.282130+01:00 lisa kernel: [ T1772] pcieport 0000:00:01.1: pci_pm_bridge_power_up_actions: bridge failed to power up
2025-12-05T01:58:55.282131+01:00 lisa kernel: [ T1772] pcieport 0000:00:01.1: rpm_resume 916 retval = -19
2025-12-05T01:58:55.282133+01:00 lisa kernel: [ T1772] pcieport 0000:00:01.1: rpm_resume 922
2025-12-05T01:58:55.282135+01:00 lisa kernel: [ T1772] pcieport 0000:00:01.1: rpm_resume 937
2025-12-05T01:58:55.282137+01:00 lisa kernel: [ T1772] pcieport 0000:00:01.1: rpm_resume 947
2025-12-05T01:58:55.282138+01:00 lisa kernel: [ T1772] pcieport 0000:00:01.1: rpm_resume 950
2025-12-05T01:58:55.282140+01:00 lisa kernel: [ T1772] pcieport 0000:00:01.1: rpm_resume 957: retval = -19
2025-12-05T01:58:55.282141+01:00 lisa kernel: [ T1772] pcieport 0000:00:01.1: rpm_resume 957: retval = -22
2025-12-05T01:58:55.523974+01:00 lisa kernel: [T192257] pcieport 0000:00:01.1: rpm_resume 957: retval = -22
2025-12-05T01:59:21.896980+01:00 lisa kernel: [ T1772] pcieport 0000:00:01.1: rpm_resume 957: retval = -22
2025-12-05T01:59:21.896986+01:00 lisa kernel: [ T1772] pcieport 0000:00:01.1: rpm_resume 957: retval = -22
2025-12-05T01:59:22.141965+01:00 lisa kernel: [T192400] pcieport 0000:00:01.1: rpm_resume 957: retval = -22
2025-12-05T01:59:51.903964+01:00 lisa kernel: [ T1772] pcieport 0000:00:01.1: rpm_resume 957: retval = -22
2025-12-05T01:59:51.903974+01:00 lisa kernel: [ T1772] pcieport 0000:00:01.1: rpm_resume 957: retval = -22
2025-12-05T01:59:52.153973+01:00 lisa kernel: [T192554] pcieport 0000:00:01.1: rpm_resume 957: retval = -22
2025-12-05T02:00:21.915965+01:00 lisa kernel: [ T1772] pcieport 0000:00:01.1: rpm_resume 957: retval = -22
2025-12-05T02:00:21.915971+01:00 lisa kernel: [ T1772] pcieport 0000:00:01.1: rpm_resume 957: retval = -22
2025-12-05T02:00:22.161971+01:00 lisa kernel: [T192706] pcieport 0000:00:01.1: rpm_resume 957: retval = -22
2025-12-05T02:00:51.922981+01:00 lisa kernel: [ T1772] pcieport 0000:00:01.1: rpm_resume 957: retval = -22
2025-12-05T02:00:51.922988+01:00 lisa kernel: [ T1772] pcieport 0000:00:01.1: rpm_resume 957: retval = -22
2025-12-05T02:00:52.175975+01:00 lisa kernel: [T192899] pcieport 0000:00:01.1: rpm_resume 957: retval = -22
2025-12-05T02:00:52.931997+01:00 lisa kernel: [T180476] pcieport 0000:00:01.1: rpm_resume 957: retval = -22
2025-12-05T02:00:52.932008+01:00 lisa kernel: [T180476] pcieport 0000:00:01.1: acpiphp_check_bridge: pm_runtime_get_sync() failed with ret = -22
2025-12-05T02:01:21.933961+01:00 lisa kernel: [ T1772] pcieport 0000:00:01.1: rpm_resume 957: retval = -22
2025-12-05T02:01:21.933965+01:00 lisa kernel: [ T1772] pcieport 0000:00:01.1: rpm_resume 957: retval = -22
2025-12-05T02:01:22.172962+01:00 lisa kernel: [T193054] pcieport 0000:00:01.1: rpm_resume 957: retval = -22
2025-12-05T02:01:23.001998+01:00 lisa kernel: [T192396] pcieport 0000:00:01.1: rpm_resume 957: retval = -22
2025-12-05T02:01:23.002009+01:00 lisa kernel: [T192396] pcieport 0000:00:01.1: acpiphp_check_bridge: pm_runtime_get_sync() failed with ret = -22
2025-12-05T02:01:51.942980+01:00 lisa kernel: [ T1772] pcieport 0000:00:01.1: rpm_resume 957: retval = -22
2025-12-05T02:01:51.942986+01:00 lisa kernel: [ T1772] pcieport 0000:00:01.1: rpm_resume 957: retval = -22
2025-12-05T02:01:52.188969+01:00 lisa kernel: [T193209] pcieport 0000:00:01.1: rpm_resume 957: retval = -22
2025-12-05T02:01:52.958995+01:00 lisa kernel: [T188262] pcieport 0000:00:01.1: rpm_resume 957: retval = -22
2025-12-05T02:01:52.959009+01:00 lisa kernel: [T188262] pcieport 0000:00:01.1: acpiphp_check_bridge: pm_runtime_get_sync() failed with ret = -22
2025-12-05T02:02:21.951971+01:00 lisa kernel: [ T1772] pcieport 0000:00:01.1: rpm_resume 957: retval = -22
2025-12-05T02:02:21.951980+01:00 lisa kernel: [ T1772] pcieport 0000:00:01.1: rpm_resume 957: retval = -22
2025-12-05T02:02:22.205974+01:00 lisa kernel: [T193359] pcieport 0000:00:01.1: rpm_resume 957: retval = -22
2025-12-05T02:02:23.117988+01:00 lisa kernel: [T192396] pcieport 0000:00:01.1: rpm_resume 957: retval = -22
2025-12-05T02:02:23.118000+01:00 lisa kernel: [T192396] pcieport 0000:00:01.1: acpiphp_check_bridge: pm_runtime_get_sync() failed with ret = -22
2025-12-05T02:02:51.963970+01:00 lisa kernel: [ T1772] pcieport 0000:00:01.1: rpm_resume 957: retval = -22
2025-12-05T02:02:51.963976+01:00 lisa kernel: [ T1772] pcieport 0000:00:01.1: rpm_resume 957: retval = -22
2025-12-05T02:02:52.209969+01:00 lisa kernel: [T193512] pcieport 0000:00:01.1: rpm_resume 957: retval = -22
2025-12-05T02:02:53.465055+01:00 lisa kernel: [T192396] pcieport 0000:00:01.1: rpm_resume 957: retval = -22
2025-12-05T02:02:53.465074+01:00 lisa kernel: [T192396] pcieport 0000:00:01.1: acpiphp_check_bridge: pm_runtime_get_sync() failed with ret = -22
2025-12-05T02:02:57.133996+01:00 lisa kernel: [T192396] pcieport 0000:00:01.1: rpm_resume 957: retval = -22
2025-12-05T02:02:57.134006+01:00 lisa kernel: [T192396] pcieport 0000:00:01.1: acpiphp_check_bridge: pm_runtime_get_sync() failed with ret = -22
These lines continue, all further resumes fail but no crash occurs.

The complete debug code used is here (it's rather messy though):
https://gitlab.freedesktop.org/spasswolf/linux-stable/-/commits/amdgpu_suspend_resume?ref_type=heads
I've not tested yet, If the fix above also works on a pure v6.14 without all the debug patches.


The bad news is I've encountered another version of the crash. (at least 90% of these crashes occur after 
acpi_power_on() fails as above, but at least 2 crashes happened on suspend) Now a third version has appeared
which occured after an SMU resume failure in amdgpu (up to now SMU failure have appeared but they did not
result in a crash like this)

2025-12-04T12:52:19.589753+01:00 T39596;amdgpu 0000:03:00.0: amdgpu: SMU is resuming...#012 SUBSYSTEM=pci#012 DEVICE=+pci:0000:03:00.0
2025-12-04T12:52:19.589753+01:00 T39596;amdgpu 0000:03:00.0: amdgpu: smu driver if version = 0x0000000f, smu fw if version = 0x00000013, smu fw program = 0,
version = 0x003b3100 (59.49.0)#012 SUBSYSTEM=pci#012 DEVICE=+pci:0000:03:00.0
2025-12-04T12:52:19.589753+01:00 T39596;amdgpu 0000:03:00.0: amdgpu: SMU driver if version not matched#012 SUBSYSTEM=pci#012 DEVICE=+pci:0000:03:00.0
2025-12-04T12:52:19.861952+01:00 T39596;amdgpu 0000:03:00.0: amdgpu: SMU: response:0xFFFFFFFF for index:6 param:0x00000000 message:EnableAllSmuFeatures?#012
SUBSYSTEM=pci#012 DEVICE=+pci:0000:03:00.0
2025-12-04T12:52:19.861952+01:00 T39596;amdgpu 0000:03:00.0: amdgpu: Failed to enable requested dpm features!#012 SUBSYSTEM=pci#012 DEVICE=+pci:0000:03:00.0
2025-12-04T12:52:19.861952+01:00 T39596;amdgpu 0000:03:00.0: amdgpu: Failed to setup smc hw!#012 SUBSYSTEM=pci#012 DEVICE=+pci:0000:03:00.0
2025-12-04T12:52:19.861952+01:00 T39596;amdgpu 0000:03:00.0: amdgpu: resume of IP block <smu> failed -121#012 SUBSYSTEM=pci#012 DEVICE=+pci:0000:03:00.0
2025-12-04T12:52:19.861952+01:00 T39596;amdgpu 0000:03:00.0: amdgpu: amdgpu_device_ip_resume failed (-121).#012 SUBSYSTEM=pci#012 DEVICE=+pci:0000:03:00.0
2025-12-04T12:52:19.861952+01:00 T39596;amdgpu 0000:03:00.0: pci_pm_runtime_resume 1380 error = -121 state = 0x0#012 SUBSYSTEM=pci#012 DEVICE=+pci:0000:03:00.0

From here on the system hangs for quite a long time (9s!) while these error messages appear on netconsole:

2025-12-04T12:52:20.870639+01:00 C0;INFO: NMI handler (perf_event_nmi_handler) took too long to run: 122.609 msecs
2025-12-04T12:52:21.072400+01:00 C0;perf: interrupt took too long (958843 > 2500), lowering kernel.perf_event_max_sample_rate to 1000
2025-12-04T12:52:21.879308+01:00 C5;INFO: NMI handler (perf_event_nmi_handler) took too long to run: 138.016 msecs
2025-12-04T12:52:23.493066+01:00 C13;INFO: NMI handler (perf_event_nmi_handler) took too long to run: 162.351 msecs
2025-12-04T12:52:23.694900+01:00 C13;perf: interrupt took too long (1268455 > 1198553), lowering kernel.perf_event_max_sample_rate to 1000
2025-12-04T12:52:26.318173+01:00 C11;INFO: NMI handler (perf_event_nmi_handler) took too long to run: 182.852 msecs
2025-12-04T12:52:28.940844+01:00 C0;perf: interrupt took too long (2079549 > 1585568), lowering kernel.perf_event_max_sample_rate to 1000
[crash, no further messages]

So perhaps acpi_php_check_bridge() is not the only place where a returned error
is incorrectly ignored.

Bert Karwatzki

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ