Message-ID: <20260113094129.3357-1-spasswolf@web.de>
Date: Tue, 13 Jan 2026 10:41:25 +0100
From: Bert Karwatzki <spasswolf@....de>
To: linux-kernel@...r.kernel.org
Cc: Bert Karwatzki <spasswolf@....de>,
	linux-next@...r.kernel.org,
	Mario Limonciello <mario.limonciello@....com>,
	Sebastian Andrzej Siewior <bigeasy@...utronix.de>,
	Clark Williams <clrkwllms@...nel.org>,
	Steven Rostedt <rostedt@...dmis.org>,
	Christian König <christian.koenig@....com>,
	regressions@...ts.linux.dev,
	linux-pci@...r.kernel.org,
	linux-acpi@...r.kernel.org,
	"Rafael J . Wysocki" <rafael.j.wysocki@...el.com>,
	acpica-devel@...ts.linux.dev,
	Robert Moore <robert.moore@...el.com>,
	Saket Dumbre <saket.dumbre@...el.com>,
	Bjorn Helgaas <bhelgaas@...gle.com>,
	Clemens Ladisch <clemens@...isch.de>,
	Jinchao Wang <wangjinchao600@...il.com>,
	Yury Norov <yury.norov@...il.com>,
	Anna Schumaker <anna.schumaker@...cle.com>,
	Baoquan He <bhe@...hat.com>,
	"Darrick J. Wong" <djwong@...nel.org>,
	Dave Young <dyoung@...hat.com>,
	Doug Anderson <dianders@...omium.org>,
	"Guilherme G. Piccoli" <gpiccoli@...lia.com>,
	Helge Deller <deller@....de>,
	Ingo Molnar <mingo@...nel.org>,
	Jason Gunthorpe <jgg@...pe.ca>,
	Jonathan Cameron <Jonathan.Cameron@...wei.com>,
	Joel Granados <joel.granados@...nel.org>,
	John Ogness <john.ogness@...utronix.de>,
	Kees Cook <kees@...nel.org>,
	Li Huafei <lihuafei1@...wei.com>,
	"Luck, Tony" <tony.luck@...el.com>,
	Luo Gengkun <luogengkun@...weicloud.com>,
	Max Kellermann <max.kellermann@...os.com>,
	Nam Cao <namcao@...utronix.de>,
	oushixiong <oushixiong@...inos.cn>,
	Petr Mladek <pmladek@...e.com>,
	Qianqiang Liu <qianqiang.liu@....com>,
	Sergey Senozhatsky <senozhatsky@...omium.org>,
	Sohil Mehta <sohil.mehta@...el.com>,
	Tejun Heo <tj@...nel.org>,
	Thomas Gleixner <tglx@...utronix.de>,
	Thomas Zimmermann <tzimmermann@...e.de>,
	Thorsten Blum <thorsten.blum@...ux.dev>,
	Ville Syrjala <ville.syrjala@...ux.intel.com>,
	Vivek Goyal <vgoyal@...hat.com>,
	Yicong Yang <yangyicong@...ilicon.com>,
	Yunhui Cui <cuiyunhui@...edance.com>,
	Andrew Morton <akpm@...ux-foundation.org>,
	W_Armin@....de
Subject: NMI stack overflow during resume of PCIe bridge with CONFIG_HARDLOCKUP_DETECTOR=y

The investigation into this bug has taken yet another dramatic turn.
Here is a summary of what I've found so far:

On my MSI Alpha 15 dual-GPU laptop with the following hardware:

$ lspci -nn
00:00.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Renoir/Cezanne Root Complex [1022:1630]
00:00.2 IOMMU [0806]: Advanced Micro Devices, Inc. [AMD] Renoir/Cezanne IOMMU [1022:1631]
00:01.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Renoir PCIe Dummy Host Bridge [1022:1632]
00:01.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Renoir PCIe GPP Bridge [1022:1633]
00:02.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Renoir PCIe Dummy Host Bridge [1022:1632]
00:02.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Renoir/Cezanne PCIe GPP Bridge [1022:1634]
00:02.2 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Renoir/Cezanne PCIe GPP Bridge [1022:1634]
00:02.3 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Renoir/Cezanne PCIe GPP Bridge [1022:1634]
00:02.4 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Renoir/Cezanne PCIe GPP Bridge [1022:1634]
00:08.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Renoir PCIe Dummy Host Bridge [1022:1632]
00:08.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Renoir Internal PCIe GPP Bridge to Bus [1022:1635]
00:14.0 SMBus [0c05]: Advanced Micro Devices, Inc. [AMD] FCH SMBus Controller [1022:790b] (rev 51)
00:14.3 ISA bridge [0601]: Advanced Micro Devices, Inc. [AMD] FCH LPC Bridge [1022:790e] (rev 51)
00:18.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Cezanne Data Fabric; Function 0 [1022:166a]
00:18.1 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Cezanne Data Fabric; Function 1 [1022:166b]
00:18.2 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Cezanne Data Fabric; Function 2 [1022:166c]
00:18.3 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Cezanne Data Fabric; Function 3 [1022:166d]
00:18.4 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Cezanne Data Fabric; Function 4 [1022:166e]
00:18.5 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Cezanne Data Fabric; Function 5 [1022:166f]
00:18.6 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Cezanne Data Fabric; Function 6 [1022:1670]
00:18.7 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Cezanne Data Fabric; Function 7 [1022:1671]
01:00.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 XL Upstream Port of PCI Express Switch [1002:1478] (rev c3)
02:00.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 XL Downstream Port of PCI Express Switch [1002:1479]
03:00.0 Display controller [0380]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 23 [Radeon RX 6600/6600 XT/6600M] [1002:73ff] (rev c3)
03:00.1 Audio device [0403]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 21/23 HDMI/DP Audio Controller [1002:ab28]
04:00.0 Network controller [0280]: MEDIATEK Corp. MT7921K (RZ608) Wi-Fi 6E 80MHz [14c3:0608]
05:00.0 Ethernet controller [0200]: Realtek Semiconductor Co., Ltd. RTL8111/8168/8211/8411 PCI Express Gigabit Ethernet Controller [10ec:8168] (rev 15)
06:00.0 Non-Volatile memory controller [0108]: Kingston Technology Company, Inc. KC3000/FURY Renegade NVMe SSD [E18] [2646:5013] (rev 01)
07:00.0 Non-Volatile memory controller [0108]: Micron/Crucial Technology P1 NVMe PCIe SSD[Frampton] [c0a9:2263] (rev 03)
08:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Cezanne [Radeon Vega Series / Radeon Vega Mobile Series] [1002:1638] (rev c5)
08:00.1 Audio device [0403]: Advanced Micro Devices, Inc. [AMD/ATI] Renoir Radeon High Definition Audio Controller [1002:1637]
08:00.2 Encryption controller [1080]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 10h-1fh) Platform Security Processor [1022:15df]
08:00.3 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD] Renoir/Cezanne USB 3.1 [1022:1639]
08:00.4 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD] Renoir/Cezanne USB 3.1 [1022:1639]
08:00.5 Multimedia controller [0480]: Advanced Micro Devices, Inc. [AMD] Audio Coprocessor [1022:15e2] (rev 01)
08:00.6 Audio device [0403]: Advanced Micro Devices, Inc. [AMD] Family 17h/19h/1ah HD Audio Controller [1022:15e3]
08:00.7 Signal processing controller [1180]: Advanced Micro Devices, Inc. [AMD] Sensor Fusion Hub [1022:15e4]

I've been encountering random crashes when resuming the discrete GPU

03:00.0 Display controller [0380]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 23 [Radeon RX 6600/6600 XT/6600M] [1002:73ff] (rev c3)

These random crashes can be provoked by either of the following scripts:

#!/bin/bash
for i in {0..10000}
do
	echo $i
	evolution &
	sleep 5
	killall evolution
	sleep 5
done

or

#!/bin/bash

while :
do
	DRI_PRIME=1 glxinfo > /dev/null
	sleep 10
done

though it still takes between 2 and 5 hours to trigger a crash.

The actual crash happens when resuming the PCI bridge to which the GPU is connected

00:01.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Renoir PCIe GPP Bridge [1022:1633]

As there are no error messages for this in dmesg (using netconsole), and neither kdump
nor kgdboe works in this case, I used printk()s in the resume code to try to locate the
exact place of the crash. The last printk message shown before the crash came from the
ACPICA interpreter, in acpi_ex_system_memory_space_handler():

acpi_ex_system_memory_space_handler(...)
{
	[...]
	/*
	 * Perform the memory read or write
	 *
	 * Note: For machines that do not support non-aligned transfers, the target
	 * address was checked for alignment above. We do not attempt to break the
	 * transfer up into smaller (byte-size) chunks because the AML specifically
	 * asked for a transfer width that the hardware may require.
	 */
	switch (function) {
	case ACPI_READ:
		if (debug)
			printk(KERN_INFO "%s %d value = %px\n", __func__, __LINE__, value);

		*value = 0;
		if (debug)
			printk(KERN_INFO "%s %d\n", __func__, __LINE__);
		switch (bit_width) {
		case 8:

			*value = (u64)ACPI_GET8(logical_addr_ptr);
			break;

		case 16:

			*value = (u64)ACPI_GET16(logical_addr_ptr);
			break;

		case 32:

			if (debug) // This is the last message shown on netconsole!
				printk(KERN_INFO "%s %d: logical_addr_ptr = %px\n", __func__, __LINE__, logical_addr_ptr);
			*value = (u64)ACPI_GET32(logical_addr_ptr);
			if (debug)
				printk(KERN_INFO "%s %d\n", __func__, __LINE__);
			break;

		case 64:

			*value = (u64)ACPI_GET64(logical_addr_ptr);
			break;

		default:

			/* bit_width was already validated */

			break;
		}
		break;

	case ACPI_WRITE:
		if (debug)
			printk(KERN_INFO "%s %d\n", __func__, __LINE__);

		switch (bit_width) {
			[...]
		}
		break;

	default:
		if (debug)
			printk(KERN_INFO "%s %d\n", __func__, __LINE__);

		status = AE_BAD_PARAMETER;
		break;
	}

	[...]
}

The memory which ACPICA is trying to read is at physical address 0xf0100000,
which falls within the PCI ECAM region on my machine (from /proc/iomem):

f0000000-fcffffff : PCI Bus 0000:00
  f0000000-f7ffffff : PCI ECAM 0000 [bus 00-7f]
    f0000000-f7ffffff : pnp 00:00

According to the PCIe specification, the failing address 0xf0100000 belongs to bus 01,
which sits behind the bridge being resumed:
00:01.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Renoir PCIe Dummy Host Bridge [1022:1632]
00:01.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Renoir PCIe GPP Bridge [1022:1633]
so the error occurs when trying to read ECAM memory belonging to a device that has stopped responding.
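For reference, the bus/device/function can be read directly off the ECAM offset,
since ECAM maps configuration space as (bus << 20 | dev << 15 | fn << 12 | reg).
A quick sketch (the ecam_decode helper is mine, for illustration only):

```shell
#!/bin/bash
# Decode an ECAM physical address into bus:dev.fn, given the ECAM base.
# PCIe ECAM layout: offset = (bus << 20) | (dev << 15) | (fn << 12) | reg
ecam_decode() {
	local addr=$(( $1 )) base=$(( $2 ))
	local off=$(( addr - base ))
	printf '%02x:%02x.%x\n' \
		$(( off >> 20 )) $(( (off >> 15) & 0x1f )) $(( (off >> 12) & 0x7 ))
}

# Failing address from the log, ECAM base from /proc/iomem:
ecam_decode 0xf0100000 0xf0000000
```

This decodes 0xf0100000 to 01:00.0, i.e. the Navi 10 XL upstream port of the
PCI Express switch behind bridge 00:01.1, consistent with the above.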

The code used to get to this point (based on v6.14; it's rather messy) can be found here:
https://gitlab.freedesktop.org/spasswolf/linux-stable/-/commits/amdgpu_suspend_resume_4?ref_type=heads
and more details on the investigation can be found here:
https://github.com/acpica/acpica/issues/1060

So we seem to have a read from I/O memory whose physical address no longer responds
because the device stopped working. I consulted the documentation (AMD64 Architecture
Programmer's Manual, Volume 2: System Programming) to find out whether an exception is
raised in this case, but the documentation does not cover it.

So I put printk()s in most of the exception handlers to find out whether there is a chance
to catch this failed memory access and work around it. The code used in this investigation can be found here:
https://gitlab.freedesktop.org/spasswolf/linux-stable/-/commits/amdgpu_suspend_resume_fault_handler?ref_type=heads

The result is that after the debug messages from the ACPICA interpreter
stop (where I previously thought a crash had occurred), there are messages from exc_nmi()
and the functions it calls.

Here's what I've found for a normal NMI:
2026-01-12T04:23:56.396721+01:00 T279584;exc_nmi: 0
2026-01-12T04:23:56.396721+01:00 C10;exc_nmi: 10.3
2026-01-12T04:23:56.396721+01:00 C10;default_do_nmi 
2026-01-12T04:23:56.396721+01:00 C10;nmi_handle: type=0x0
2026-01-12T04:23:56.396721+01:00 C10;nmi_handle: a=0xffffffffa1612de0
2026-01-12T04:23:56.396721+01:00 C10;nmi_handle: a->handler=perf_event_nmi_handler+0x0/0xa6
2026-01-12T04:23:56.396721+01:00 C10;perf_event_nmi_handler: 0
2026-01-12T04:23:56.396721+01:00 C10;perf_event_nmi_handler: 1
2026-01-12T04:23:56.396721+01:00 C10;perf_event_nmi_handler: 2
2026-01-12T04:23:56.396721+01:00 C10;x86_pmu_handle_irq: 2
2026-01-12T04:23:56.396721+01:00 C10;x86_pmu_handle_irq: 2.6
2026-01-12T04:23:56.396721+01:00 C10;__perf_event_overflow: 0
2026-01-12T04:23:56.396721+01:00 C10;__perf_event_overflow: 6.99: overflow_handler=watchdog_overflow_callback+0x0/0x10d
2026-01-12T04:23:56.396721+01:00 C10;watchdog_overflow_callback: 0
2026-01-12T04:23:56.396721+01:00 C10;__ktime_get_fast_ns_debug: 0.1
2026-01-12T04:23:56.396721+01:00 C10;tk_clock_read_debug: read=read_hpet+0x0/0xf0
2026-01-12T04:23:56.396721+01:00 C10;read_hpet: 0
2026-01-12T04:23:56.396721+01:00 C10;read_hpet: 0.1
2026-01-12T04:23:56.396721+01:00 C10;timekeeping_cycles_to_ns_debug: 0
2026-01-12T04:23:56.396721+01:00 C10;watchdog_check_timestamp: 0
2026-01-12T04:23:56.396721+01:00 C10;watchdog_check_timestamp: 1
2026-01-12T04:23:56.396721+01:00 C10;watchdog_check_timestamp: 2
2026-01-12T04:23:56.396721+01:00 C10;watchdog_check_timestamp: 3
2026-01-12T04:23:56.396721+01:00 C10;watchdog_overflow_callback: 1
2026-01-12T04:23:56.396721+01:00 C10;watchdog_overflow_callback: 2
2026-01-12T04:23:56.396721+01:00 C10;__perf_event_overflow: 7
2026-01-12T04:23:56.396721+01:00 C10;__perf_event_overflow: ret=0x0
2026-01-12T04:23:56.396721+01:00 C10;x86_pmu_handle_irq: 2.7
2026-01-12T04:23:56.396721+01:00 C10;x86_pmu_handle_irq: handled=0x1
2026-01-12T04:23:56.396721+01:00 C10;perf_event_nmi_handler: ret = 1
2026-01-12T04:23:56.396721+01:00 C10;perf_event_nmi_handler: 3
2026-01-12T04:23:56.396721+01:00 C10;nmi_handle: thishandled=0x1
2026-01-12T04:23:56.396721+01:00 C10;nmi_handle: a=0xffffffffa1623040
2026-01-12T04:23:56.396721+01:00 C10;nmi_handle: a->handler=nmi_cpu_backtrace_handler+0x0/0x20
2026-01-12T04:23:56.396721+01:00 C10;nmi_handle: thishandled=0x0
2026-01-12T04:23:56.396721+01:00 C10;nmi_handle: a=0xffffffffa16148e0
2026-01-12T04:23:56.396721+01:00 C10;nmi_handle: a->handler=perf_ibs_nmi_handler+0x0/0x60
2026-01-12T04:23:56.396721+01:00 C10;nmi_handle: thishandled=0x0
2026-01-12T04:23:56.396721+01:00 C10;exc_nmi: 10.4
2026-01-12T04:23:56.396721+01:00 T279584;exc_nmi: 11
2026-01-12T04:23:56.396721+01:00 T279584;exc_nmi: 12
2026-01-12T04:23:56.396721+01:00 T279584;exc_nmi: 13

Here the NMI handling completes without triggering additional NMIs.

Here's the result in case of the crash:
2026-01-12T04:24:36.809904+01:00 T1510;acpi_ex_system_memory_space_handler 255: logical_addr_ptr = ffffc066977b3000
2026-01-12T04:24:36.846170+01:00 C14;exc_nmi: 0
2026-01-12T04:24:36.960760+01:00 C14;exc_nmi: 10.3
2026-01-12T04:24:36.960760+01:00 C14;default_do_nmi 
2026-01-12T04:24:36.960760+01:00 C14;nmi_handle: type=0x0
2026-01-12T04:24:36.960760+01:00 C14;nmi_handle: a=0xffffffffa1612de0
2026-01-12T04:24:36.960760+01:00 C14;nmi_handle: a->handler=perf_event_nmi_handler+0x0/0xa6
2026-01-12T04:24:36.960760+01:00 C14;perf_event_nmi_handler: 0
2026-01-12T04:24:36.960760+01:00 C14;perf_event_nmi_handler: 1
2026-01-12T04:24:36.960760+01:00 C14;perf_event_nmi_handler: 2
2026-01-12T04:24:36.960760+01:00 C14;x86_pmu_handle_irq: 2
2026-01-12T04:24:36.960760+01:00 C14;x86_pmu_handle_irq: 2.6
2026-01-12T04:24:36.960760+01:00 C14;__perf_event_overflow: 0
2026-01-12T04:24:36.960760+01:00 C14;__perf_event_overflow: 6.99: overflow_handler=watchdog_overflow_callback+0x0/0x10d
2026-01-12T04:24:36.960760+01:00 C14;watchdog_overflow_callback: 0
2026-01-12T04:24:36.960760+01:00 C14;__ktime_get_fast_ns_debug: 0.1
2026-01-12T04:24:36.960760+01:00 C14;tk_clock_read_debug: read=read_hpet+0x0/0xf0
2026-01-12T04:24:36.960760+01:00 C14;read_hpet: 0
2026-01-12T04:24:36.960760+01:00 C14;read_hpet: 0.1
2026-01-12T04:24:36.960760+01:00 T0;exc_nmi: 0
2026-01-12T04:24:38.674625+01:00 C13;exc_nmi: 10.3
2026-01-12T04:24:38.674625+01:00 C13;default_do_nmi 
2026-01-12T04:24:38.674625+01:00 C13;nmi_handle: type=0x0
2026-01-12T04:24:38.674625+01:00 C13;nmi_handle: a=0xffffffffa1612de0
2026-01-12T04:24:38.674625+01:00 C13;nmi_handle: a->handler=perf_event_nmi_handler+0x0/0xa6
2026-01-12T04:24:38.674625+01:00 C13;perf_event_nmi_handler: 0
2026-01-12T04:24:38.674625+01:00 C13;perf_event_nmi_handler: 1
2026-01-12T04:24:38.674625+01:00 C13;perf_event_nmi_handler: 2
2026-01-12T04:24:38.674625+01:00 C13;x86_pmu_handle_irq: 2
2026-01-12T04:24:38.674625+01:00 C13;x86_pmu_handle_irq: 2.6
2026-01-12T04:24:38.674625+01:00 C13;__perf_event_overflow: 0
2026-01-12T04:24:38.674625+01:00 C13;__perf_event_overflow: 6.99: overflow_handler=watchdog_overflow_callback+0x0/0x10d
2026-01-12T04:24:38.674625+01:00 C13;watchdog_overflow_callback: 0
2026-01-12T04:24:38.674625+01:00 C13;__ktime_get_fast_ns_debug: 0.1
2026-01-12T04:24:38.674625+01:00 C13;tk_clock_read_debug: read=read_hpet+0x0/0xf0
2026-01-12T04:24:38.674625+01:00 C13;read_hpet: 0
2026-01-12T04:24:38.674625+01:00 C13;read_hpet: 0.1
2026-01-12T04:24:38.674625+01:00 T0;exc_nmi: 0
2026-01-12T04:24:39.355101+01:00 C2;exc_nmi: 10.3
2026-01-12T04:24:39.355101+01:00 C2;default_do_nmi 
2026-01-12T04:24:39.355101+01:00 C2;nmi_handle: type=0x0
2026-01-12T04:24:39.355101+01:00 C2;nmi_handle: a=0xffffffffa1612de0
2026-01-12T04:24:39.355101+01:00 C2;nmi_handle: a->handler=perf_event_nmi_handler+0x0/0xa6
2026-01-12T04:24:39.355101+01:00 C2;perf_event_nmi_handler: 0
2026-01-12T04:24:39.355101+01:00 C2;perf_event_nmi_handler: 1
2026-01-12T04:24:39.355101+01:00 C2;perf_event_nmi_handler: 2
2026-01-12T04:24:39.355101+01:00 C2;x86_pmu_handle_irq: 2
2026-01-12T04:24:39.355101+01:00 C2;x86_pmu_handle_irq: 2.6
2026-01-12T04:24:39.355101+01:00 C2;__perf_event_overflow: 0
2026-01-12T04:24:39.355101+01:00 C2;__perf_event_overflow: 6.99: overflow_handler=watchdog_overflow_callback+0x0/0x10d
2026-01-12T04:24:39.355101+01:00 C2;watchdog_overflow_callback: 0
2026-01-12T04:24:39.355101+01:00 C2;__ktime_get_fast_ns_debug: 0.1
2026-01-12T04:24:39.355101+01:00 C2;tk_clock_read_debug: read=read_hpet+0x0/0xf0
2026-01-12T04:24:39.355101+01:00 C2;read_hpet: 0
2026-01-12T04:24:39.355101+01:00 C2;read_hpet: 0.1
2026-01-12T04:24:39.355101+01:00 T0;exc_nmi: 0
2026-01-12T04:24:39.410207+01:00 C0;exc_nmi: 10.3
2026-01-12T04:24:39.410207+01:00 C0;default_do_nmi 
2026-01-12T04:24:39.410207+01:00 C0;nmi_handle: type=0x0
2026-01-12T04:24:39.410207+01:00 C0;nmi_handle: a=0xffffffffa1612de0
2026-01-12T04:24:39.410207+01:00 C0;nmi_handle: a->handler=perf_event_nmi_handler+0x0/0xa6
2026-01-12T04:24:39.410207+01:00 C0;perf_event_nmi_handler: 0
2026-01-12T04:24:39.410207+01:00 C0;perf_event_nmi_handler: 1
2026-01-12T04:24:39.410207+01:00 C0;perf_event_nmi_handler: 2
2026-01-12T04:24:39.410207+01:00 C0;x86_pmu_handle_irq: 2
2026-01-12T04:24:39.410207+01:00 C0;x86_pmu_handle_irq: 2.6
2026-01-12T04:24:39.410207+01:00 C0;__perf_event_overflow: 0
2026-01-12T04:24:39.410207+01:00 C0;__perf_event_overflow: 6.99: overflow_handler=watchdog_overflow_callback+0x0/0x10d
2026-01-12T04:24:39.410207+01:00 C0;watchdog_overflow_callback: 0
2026-01-12T04:24:39.410207+01:00 C0;__ktime_get_fast_ns_debug: 0.1
2026-01-12T04:24:39.410207+01:00 C0;tk_clock_read_debug: read=read_hpet+0x0/0xf0
2026-01-12T04:24:39.410207+01:00 C0;read_hpet: 0
2026-01-12T04:24:39.410207+01:00 C0;read_hpet: 0.1
2026-01-12T04:24:39.410207+01:00 T0;exc_nmi: 0

In the crash case the NMI handler never returns: accessing the HPET triggers yet
another NMI, and this repeats until the machine crashes, probably due to an NMI
stack overflow.

One can work around this bug by disabling CONFIG_HARDLOCKUP_DETECTOR in .config, though
I've only tested this twice so far.
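If rebuilding the kernel is inconvenient, the hardlockup detector can, to my
knowledge, also be switched off at runtime via the standard sysctl, which should
make A/B testing this workaround faster:

```shell
# Disable the perf/NMI hardlockup watchdog at runtime
# (same effect as booting with nmi_watchdog=0 on the kernel command line):
echo 0 | sudo tee /proc/sys/kernel/nmi_watchdog
```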

The behaviour described here looks similar to the bug fixed by commit
3d5f4f15b778 ("watchdog: skip checks when panic is in progress"), but it must be a
different bug, as kernel 6.18 (which contains 3d5f4f15b778) is also affected: in 5
test runs with 6.18 I got 4 crashes, occurring after 0.5h, 1h, 4.5h and 1.5h of
testing. Nevertheless, the two look similar enough that I'm CCing the people involved there.

Bert Karwatzki

