Message-ID: <87ecu1pfnn.ffs@tglx>
Date: Sun, 27 Jul 2025 22:01:00 +0200
From: Thomas Gleixner <tglx@...utronix.de>
To: Yipeng Zou <zouyipeng@...wei.com>, mingo@...hat.com, bp@...en8.de,
 dave.hansen@...ux.intel.com, x86@...nel.org, hpa@...or.com,
 peterz@...radead.org, sohil.mehta@...el.com, rui.zhang@...el.com,
 arnd@...db.de, yuntao.wang@...ux.dev, linux-kernel@...r.kernel.org
Cc: zouyipeng@...wei.com
Subject: Re: [BUG REPORT] x86/apic: CPU Hang in x86 VM During Kdump

On Wed, Jun 04 2025 at 08:33, Yipeng Zou wrote:
> Recently, an issue has been reported where a CPU hangs in an x86 VM.
>
> The CPU halted during Kdump likely due to IPI issues when one CPU was
> rebooting and another was in Kdump:
>
> CPU0			  CPU2
> ========================  ======================
> reboot			  Panic
> machine shutdown	  Kdump
> 			  machine shutdown
> stop other cpus
> 			  stop other cpus
> ...			  ...
> local_irq_disable	  local_irq_disable
> send_IPIs(REBOOT)	  [critical regions]
> [critical regions]	  1) send_IPIs(REBOOT)

After staring more at it, this makes absolutely no sense at all.

stop_other_cpus() does:

	/* Only proceed if this is the first CPU to reach this code */
	old_cpu = -1;
	this_cpu = smp_processor_id();
	if (!atomic_try_cmpxchg(&stopping_cpu, &old_cpu, this_cpu))
		return;

So CPU2 _cannot_ reach the code which issues the reboot IPIs, because
at that point CPU0 has already set @stopping_cpu to 0, ergo the cmpxchg()
on CPU2 fails and it returns early.
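
To make the "first CPU wins" check concrete, here is a minimal userspace
sketch. It assumes C11 <stdatomic.h> as a stand-in for the kernel's
atomic_t API; claim_stopping_cpu() and replay_scenario() are hypothetical
helpers for illustration, not kernel functions.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Userspace stand-in for the kernel's @stopping_cpu; -1 means "unclaimed". */
static atomic_int stopping_cpu = -1;

/*
 * Hypothetical helper: returns true only for the first caller, mirroring
 * the atomic_try_cmpxchg() check at the top of stop_other_cpus().
 */
static bool claim_stopping_cpu(int this_cpu)
{
	int old_cpu = -1;

	return atomic_compare_exchange_strong(&stopping_cpu, &old_cpu, this_cpu);
}

/* Replays the ordering discussed above: CPU0 arrives first, then CPU2. */
static void replay_scenario(void)
{
	assert(claim_stopping_cpu(0));	/* CPU0 wins, would go on to send IPIs */
	assert(!claim_stopping_cpu(2));	/* CPU2 loses, would return early */
}
```

Once CPU0 has claimed the slot, every later caller sees a value other
than -1, so the compare-exchange fails no matter how often it is retried.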

So what actually happens in this case is:

CPU0			  CPU2
========================  ======================
reboot			  Panic
machine shutdown	  Kdump
			  machine_crash_shutdown()
stop other cpus		  local_irq_disable()
try_cmpxchg() succeeds	  stop other cpus
...			  try_cmpxchg() fails
send_IPIs(REBOOT)	  --> REBOOT vector becomes pending in IRR
wait timeout

And from there on everything becomes a lottery as CPU0 continues to
execute and CPU2 proceeds and jumps into the crash kernel...
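
The "pending in IRR" and "wait timeout" steps in the diagram can be
illustrated with a toy model. This is only a sketch under loud
assumptions: struct toy_cpu, the toy_*() helpers and TOY_REBOOT_VECTOR
are made up for illustration, the IRR is reduced to a 64-bit mask, and
nothing here reflects how the local APIC or the kernel is implemented.

```c
#include <stdbool.h>
#include <stdint.h>

#define TOY_REBOOT_VECTOR 5	/* arbitrary bit number, not the real vector */

/* Toy CPU: an interrupt-enable flag and a one-word "IRR". */
struct toy_cpu {
	bool irqs_enabled;
	uint64_t irr;		/* one bit per pending toy vector (0..63) */
	bool stopped;
};

/* Sending an IPI only marks the vector pending on the target. */
static void toy_send_ipi(struct toy_cpu *cpu, unsigned int vec)
{
	cpu->irr |= UINT64_C(1) << vec;
}

/* A pending vector is only handled while interrupts are enabled. */
static void toy_deliver_pending(struct toy_cpu *cpu)
{
	uint64_t bit = UINT64_C(1) << TOY_REBOOT_VECTOR;

	if (!cpu->irqs_enabled || !(cpu->irr & bit))
		return;

	cpu->irr &= ~bit;
	cpu->stopped = true;
}

/* Sender side: poll a bounded number of times, then give up. */
static bool toy_wait_for_stop(struct toy_cpu *cpu, int max_polls)
{
	for (int i = 0; i < max_polls; i++) {
		toy_deliver_pending(cpu);
		if (cpu->stopped)
			return true;
	}
	return false;
}
```

With the target's interrupts disabled, toy_wait_for_stop() runs into its
poll limit while the vector bit stays set, which is the model's analogue
of CPU0 timing out while CPU2 keeps running.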

This whole logic is broken...

Nevertheless the patch I sent earlier is definitely making things more
robust, but it won't solve your particular problem.

Thanks,

        tglx




