linux-kernel - Re: [BUG REPORT] x86/apic: CPU Hang in x86 VM During Kdump

Message-ID: <87tt2vnzsv.ffs@tglx>
Date: Tue, 29 Jul 2025 10:53:20 +0200
From: Thomas Gleixner <tglx@...utronix.de>
To: Yipeng Zou <zouyipeng@...wei.com>, mingo@...hat.com, bp@...en8.de,
 dave.hansen@...ux.intel.com, x86@...nel.org, hpa@...or.com,
 peterz@...radead.org, sohil.mehta@...el.com, rui.zhang@...el.com,
 arnd@...db.de, yuntao.wang@...ux.dev, linux-kernel@...r.kernel.org
Cc: zouyipeng@...wei.com
Subject: Re: [BUG REPORT] x86/apic: CPU Hang in x86 VM During Kdump

On Sun, Jul 27 2025 at 22:01, Thomas Gleixner wrote:

> On Wed, Jun 04 2025 at 08:33, Yipeng Zou wrote:
>> Recently, an issue has been reported where a CPU hangs in an x86 VM.
>>
>> The CPU halted during Kdump likely due to IPI issues when one CPU was
>> rebooting and another was in Kdump:
>>
>> CPU0			  CPU2
>> ========================  ======================
>> reboot			  Panic
>> machine shutdown	  Kdump
>> 			  machine shutdown
>> stop other cpus
>> 			  stop other cpus
>> ...			  ...
>> local_irq_disable	  local_irq_disable
>> send_IPIs(REBOOT)	  [critical regions]
>> [critical regions]	  1) send_IPIs(REBOOT)
>
> After staring more at it, this makes absolutely no sense at all.
>
> stop_other_cpus() does:
>
> 	/* Only proceed if this is the first CPU to reach this code */
> 	old_cpu = -1;
> 	this_cpu = smp_processor_id();
> 	if (!atomic_try_cmpxchg(&stopping_cpu, &old_cpu, this_cpu))
> 		return;
>
> So CPU2 _cannot_ reach the code which issues the reboot IPIs, because
> at that point @stopping_cpu == 0, ergo the cmpxchg() fails.
>
> So what actually happens in this case is:
>
> CPU0			  CPU2
> ========================  ======================
> reboot			  Panic
> machine shutdown	  Kdump
> 			  machine_crash_shutdown()
> stop other cpus	  local_irq_disable()
> try_cmpxchg() succeeds	  stop other cpus
> ...			  try_cmpxchg() fails
> send_IPIs(REBOOT)	  --> REBOOT vector becomes pending in IRR
> wait timeout
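
For illustration only, the first-caller-wins guard quoted above can be
modelled in userspace with C11 atomics. This is a sketch, not the kernel
code: try_become_stopping_cpu() is a made-up name, and
atomic_compare_exchange_strong() stands in for the kernel's
atomic_try_cmpxchg().

```c
#include <stdatomic.h>
#include <stdbool.h>

/* -1 means "no CPU has started stopping the others yet". */
static atomic_int stopping_cpu = -1;

/*
 * Hypothetical userspace model of the guard in stop_other_cpus().
 * Returns true only for the first caller. Every later caller finds a
 * CPU number other than -1 stored and backs off, which in the kernel
 * means returning before any reboot IPI is sent.
 */
static bool try_become_stopping_cpu(int this_cpu)
{
	int old_cpu = -1;

	return atomic_compare_exchange_strong(&stopping_cpu, &old_cpu,
					      this_cpu);
}
```

Calling try_become_stopping_cpu(0) and then try_become_stopping_cpu(2)
makes the first call win and the second one fail, matching the
"@stopping_cpu == 0" argument above.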

But looking even deeper, machine_crash_shutdown() does not end up in
stop_other_cpus() at all. It immediately uses the NMI shutdown. There
are still a few inconsistencies in that code, but they are not really
critical.

So the actual scenario is:

CPU0			  CPU2
========================  ======================
reboot			  Panic
machine shutdown	  Kdump
			  machine_crash_shutdown()
stop other cpus
send_IPIs(REBOOT)	  --> REBOOT vector becomes pending in IRR
wait timeout
                          send NMI stop
NMI -> CPU stop
                          jump to crash kernel
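
To make the ordering above concrete, here is a toy userspace model of
the two delivery paths. It is only a sketch under simplifying
assumptions: struct toy_cpu and the toy_*() helpers are invented for
this example, the vector number is just illustrative, and every
delivered IPI is treated as if it ran the stop handler. The point it
shows is that a fixed-vector IPI sent to a CPU with interrupts disabled
is only latched as pending, while an NMI takes effect regardless.

```c
#include <stdbool.h>
#include <stdint.h>

#define TOY_REBOOT_VECTOR 0xf8	/* illustrative vector number */

/* Hypothetical, heavily simplified per-CPU state. */
struct toy_cpu {
	bool irqs_enabled;	/* models the interrupt flag */
	bool stopped;		/* set once the CPU has parked itself */
	uint8_t irr[32];	/* 256 pending bits, one per vector */
};

/* Is @vector latched as pending for @cpu? */
static bool toy_irr_pending(const struct toy_cpu *cpu, uint8_t vector)
{
	return (cpu->irr[vector / 8] >> (vector % 8)) & 1u;
}

/*
 * Fixed-vector IPI: always latched in the pending bitmap, but only
 * delivered (and the bit cleared) when the target has interrupts on.
 */
static void toy_send_ipi(struct toy_cpu *target, uint8_t vector)
{
	uint8_t bit = (uint8_t)(1u << (vector % 8));

	target->irr[vector / 8] |= bit;
	if (target->irqs_enabled) {
		target->irr[vector / 8] &= (uint8_t)~bit;
		target->stopped = true;	/* toy stand-in for the handler */
	}
}

/* NMI: not gated by irqs_enabled, so the target always stops. */
static void toy_send_nmi(struct toy_cpu *target)
{
	target->stopped = true;
}
```

With both toy CPUs starting with interrupts disabled,
toy_send_ipi(&cpu2, TOY_REBOOT_VECTOR) leaves cpu2 running with the
vector pending, and a following toy_send_nmi(&cpu0) stops cpu0. cpu2
then carries the pending bit along, which is the state the crash kernel
inherits in the scenario above.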

So the patch I gave you should handle the reboot vector pending in IRR
gracefully. Can you please give it a try?

Thanks,

        tglx