Message-ID: <87tt2vnzsv.ffs@tglx>
Date: Tue, 29 Jul 2025 10:53:20 +0200
From: Thomas Gleixner <tglx@...utronix.de>
To: Yipeng Zou <zouyipeng@...wei.com>, mingo@...hat.com, bp@...en8.de,
dave.hansen@...ux.intel.com, x86@...nel.org, hpa@...or.com,
peterz@...radead.org, sohil.mehta@...el.com, rui.zhang@...el.com,
arnd@...db.de, yuntao.wang@...ux.dev, linux-kernel@...r.kernel.org
Cc: zouyipeng@...wei.com
Subject: Re: [BUG REPORT] x86/apic: CPU Hang in x86 VM During Kdump
On Sun, Jul 27 2025 at 22:01, Thomas Gleixner wrote:
> On Wed, Jun 04 2025 at 08:33, Yipeng Zou wrote:
>> Recently, an issue has been reported where a CPU hangs in an x86 VM.
>>
>> The CPU halted during Kdump likely due to IPI issues when one CPU was
>> rebooting and another was in Kdump:
>>
>> CPU0                        CPU2
>> ========================    ======================
>> reboot                      Panic
>> machine shutdown            Kdump
>>                             machine shutdown
>> stop other cpus
>>                             stop other cpus
>> ...                         ...
>> local_irq_disable           local_irq_disable
>> send_IPIs(REBOOT)           [critical regions]
>> [critical regions]          1) send_IPIs(REBOOT)
>
> After staring more at it, this makes absolutely no sense at all.
>
> stop_other_cpus() does:
>
>     /* Only proceed if this is the first CPU to reach this code */
>     old_cpu = -1;
>     this_cpu = smp_processor_id();
>     if (!atomic_try_cmpxchg(&stopping_cpu, &old_cpu, this_cpu))
>             return;
>
> So CPU2 _cannot_ reach the code, which issues the reboot IPIs, because
> at that point @stopping_cpu == 0 ergo the cmpxchg() fails.
>
> So what actually happens in this case is:
>
> CPU0                        CPU2
> ========================    ======================
> reboot                      Panic
> machine shutdown            Kdump
>                             machine_crash_shutdown()
> stop other cpus             local_irq_disable()
> try_cmpxchg() succeeds      stop other cpus
> ...                         try_cmpxchg() fails
> send_IPIs(REBOOT)       --> REBOOT vector becomes pending in IRR
> wait timeout

But looking even deeper, machine_crash_shutdown() does not end up in
stop_other_cpus() at all. It immediately uses the NMI shutdown. There
are still a few inconsistencies in that code, but they are not really
critical.

So the actual scenario is:

CPU0                        CPU2
========================    ======================
reboot                      Panic
machine shutdown            Kdump
                            machine_crash_shutdown()
stop other cpus
send_IPIs(REBOOT)       --> REBOOT vector becomes pending in IRR
wait timeout
                            send NMI stop
NMI -> CPU stop
                            jump to crash kernel

So the patch I gave you should handle the reboot vector pending in IRR
gracefully. Can you please give it a try?

Thanks,

        tglx