Message-ID: <b31a5b91-bc94-46ce-8191-c6576c04f05b@huawei.com>
Date: Sat, 26 Jul 2025 17:50:11 +0800
From: Yipeng Zou <zouyipeng@...wei.com>
To: <tglx@...utronix.de>, <mingo@...hat.com>, <bp@...en8.de>,
<dave.hansen@...ux.intel.com>, <x86@...nel.org>, <hpa@...or.com>,
<peterz@...radead.org>, <sohil.mehta@...el.com>, <rui.zhang@...el.com>,
<arnd@...db.de>, <yuntao.wang@...ux.dev>, <linux-kernel@...r.kernel.org>
Subject: Re: [BUG REPORT] x86/apic: CPU Hang in x86 VM During Kdump
Hi Thomas:
I skipped sending the NMI in native_stop_other_cpus(), and the test
passed.
However, this change reverts the fix introduced by commit [1],
which was intended to handle cases where the reboot IPI is not properly
handled by all CPUs.
Given this, is there an alternative way to resolve the issue, or
can we simply mask the IPI directly at that point?
[1] 747d5a1bf293 ("x86/reboot: Always use NMI fallback when
shutdown via reboot vector IPI fails")
On 2025/6/4 16:33, Yipeng Zou wrote:
> Recently, an issue was reported in which a CPU hangs in an x86 VM.
>
> The CPU halted during Kdump, likely because of IPI issues when one CPU was
> rebooting and another was in Kdump:
>
> CPU0                        CPU2
> ========================    ======================
> reboot                      Panic
> machine shutdown            Kdump
>                             machine shutdown
> stop other cpus
>                             stop other cpus
> ...                         ...
> local_irq_disable           local_irq_disable
> send_IPIs(REBOOT)           [critical regions]
> [critical regions]          1) send_IPIs(REBOOT)
>                             wait timeout
>                             2) send_IPIs(NMI);
> Halt,NMI context
>                             3) lapic_shutdown [IPI is pending]
>                             ...
>                             second kernel start
>                             4) init_bsp_APIC [IPI is pending]
>                             ...
>                             local irq enable
>                             Halt, IPI context
>
> In simple terms, when Kdump jumps to the second kernel, the IPI that was
> pending in the first kernel remains pending and is handled by the second
> kernel.
>
> I was thinking that we may need to mask the IPI in clear_local_APIC() to
> solve this problem. That way, the pending IPI would be cleared at both 3)
> and 4).
>
> I can't find a solution in the SDM. Is this approach feasible, or are
> there other ways to fix the issue?
>
> Signed-off-by: Yipeng Zou <zouyipeng@...wei.com>
> ---
> arch/x86/kernel/apic/apic.c | 2 ++
> 1 file changed, 2 insertions(+)
>
> diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
> index d73ba5a7b623..68c41d579303 100644
> --- a/arch/x86/kernel/apic/apic.c
> +++ b/arch/x86/kernel/apic/apic.c
> @@ -1117,6 +1117,8 @@ void clear_local_APIC(void)
> }
> #endif
>
> + // Mask IPI here
> +
> /*
> * Clean APIC state for other OSs:
> */
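
For illustration only, the failure mode described in the quoted report can
be modelled in user space. This is a toy sketch, not kernel code: every
name prefixed with toy_ is hypothetical, the vector number is just a label,
and toy_discard_pending() is a stand-in for whatever mechanism would drop
the stale IPI. It does not claim that the local APIC offers such an
operation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_NO_VECTOR     (-1)
#define TOY_REBOOT_VECTOR 0xf8   /* arbitrary label for this model */

/* Toy per-CPU state: one pending bit per vector plus two flags. */
struct toy_cpu {
    uint8_t pending[32];   /* 256 pending bits, loosely modelled on an IRR */
    bool irqs_enabled;
    bool halted;
};

/* An IPI only marks the vector pending on the target CPU. */
static void toy_send_ipi(struct toy_cpu *target, uint8_t vector)
{
    assert(target != NULL);
    target->pending[vector / 8] |= (uint8_t)(1u << (vector % 8));
}

/* Return the highest pending vector, or TOY_NO_VECTOR if none is set. */
static int toy_highest_pending(const struct toy_cpu *cpu)
{
    for (int v = 255; v >= 0; v--)
        if (cpu->pending[v / 8] & (1u << (v % 8)))
            return v;
    return TOY_NO_VECTOR;
}

/* Enabling interrupts delivers one pending vector; the reboot handler halts. */
static void toy_local_irq_enable(struct toy_cpu *cpu)
{
    cpu->irqs_enabled = true;
    int v = toy_highest_pending(cpu);
    if (v == TOY_NO_VECTOR)
        return;
    cpu->pending[v / 8] &= (uint8_t)~(1u << (v % 8));
    if (v == TOY_REBOOT_VECTOR)
        cpu->halted = true;
}

/* Hypothetical stand-in for a fix that drops stale pending vectors. */
static void toy_discard_pending(struct toy_cpu *cpu)
{
    memset(cpu->pending, 0, sizeof(cpu->pending));
}

/* Replay the report from CPU2's side; true means the second kernel halts. */
static bool toy_second_kernel_hangs(bool discard_before_jump)
{
    struct toy_cpu cpu2 = { .irqs_enabled = false };

    /* CPU0's reboot IPI arrives while CPU2 runs with interrupts off. */
    toy_send_ipi(&cpu2, TOY_REBOOT_VECTOR);

    if (discard_before_jump)
        toy_discard_pending(&cpu2);

    /* The jump to the second kernel leaves the pending bits untouched. */
    toy_local_irq_enable(&cpu2);
    return cpu2.halted;
}
```

In this model toy_second_kernel_hangs(false) returns true, matching the
"Halt, IPI context" step in the report, while toy_second_kernel_hangs(true)
returns false.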
--
Regards,
Yipeng Zou