Message-ID: <aFw8rWwTVw85cavh@agluck-desk3>
Date: Wed, 25 Jun 2025 11:15:09 -0700
From: "Luck, Tony" <tony.luck@...el.com>
To: JP Kobryn <inwardvessel@...il.com>
Cc: bp@...en8.de, tglx@...utronix.de, mingo@...hat.com,
dave.hansen@...ux.intel.com, hpa@...or.com, aijay@...a.com,
x86@...nel.org, linux-edac@...r.kernel.org,
linux-kernel@...r.kernel.org, kernel-team@...a.com
Subject: Re: [PATCH] mce: include cmci during intel feature clearing

On Tue, Jun 17, 2025 at 02:47:52PM -0700, JP Kobryn wrote:
> It was found that after a kexec on an Intel CPU, MCE reporting was no
> longer active. The root cause is that ownership of CMCI banks is not
> cleared during the shutdown phase. As a result, when CPUs come back
> online, they are unable to rediscover these occupied banks. If we
> clear these CPU associations before booting into the new kernel, the CMCI
> banks can be reclaimed and MCE reporting will become functional once more.
>
> The existing code does seem to have the intention of clearing MCE-related
> features via mcheck_cpu_clear(). During a kexec reboot, there are two
> sequences that reach a call to mcheck_cpu_clear(). They are:
>
> 1) Stopping other (remote) CPUs via IPI:
>    native_machine_shutdown()
>      stop_other_cpus()
>        smp_ops.stop_other_cpus(1)
>          x86 smp: native_stop_other_cpus(1)
>            apic_send_IPI_allbutself(REBOOT_VECTOR)
>
>    ...IPI is received on remote CPUs and IDT sysvec_reboot invoked:
>      stop_this_cpu()
>        mcheck_cpu_clear(this_cpu_ptr(&cpu_info))
>
> 2) Sequence of stopping the active CPU (the one performing the kexec):
>    native_machine_shutdown()
>      stop_other_cpus()
>        smp_ops.stop_other_cpus(1)
>          x86 smp: native_stop_other_cpus(1)
>            mcheck_cpu_clear(this_cpu_ptr(&cpu_info))
>
> In both cases, the call to mcheck_cpu_clear() leads to the vendor-specific
> call to mce_intel_feature_clear():
>
> mcheck_cpu_clear(this_cpu_ptr(&cpu_info))
>   __mcheck_cpu_clear_vendor(c)
>     switch (c->x86_vendor)
>     case X86_VENDOR_INTEL:
>       mce_intel_feature_clear(c)
>
> Now looking at the pair of functions mce_intel_feature_{init,clear}, there
> are three MCE features set up on the init side:
>
> mce_intel_feature_init(c)
>   intel_init_cmci()
>   intel_init_lmce()
>   intel_imc_init(c)
>
> On the other side in the clear function, only one of these features is
> actually cleared:
>
> mce_intel_feature_clear(c)
>   intel_clear_lmce()
>
> Focusing only on the feature behind the root cause of the kexec issue,
> it would help to also clear the CMCI feature within this routine: the
> banks would then be free for acquisition on the boot-up side of a kexec.
> This patch adds the call to clear CMCI to this Intel routine.
>
> Reported-by: Aijay Adams <aijay@...a.com>
> Signed-off-by: JP Kobryn <inwardvessel@...il.com>

LGTM

Reviewed-by: Tony Luck <tony.luck@...el.com>

> ---
> arch/x86/kernel/cpu/mce/intel.c | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/arch/x86/kernel/cpu/mce/intel.c b/arch/x86/kernel/cpu/mce/intel.c
> index efcf21e9552e..9b149b9c4109 100644
> --- a/arch/x86/kernel/cpu/mce/intel.c
> +++ b/arch/x86/kernel/cpu/mce/intel.c
> @@ -478,6 +478,7 @@ void mce_intel_feature_init(struct cpuinfo_x86 *c)
>  void mce_intel_feature_clear(struct cpuinfo_x86 *c)
>  {
>  	intel_clear_lmce();
> +	cmci_clear();

I particularly like that you found a one-line fix!

>  }
>
>  bool intel_filter_mce(struct mce *m)
> --
> 2.47.1
>
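
The bank-ownership problem described in the commit message can be
illustrated with a small user-space model. This is only a sketch under
stated assumptions, not the kernel's implementation: hw_cmci_en[],
owned[][], discover_banks(), clear_banks(), boot_and_count(),
simulate_kexec() and power_cycle() are all hypothetical names invented
for the illustration. It assumes that a per-bank enable bit in hardware
survives a kexec, that a CPU skips any bank whose bit is already set,
and that the kernel's own ownership bookkeeping is lost when the new
kernel starts.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define NR_BANKS 4
#define NR_CPUS  2

/* Hypothetical model of the per-bank CMCI enable bit held in hardware.
 * Assumption: it is not reset by a kexec. */
static bool hw_cmci_en[NR_BANKS];

/* Hypothetical model of the kernel's per-CPU ownership bookkeeping.
 * It lives in kernel memory, so a new kernel starts with it empty. */
static bool owned[NR_CPUS][NR_BANKS];

/* Claim every bank whose enable bit is still clear; return how many. */
static int discover_banks(int cpu)
{
	int claimed = 0;

	assert(cpu >= 0 && cpu < NR_CPUS);
	for (int b = 0; b < NR_BANKS; b++) {
		if (hw_cmci_en[b])
			continue;	/* bit already set: treated as owned elsewhere */
		hw_cmci_en[b] = true;
		owned[cpu][b] = true;
		claimed++;
	}
	return claimed;
}

/* Release the banks this CPU owns (the role cmci_clear() plays in the patch). */
static void clear_banks(int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	for (int b = 0; b < NR_BANKS; b++) {
		if (!owned[cpu][b])
			continue;
		hw_cmci_en[b] = false;
		owned[cpu][b] = false;
	}
}

/* Bring every CPU online and count the banks claimed in total. */
static int boot_and_count(void)
{
	int total = 0;

	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		total += discover_banks(cpu);
	return total;
}

/* Shut down and start a "new kernel": bookkeeping is lost, hardware bits stay. */
static void simulate_kexec(bool clear_first)
{
	if (clear_first)
		for (int cpu = 0; cpu < NR_CPUS; cpu++)
			clear_banks(cpu);
	memset(owned, 0, sizeof(owned));
}

/* Full reset of the model, including the hardware bits. */
static void power_cycle(void)
{
	memset(hw_cmci_en, 0, sizeof(hw_cmci_en));
	memset(owned, 0, sizeof(owned));
}
```

In this model, calling boot_and_count() after simulate_kexec(false)
claims no banks, because every enable bit is still set from the previous
boot. After simulate_kexec(true) all NR_BANKS banks can be claimed
again, which mirrors the effect the patch is after.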