Open Source and information security mailing list archives
Message-ID: <aFw8rWwTVw85cavh@agluck-desk3>
Date: Wed, 25 Jun 2025 11:15:09 -0700
From: "Luck, Tony" <tony.luck@...el.com>
To: JP Kobryn <inwardvessel@...il.com>
Cc: bp@...en8.de, tglx@...utronix.de, mingo@...hat.com,
	dave.hansen@...ux.intel.com, hpa@...or.com, aijay@...a.com,
	x86@...nel.org, linux-edac@...r.kernel.org,
	linux-kernel@...r.kernel.org, kernel-team@...a.com
Subject: Re: [PATCH] mce: include cmci during intel feature clearing

On Tue, Jun 17, 2025 at 02:47:52PM -0700, JP Kobryn wrote:
> After a kexec on an Intel CPU, MCE reporting was found to be no longer
> active. The root cause is that ownership of CMCI banks is not cleared
> during the shutdown phase. As a result, when CPUs come back online, they
> are unable to rediscover these occupied banks. If we clear these CPU
> associations before booting into the new kernel, the CMCI banks can be
> reclaimed and MCE reporting will become functional once more.
> 
> The existing code does seem to have the intention of clearing MCE-related
> features via mcheck_cpu_clear(). During a kexec reboot, there are two
> sequences that reach a call to mcheck_cpu_clear(). They are:
> 
> 1) Stopping other (remote) CPUs via IPI:
> native_machine_shutdown()
> 	stop_other_cpus()
> 		smp_ops.stop_other_cpus(1)
> 		x86 smp: native_stop_other_cpus(1)
> 			apic_send_IPI_allbutself(REBOOT_VECTOR)
> 
> ...IPI is received on remote CPUs and the IDT entry sysvec_reboot is invoked:
> 	stop_this_cpu()
> 		mcheck_cpu_clear(this_cpu_ptr(&cpu_info))
> 
> 2) Sequence of stopping the active CPU (the one performing the kexec):
> native_machine_shutdown()
> 	stop_other_cpus()
> 		smp_ops.stop_other_cpus(1)
> 		x86 smp: native_stop_other_cpus(1)
> 			mcheck_cpu_clear(this_cpu_ptr(&cpu_info))
> 
> In both cases, the call to mcheck_cpu_clear() leads to the vendor-specific
> call to mce_intel_feature_clear():
> 
> mcheck_cpu_clear(this_cpu_ptr(&cpu_info))
> 	__mcheck_cpu_clear_vendor(c)
> 		switch (c->x86_vendor)
> 		case X86_VENDOR_INTEL:
> 			mce_intel_feature_clear(c)
> 
> Now looking at the pair of functions mce_intel_feature_{init,clear}, there
> are 3 MCE features set up on the init side:
> 
> mce_intel_feature_init(c)
> 	intel_init_cmci()
> 	intel_init_lmce()
> 	intel_imc_init(c)
> 
> On the other side in the clear function, only one of these features is
> actually cleared:
> 
> mce_intel_feature_clear(c)
> 	intel_clear_lmce()
> 
> Focusing only on the feature pertaining to the root cause of the kexec
> issue, there would be a benefit if we additionally cleared the CMCI feature
> within this routine: the banks would be free for acquisition on the boot-up
> side of a kexec. This patch adds the call to clear CMCI to this Intel
> routine.
> 
> Reported-by: Aijay Adams <aijay@...a.com>
> Signed-off-by: JP Kobryn <inwardvessel@...il.com>

LGTM

Reviewed-by: Tony Luck <tony.luck@...el.com>

> ---
>  arch/x86/kernel/cpu/mce/intel.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/arch/x86/kernel/cpu/mce/intel.c b/arch/x86/kernel/cpu/mce/intel.c
> index efcf21e9552e..9b149b9c4109 100644
> --- a/arch/x86/kernel/cpu/mce/intel.c
> +++ b/arch/x86/kernel/cpu/mce/intel.c
> @@ -478,6 +478,7 @@ void mce_intel_feature_init(struct cpuinfo_x86 *c)
>  void mce_intel_feature_clear(struct cpuinfo_x86 *c)
>  {
>  	intel_clear_lmce();
> +	cmci_clear();

I particularly like that you found a one-line fix!

>  }
>  
>  bool intel_filter_mce(struct mce *m)
> -- 
> 2.47.1
> 
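
The ownership problem described in the patch description can be illustrated
with a small userspace model. This is only a sketch under stated assumptions,
not kernel code: every name in it (hw_ctl2, owned, the model_* functions,
NUM_BANKS) is hypothetical, and it assumes that a bank counts as claimed
while an enable bit in a per-bank hardware register is set, and that this
register survives a kexec while the kernel's own bookkeeping does not.

```c
#include <stdbool.h>
#include <string.h>

#define NUM_BANKS 4           /* hypothetical bank count for the model */
#define CMCI_EN   (1u << 30)  /* models a per-bank "CMCI enabled" bit */

/* Hardware state: assumed to survive a kexec (no machine reset). */
static unsigned int hw_ctl2[NUM_BANKS];

/* Kernel bookkeeping: lost on kexec and rebuilt by discovery. */
static bool owned[NUM_BANKS];

/* Claim every bank whose enable bit is clear; return how many were claimed. */
int model_cmci_discover(void)
{
	int claimed = 0;

	for (int i = 0; i < NUM_BANKS; i++) {
		if (hw_ctl2[i] & CMCI_EN)
			continue;	/* bit already set: looks owned elsewhere */
		hw_ctl2[i] |= CMCI_EN;
		owned[i] = true;
		claimed++;
	}
	return claimed;
}

/* Release every bank this "kernel" owns by clearing its enable bit. */
void model_cmci_clear(void)
{
	for (int i = 0; i < NUM_BANKS; i++) {
		if (!owned[i])
			continue;
		hw_ctl2[i] &= ~CMCI_EN;
		owned[i] = false;
	}
}

/* Shutdown path of a kexec: optionally clear CMCI, then lose kernel state. */
void model_kexec(bool clear_cmci)
{
	if (clear_cmci)
		model_cmci_clear();
	memset(owned, 0, sizeof(owned));
}

/* Boot, kexec, boot again; return banks claimed by the second boot. */
int model_boot_after_kexec(bool clear_cmci)
{
	memset(hw_ctl2, 0, sizeof(hw_ctl2));
	memset(owned, 0, sizeof(owned));
	model_cmci_discover();		/* first kernel claims all banks */
	model_kexec(clear_cmci);
	return model_cmci_discover();	/* second kernel tries to reclaim */
}
```

In this model, model_boot_after_kexec(false) finds no claimable banks on the
second boot because the stale enable bits are still set, while
model_boot_after_kexec(true) reclaims all NUM_BANKS banks. That mirrors the
before and after behavior the patch description attributes to adding
cmci_clear() to mce_intel_feature_clear().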
