lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20250625194322.GGaFxRWqx0WbE90k6N@fat_crate.local>
Date: Wed, 25 Jun 2025 21:43:22 +0200
From: Borislav Petkov <bp@...en8.de>
To: JP Kobryn <inwardvessel@...il.com>
Cc: tony.luck@...el.com, tglx@...utronix.de, mingo@...hat.com,
	dave.hansen@...ux.intel.com, hpa@...or.com, aijay@...a.com,
	x86@...nel.org, linux-edac@...r.kernel.org,
	linux-kernel@...r.kernel.org, kernel-team@...a.com
Subject: Re: [PATCH] mce: include cmci during intel feature clearing

On Tue, Jun 17, 2025 at 02:47:52PM -0700, JP Kobryn wrote:
> It was found that after a kexec on an intel CPU, MCE reporting was no
> longer active. The root cause has been found to be that ownership of CMCI
> banks is not cleared during the shutdown phase. As a result, when CPU's
> come back online, they are unable to rediscover these occupied banks. If we
> clear these CPU associations before booting into the new kernel, the CMCI
> banks can be reclaimed and MCE reporting will become functional once more.
> 
> The existing code does seem to have the intention of clearing MCE-related
> features via mcheck_cpu_clear(). During a kexec reboot, there are two
> sequences that reach a call to mcheck_cpu_clear(). They are:
> 
> 1) Stopping other (remote) CPU's via IPI:
> native_machine_shutdown()
> 	stop_other_cpus()
> 		smp_ops.stop_other_cpus(1)
> 		x86 smp: native_stop_other_cpus(1)
> 			apic_send_IPI_allbutself(REBOOT_VECTOR)
> 
> ...IPI is received on remote CPU's and IDT sysvec_reboot invoked:
> 	stop_this_cpu()
> 		mcheck_cpu_clear(this_ptr_cpu(&cpu_info))
> 
> 2) Seqence of stopping the active CPU (the one performing the kexec):
> native_machine_shutdown()
> 	stop_other_cpus()
> 		smp_ops.stop_other_cpus(1)
> 		x86 smp: native_stop_other_cpus(1)
> 			mcheck_cpu_clear(this_ptr_cpu(&cpu_info))
> 
> In both cases, the call to mcheck_cpu_clear() leads to the vendor specific
> call to intel_feature_clear():
> 
> mcheck_cpu_clear(this_ptr_cpu(&cpu_info))
> 	__mcheck_cpu_clear_vendor(c)
> 		switch (c->x86_vendor)
> 		case X86_VENDOR_INTEL:
> 			mce_intel_feature_clear(c)
> 
> Now looking at the pair of functions mce_intel_feature_{init,clear}, there
> are 3 MCE features setup on the init side:
> 
> mce_intel_feature_init(c)
> 	intel_init_cmci()
> 	intel_init_lmce()
> 	intel_imc_init(c)
> 
> On the other side in the clear function, only one of these features is
> actually cleared:
> 
> mce_intel_feature_clear(c)
> 	intel_clear_lmce()
> 
> Just focusing on the feature pertaining to the root cause of the kexec
> issue, there would be a benefit if we additionally cleared the CMCI feature
> within this routine - the banks would be free for acquisition on the boot
> up side of a kexec. This patch adds the call to clear CMCI to this intel
> routine.

Please:

 - shorten this commit message - there really is no need to explain in such
   detail that mcheck_cpu_clear() has simply forgotten to clear CMCI banks
   too.

 - run it through a spellchecker

 - drop all personal pronouns

 - write it in imperative tone

Some hints:

Section "2) Describe your changes" in
Documentation/process/submitting-patches.rst for more details.

Also, see section "Changelog" in
Documentation/process/maintainer-tip.rst

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ