Message-ID: <20250625194322.GGaFxRWqx0WbE90k6N@fat_crate.local>
Date: Wed, 25 Jun 2025 21:43:22 +0200
From: Borislav Petkov <bp@...en8.de>
To: JP Kobryn <inwardvessel@...il.com>
Cc: tony.luck@...el.com, tglx@...utronix.de, mingo@...hat.com,
dave.hansen@...ux.intel.com, hpa@...or.com, aijay@...a.com,
x86@...nel.org, linux-edac@...r.kernel.org,
linux-kernel@...r.kernel.org, kernel-team@...a.com
Subject: Re: [PATCH] mce: include cmci during intel feature clearing

On Tue, Jun 17, 2025 at 02:47:52PM -0700, JP Kobryn wrote:
> It was found that after a kexec on an intel CPU, MCE reporting was no
> longer active. The root cause has been found to be that ownership of CMCI
> banks is not cleared during the shutdown phase. As a result, when CPU's
> come back online, they are unable to rediscover these occupied banks. If we
> clear these CPU associations before booting into the new kernel, the CMCI
> banks can be reclaimed and MCE reporting will become functional once more.
>
> The existing code does seem to have the intention of clearing MCE-related
> features via mcheck_cpu_clear(). During a kexec reboot, there are two
> sequences that reach a call to mcheck_cpu_clear(). They are:
>
> 1) Stopping other (remote) CPU's via IPI:
> native_machine_shutdown()
> stop_other_cpus()
> smp_ops.stop_other_cpus(1)
> x86 smp: native_stop_other_cpus(1)
> apic_send_IPI_allbutself(REBOOT_VECTOR)
>
> ...IPI is received on remote CPU's and IDT sysvec_reboot invoked:
> stop_this_cpu()
> mcheck_cpu_clear(this_ptr_cpu(&cpu_info))
>
> 2) Seqence of stopping the active CPU (the one performing the kexec):
> native_machine_shutdown()
> stop_other_cpus()
> smp_ops.stop_other_cpus(1)
> x86 smp: native_stop_other_cpus(1)
> mcheck_cpu_clear(this_ptr_cpu(&cpu_info))
>
> In both cases, the call to mcheck_cpu_clear() leads to the vendor specific
> call to intel_feature_clear():
>
> mcheck_cpu_clear(this_ptr_cpu(&cpu_info))
> __mcheck_cpu_clear_vendor(c)
> switch (c->x86_vendor)
> case X86_VENDOR_INTEL:
> mce_intel_feature_clear(c)
>
> Now looking at the pair of functions mce_intel_feature_{init,clear}, there
> are 3 MCE features setup on the init side:
>
> mce_intel_feature_init(c)
> intel_init_cmci()
> intel_init_lmce()
> intel_imc_init(c)
>
> On the other side in the clear function, only one of these features is
> actually cleared:
>
> mce_intel_feature_clear(c)
> intel_clear_lmce()
>
> Just focusing on the feature pertaining to the root cause of the kexec
> issue, there would be a benefit if we additionally cleared the CMCI feature
> within this routine - the banks would be free for acquisition on the boot
> up side of a kexec. This patch adds the call to clear CMCI to this intel
> routine.
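
For illustration only, the failure mode described in the quoted text can be modeled in plain userspace C. This is a toy sketch, not kernel code: it assumes bank ownership is signalled by an enable bit that lives in hardware and therefore survives a kexec, while each kernel's own bookkeeping starts empty. All names here (bank_ctl2, discover_banks, clear_banks, simulate_kexec) are hypothetical and do not correspond to the actual functions in the MCE code.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define NUM_BANKS 4
#define CMCI_EN (1u << 30) /* assumed: models a per-bank "CMCI enabled" bit */

/* Modeled hardware state: persists across kexec (no machine reset). */
unsigned int bank_ctl2[NUM_BANKS];

/* Modeled per-kernel state: which banks this kernel instance owns. */
struct kernel_state {
    bool owned[NUM_BANKS];
};

/* Claim every bank whose enable bit is clear; return how many were claimed. */
int discover_banks(struct kernel_state *k)
{
    int claimed = 0;

    for (int i = 0; i < NUM_BANKS; i++) {
        if (bank_ctl2[i] & CMCI_EN)
            continue; /* bit already set: treated as owned by someone else */
        bank_ctl2[i] |= CMCI_EN;
        k->owned[i] = true;
        claimed++;
    }
    return claimed;
}

/* Release every bank this kernel owns by clearing its enable bit. */
void clear_banks(struct kernel_state *k)
{
    for (int i = 0; i < NUM_BANKS; i++) {
        if (!k->owned[i])
            continue;
        bank_ctl2[i] &= ~CMCI_EN;
        k->owned[i] = false;
    }
}

/* Return the number of banks the new kernel manages to claim after kexec. */
int simulate_kexec(bool clear_on_shutdown)
{
    struct kernel_state old_kernel = {0};
    struct kernel_state new_kernel = {0};

    memset(bank_ctl2, 0, sizeof(bank_ctl2)); /* cold boot: no bank is owned */

    int first = discover_banks(&old_kernel);
    assert(first == NUM_BANKS); /* on a cold boot every bank is free */

    if (clear_on_shutdown)
        clear_banks(&old_kernel);

    /* The new kernel starts with empty bookkeeping but the same hardware. */
    return discover_banks(&new_kernel);
}
```

In this model, simulate_kexec(false) leaves the new kernel with no banks because every enable bit is still set, while simulate_kexec(true) lets it reclaim all of them. That is the asymmetry the patch addresses by clearing CMCI on the shutdown path.
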

Please:

- shorten this commit message - there really is no need to explain in such
  detail that mcheck_cpu_clear() has simply forgotten to clear CMCI banks
  too.
- run it through a spellchecker
- drop all personal pronouns
- write it in imperative tone

Some hints: see section "2) Describe your changes" in
Documentation/process/submitting-patches.rst for more details.

Also, see section "Changelog" in
Documentation/process/maintainer-tip.rst

Thx.

--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette