Message-ID: <ffcb59ff61de9b3189cf1f1cc2f331c5d0b54170.camel@intel.com>
Date: Thu, 26 Jun 2025 17:59:17 +0000
From: "Edgecombe, Rick P" <rick.p.edgecombe@...el.com>
To: "Hansen, Dave" <dave.hansen@...el.com>, "Huang, Kai"
<kai.huang@...el.com>, "bp@...en8.de" <bp@...en8.de>, "peterz@...radead.org"
<peterz@...radead.org>, "hpa@...or.com" <hpa@...or.com>, "mingo@...hat.com"
<mingo@...hat.com>, "tglx@...utronix.de" <tglx@...utronix.de>,
"thomas.lendacky@....com" <thomas.lendacky@....com>
CC: "seanjc@...gle.com" <seanjc@...gle.com>, "x86@...nel.org"
<x86@...nel.org>, "sagis@...gle.com" <sagis@...gle.com>, "Chatre, Reinette"
<reinette.chatre@...el.com>, "linux-kernel@...r.kernel.org"
<linux-kernel@...r.kernel.org>, "kirill.shutemov@...ux.intel.com"
<kirill.shutemov@...ux.intel.com>, "Williams, Dan J"
<dan.j.williams@...el.com>, "kvm@...r.kernel.org" <kvm@...r.kernel.org>,
"pbonzini@...hat.com" <pbonzini@...hat.com>, "Yamahata, Isaku"
<isaku.yamahata@...el.com>, "ashish.kalra@....com" <ashish.kalra@....com>,
"nik.borisov@...e.com" <nik.borisov@...e.com>
Subject: Re: [PATCH v3 1/6] x86/sme: Use percpu boolean to control wbinvd
during kexec
On Thu, 2025-06-26 at 22:48 +1200, Kai Huang wrote:
> On SME platforms, at hardware level dirty cachelines with and without
> encryption bit of the same memory can coexist, and the CPU can flush
> them back to memory in random order.
>
Nit: It's a bit of a run-on sentence. I know we struggled with this one. But I
think this might be easier:
  On SME platforms, dirty cacheline aliases with and without encryption bits
  can coexist, and the CPU can flush them back to memory in random order.
> During kexec, the caches must be
> flushed before jumping to the new kernel to avoid silent memory
> corruption when a cacheline with a different encryption property is
> written back over whatever encryption properties the new kernel is
> using.
>
> TDX also needs cache flush during kexec for the same reason. It would
^ a
> be good to implement a generic way to flush cache instead of scattering
^ the
> checks for each feature all around.
>
> During kexec, the kexec-ing CPU sends IPIs to all remote CPUs to stop
> them before it boots to the new kernel. For SME the kernel basically
> encrypts all memory including the kernel itself by manipulating page
> tables. A simple memory write from the kernel could dirty cachelines.
>
> Currently, the kernel uses WBINVD to flush cache for SME during kexec in
^the
> two places:
>
> 1) the one in stop_this_cpu() for all remote CPUs when the kexec-ing CPU
> stops them;
> 2) the one in the relocate_kernel() where the kexec-ing CPU jumps to the
> new kernel.
>
> Unlike SME, TDX can only dirty cachelines when it is used (i.e., when
> SEAMCALLs are performed). Since there are no more SEAMCALLs after the
> aforementioned WBINVDs, leverage this for TDX.
>
> In order to have a generic way to cover both SME and TDX, use a percpu
> boolean to indicate that the cache may be in an incoherent state and
> thus a cache flush is needed during kexec, and turn on the boolean for
> SME. TDX can then leverage it by also turning the boolean on.
>
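For anyone skimming, the mechanism boils down to roughly the below. This is a
paraphrase condensed from the actual diff further down, not new code:

	/* percpu flag: this CPU's cache may hold incoherent aliases */
	DEFINE_PER_CPU(bool, cache_state_incoherent);

	/* SME: set at CPU bringup whenever the platform *supports* SME */
	if (c->extended_cpuid_level >= 0x8000001f &&
	    (cpuid_eax(0x8000001f) & BIT(0)))
		__this_cpu_write(cache_state_incoherent, true);

	/* kexec: each WBINVD site just tests the flag */
	if (this_cpu_read(cache_state_incoherent))
		wbinvd();

The nice property is that the kexec paths no longer need to know which
feature made the cache incoherent.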
> A percpu boolean isn't strictly needed for SME since it is activated at
> very early boot time and on all CPUs. A global flag would be
> sufficient. But using a percpu variable has two benefits. Foremost,
> the state that is being tracked here (percpu cache coherency situation
> requiring a flush) is percpu, so a percpu state is a more direct and
> natural fit.
>
> Secondly, it will fit TDX's usage better since the percpu var can be set
> when a CPU makes a SEAMCALL, and cleared when another WBINVD on the CPU
> obviates the need for a kexec-time WBINVD. Saving a kexec-time WBINVD is
> valuable, because there is an existing race[*] where kexec could proceed
> while another CPU is active. WBINVD could make this race worse, so it's
> worth skipping it when possible.
>
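FWIW, the TDX usage this sets up would look something like the below sketch.
The helper name is made up and the real TDX wiring is in a later patch of
this series, so treat this as illustration only:

	/* Hypothetical sketch: flag the cache before making a SEAMCALL */
	static u64 seamcall_flag_cache(u64 fn, struct tdx_module_args *args)
	{
		/* SEAMCALLs may leave dirty cachelines with the TDX keyID */
		this_cpu_write(cache_state_incoherent, true);
		return __seamcall(fn, args);
	}

	/* Hypothetical sketch: a full WBINVD on this CPU makes the cache
	 * coherent again, so the flag can be cleared afterwards to skip
	 * the kexec-time WBINVD */
	wbinvd();
	this_cpu_write(cache_state_incoherent, false);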
> Today the first WBINVD in the stop_this_cpu() is performed when SME is
> *supported* by the platform, and the second WBINVD is done in
> relocate_kernel() when SME is *activated* by the kernel. Make things
> simple by changing to do the second WBINVD when the platform supports
> SME. This allows the kernel to simply turn on this percpu boolean when
> bringing up a CPU by checking whether the platform supports SME.
Since the race is related to stop_this_cpu(), this change doesn't affect it. But
it does mean that a CPU being brought up may not flush the cache if a kdump
kexec happens before the per-cpu var is set? Of course the existing logic
doesn't trigger until cpuinfo_x86 is populated. What is the consequence?
Arguably the supported/enabled part could be moved to a separate earlier patch.
The code change would just get immediately replaced, but the benefit would be
that a bisect would show which part of the change was responsible.
>
> No other functional change intended.
>
> Also, currently machine_kexec() has a comment to explain why no function
Nittiest of the nits: I would move this section above "No function..." and drop
the "also". This is the same old "notes" critique I always make. Just when you
think you have got all the details and start to process, it asks you to update
your model.
> call is allowed after load_segments(). After changing to use the new
> percpu boolean to control whether to perform WBINVD when calling the
> relocate_kernel(), that comment isn't needed anymore. But it is still a
> useful comment, so expand the comment around load_segments() to mention
> that due to depth tracking no function call can be made after
> load_segments().
>
> [*] The "race" in native_stop_other_cpus()
>
> Commit
>
> 1f5e7eb7868e ("x86/smp: Make stop_other_cpus() more robust")
>
> introduced a new 'cpus_stop_mask' to resolve an "intermittent lockups on
> poweroff" issue which was caused by the WBINVD in stop_this_cpu().
> Specifically, the new cpumask resolved the below problem mentioned in
> that commit:
>
> CPU 0                                   CPU 1
>
> stop_other_cpus()
>   send_IPIs(REBOOT);                    stop_this_cpu()
>   while (num_online_cpus() > 1);        set_online(false);
>   proceed... -> hang
>                                         wbinvd()
>
> While it fixed the reported issue, that commit explained the new cpumask
> "cannot plug all holes either".
>
> This is because it doesn't address the "race" mentioned in #3 of the
> comment in native_stop_other_cpus():
>
> /*
>  * 1) Send an IPI on the reboot vector to all other CPUs.
>  *
>  *    The other CPUs should react on it after leaving critical
>  *    sections and re-enabling interrupts. They might still hold
>  *    locks, but there is nothing which can be done about that.
>  *
>  * 2) Wait for all other CPUs to report that they reached the
>  *    HLT loop in stop_this_cpu()
>  *
>  * 3) If #2 timed out send an NMI to the CPUs which did not
>  *    yet report
>  *
>  * 4) Wait for all other CPUs to report that they reached the
>  *    HLT loop in stop_this_cpu()
>  *
>  * #3 can obviously race against a CPU reaching the HLT loop late.
>  * That CPU will have reported already and the "have all CPUs
>  * reached HLT" condition will be true despite the fact that the
>  * other CPU is still handling the NMI. Again, there is no
>  * protection against that as "disabled" APICs still respond to
>  * NMIs.
>  */
>
> Consider the below case:
>
> CPU 0                                   CPU 1
>
> native_stop_other_cpus()                stop_this_cpu()
>
> // sends REBOOT IPI to stop remote CPUs
>                                         ...
>                                         wbinvd();
>
> // wait times out, try NMI
> if (!cpumask_empty(&cpus_stop_mask)) {
>     for_each_cpu(cpu, &cpus_stop_mask) {
>
>                                         ...
>                                         cpumask_clear_cpu(cpu,
>                                             &cpus_stop_mask);
>                                         hlt;
>
>         // send NMI --->
>                                         wakeup from hlt and run
>                                         stop_this_cpu():
>
>         // WAIT CPUs TO STOP
>         while (!cpumask_empty(
>             &cpus_stop_mask) &&
>             ...)
>     }
>     ...
>     proceed ...                         wbinvd();
>                                         ...
>                                         hlt;
>
> The "WAIT CPUs TO STOP" is supposed to wait until all remote CPUs are
> stopped, but actually it quits immediately because the remote CPUs have
> been cleared in cpus_stop_mask when stop_this_cpu() is called from the
> REBOOT IPI.
>
> Doing WBINVD in stop_this_cpu() could potentially increase the chance
> of triggering the above "race", although it is still rare.
>
> Signed-off-by: Kai Huang <kai.huang@...el.com>
> ---
> arch/x86/include/asm/kexec.h         |  2 +-
> arch/x86/include/asm/processor.h     |  2 ++
> arch/x86/kernel/cpu/amd.c            | 16 ++++++++++++++++
> arch/x86/kernel/machine_kexec_64.c   | 15 ++++++++++-----
> arch/x86/kernel/process.c            | 16 +++-------------
> arch/x86/kernel/relocate_kernel_64.S | 15 +++++++++++----
> 6 files changed, 43 insertions(+), 23 deletions(-)
>
> diff --git a/arch/x86/include/asm/kexec.h b/arch/x86/include/asm/kexec.h
> index f2ad77929d6e..d7e93522b93d 100644
> --- a/arch/x86/include/asm/kexec.h
> +++ b/arch/x86/include/asm/kexec.h
> @@ -122,7 +122,7 @@ relocate_kernel_fn(unsigned long indirection_page,
> unsigned long pa_control_page,
> unsigned long start_address,
> unsigned int preserve_context,
> - unsigned int host_mem_enc_active);
> + unsigned int cache_incoherent);
> #endif
> extern relocate_kernel_fn relocate_kernel;
> #define ARCH_HAS_KIMAGE_ARCH
> diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
> index bde58f6510ac..a24c7805acdb 100644
> --- a/arch/x86/include/asm/processor.h
> +++ b/arch/x86/include/asm/processor.h
> @@ -731,6 +731,8 @@ void __noreturn stop_this_cpu(void *dummy);
> void microcode_check(struct cpuinfo_x86 *prev_info);
> void store_cpu_caps(struct cpuinfo_x86 *info);
>
> +DECLARE_PER_CPU(bool, cache_state_incoherent);
> +
> enum l1tf_mitigations {
> L1TF_MITIGATION_OFF,
> L1TF_MITIGATION_AUTO,
> diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
> index f18f540db58c..4c7fde344216 100644
> --- a/arch/x86/kernel/cpu/amd.c
> +++ b/arch/x86/kernel/cpu/amd.c
> @@ -503,6 +503,22 @@ static void early_detect_mem_encrypt(struct cpuinfo_x86 *c)
> {
> u64 msr;
>
> + /*
> + * Mark that WBINVD is needed during kexec on processors that
> + * support SME. This provides support for performing a successful
> + * kexec when going from SME inactive to SME active (or vice-versa).
> + *
> + * The cache must be cleared so that if there are entries with the
> + * same physical address, both with and without the encryption bit,
> + * they don't race each other when flushed and potentially end up
> + * with the wrong entry being committed to memory.
> + *
> + * Test the CPUID bit directly because the machine might've cleared
> + * X86_FEATURE_SME due to cmdline options.
> + */
> + if (c->extended_cpuid_level >= 0x8000001f && (cpuid_eax(0x8000001f) & BIT(0)))
> + __this_cpu_write(cache_state_incoherent, true);
> +
> /*
> * BIOS support is required for SME and SEV.
> * For SME: If BIOS has enabled SME then adjust x86_phys_bits by
> diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
> index 697fb99406e6..4519c7b75c49 100644
> --- a/arch/x86/kernel/machine_kexec_64.c
> +++ b/arch/x86/kernel/machine_kexec_64.c
> @@ -29,6 +29,7 @@
> #include <asm/set_memory.h>
> #include <asm/cpu.h>
> #include <asm/efi.h>
> +#include <asm/processor.h>
>
> #ifdef CONFIG_ACPI
> /*
> @@ -384,15 +385,15 @@ void __nocfi machine_kexec(struct kimage *image)
> {
> unsigned long reloc_start = (unsigned long)__relocate_kernel_start;
> relocate_kernel_fn *relocate_kernel_ptr;
> - unsigned int host_mem_enc_active;
> + unsigned int cache_incoherent;
> int save_ftrace_enabled;
> void *control_page;
>
> /*
> - * This must be done before load_segments() since if call depth tracking
> - * is used then GS must be valid to make any function calls.
> + * This must be done before load_segments(), since it resets
> + * GS to 0 and percpu data needs the correct GS to work.
> */
> - host_mem_enc_active = cc_platform_has(CC_ATTR_HOST_MEM_ENCRYPT);
> + cache_incoherent = this_cpu_read(cache_state_incoherent);
>
> #ifdef CONFIG_KEXEC_JUMP
> if (image->preserve_context)
> @@ -436,6 +437,10 @@ void __nocfi machine_kexec(struct kimage *image)
> *
> * Take advantage of this here by force loading the segments,
> * before the GDT is zapped with an invalid value.
> + *
> + * load_segments() resets GS to 0. Don't make any function call
> + * after here since call depth tracking uses percpu variables to
> + * operate (relocate_kernel() is explicitly ignored by call depth
> + * tracking).
> */
> load_segments();
>
> @@ -444,7 +449,7 @@ void __nocfi machine_kexec(struct kimage *image)
> virt_to_phys(control_page),
> image->start,
> image->preserve_context,
> - host_mem_enc_active);
> + cache_incoherent);
>
> #ifdef CONFIG_KEXEC_JUMP
> if (image->preserve_context)
> diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
> index 7b94851bb37e..6b5edfbefa9a 100644
> --- a/arch/x86/kernel/process.c
> +++ b/arch/x86/kernel/process.c
> @@ -88,6 +88,8 @@ EXPORT_PER_CPU_SYMBOL(cpu_tss_rw);
> DEFINE_PER_CPU(bool, __tss_limit_invalid);
> EXPORT_PER_CPU_SYMBOL_GPL(__tss_limit_invalid);
>
> +DEFINE_PER_CPU(bool, cache_state_incoherent);
> +
> /*
> * this gets called so that we can store lazy state into memory and copy the
> * current task into the new thread.
> @@ -827,19 +829,7 @@ void __noreturn stop_this_cpu(void *dummy)
> disable_local_APIC();
> mcheck_cpu_clear(c);
>
> - /*
> - * Use wbinvd on processors that support SME. This provides support
> - * for performing a successful kexec when going from SME inactive
> - * to SME active (or vice-versa). The cache must be cleared so that
> - * if there are entries with the same physical address, both with and
> - * without the encryption bit, they don't race each other when flushed
> - * and potentially end up with the wrong entry being committed to
> - * memory.
> - *
> - * Test the CPUID bit directly because the machine might've cleared
> - * X86_FEATURE_SME due to cmdline options.
> - */
> - if (c->extended_cpuid_level >= 0x8000001f && (cpuid_eax(0x8000001f) & BIT(0)))
> + if (this_cpu_read(cache_state_incoherent))
> wbinvd();
>
> /*
> diff --git a/arch/x86/kernel/relocate_kernel_64.S b/arch/x86/kernel/relocate_kernel_64.S
> index ea604f4d0b52..34b3a5e4fe49 100644
> --- a/arch/x86/kernel/relocate_kernel_64.S
> +++ b/arch/x86/kernel/relocate_kernel_64.S
> @@ -67,7 +67,7 @@ SYM_CODE_START_NOALIGN(relocate_kernel)
> * %rsi pa_control_page
> * %rdx start address
> * %rcx preserve_context
> - * %r8 host_mem_enc_active
> + * %r8 cache_incoherent
> */
>
> /* Save the CPU context, used for jumping back */
> @@ -129,7 +129,7 @@ SYM_CODE_START_LOCAL_NOALIGN(identity_mapped)
> /*
> * %rdi indirection page
> * %rdx start address
> - * %r8 host_mem_enc_active
> + * %r8 cache_incoherent
> * %r9 page table page
> * %r11 preserve_context
> * %r13 original CR4 when relocate_kernel() was invoked
> @@ -200,14 +200,21 @@ SYM_CODE_START_LOCAL_NOALIGN(identity_mapped)
> movq %r9, %cr3
>
> /*
> + * If the memory cache is in an incoherent state, e.g., due to
> + * memory encryption, do WBINVD to flush the cache.
> + *
> * If SME is active, there could be old encrypted cache line
> * entries that will conflict with the now unencrypted memory
> * used by kexec. Flush the caches before copying the kernel.
> + *
> + * Note SME sets this flag to true when the platform supports
> + * SME, so the WBINVD is performed even when SME is not activated
> + * by the kernel. But this does no harm.
> */
> testq %r8, %r8
> - jz .Lsme_off
> + jz .Lnowbinvd
> wbinvd
> -.Lsme_off:
> +.Lnowbinvd:
>
> call swap_pages
>