Message-ID: <ffcb59ff61de9b3189cf1f1cc2f331c5d0b54170.camel@intel.com>
Date: Thu, 26 Jun 2025 17:59:17 +0000
From: "Edgecombe, Rick P" <rick.p.edgecombe@...el.com>
To: "Hansen, Dave" <dave.hansen@...el.com>, "Huang, Kai"
<kai.huang@...el.com>, "bp@...en8.de" <bp@...en8.de>, "peterz@...radead.org"
<peterz@...radead.org>, "hpa@...or.com" <hpa@...or.com>, "mingo@...hat.com"
<mingo@...hat.com>, "tglx@...utronix.de" <tglx@...utronix.de>,
"thomas.lendacky@....com" <thomas.lendacky@....com>
CC: "seanjc@...gle.com" <seanjc@...gle.com>, "x86@...nel.org"
<x86@...nel.org>, "sagis@...gle.com" <sagis@...gle.com>, "Chatre, Reinette"
<reinette.chatre@...el.com>, "linux-kernel@...r.kernel.org"
<linux-kernel@...r.kernel.org>, "kirill.shutemov@...ux.intel.com"
<kirill.shutemov@...ux.intel.com>, "Williams, Dan J"
<dan.j.williams@...el.com>, "kvm@...r.kernel.org" <kvm@...r.kernel.org>,
"pbonzini@...hat.com" <pbonzini@...hat.com>, "Yamahata, Isaku"
<isaku.yamahata@...el.com>, "ashish.kalra@....com" <ashish.kalra@....com>,
"nik.borisov@...e.com" <nik.borisov@...e.com>
Subject: Re: [PATCH v3 1/6] x86/sme: Use percpu boolean to control wbinvd
during kexec
On Thu, 2025-06-26 at 22:48 +1200, Kai Huang wrote:
> On SME platforms, at hardware level dirty cachelines with and without
> encryption bit of the same memory can coexist, and the CPU can flush
> them back to memory in random order.
>
Nit: It's a bit of a run-on sentence. I know we struggled with this one. But I
think this might be easier:
  On SME platforms, dirty cacheline aliases with and without encryption bits
  can coexist, and the CPU can flush them back to memory in random order.
> During kexec, the caches must be
> flushed before jumping to the new kernel to avoid silent memory
> corruption when a cacheline with a different encryption property is
> written back over whatever encryption properties the new kernel is
> using.
>
> TDX also needs cache flush during kexec for the same reason. It would
^ a
> be good to implement a generic way to flush cache instead of scattering
^ the
> checks for each feature all around.
>
> During kexec, the kexec-ing CPU sends IPIs to all remote CPUs to stop
> them before it boots to the new kernel. For SME the kernel basically
> encrypts all memory including the kernel itself by manipulating page
> tables. A simple memory write from the kernel could dirty cachelines.
>
> Currently, the kernel uses WBINVD to flush cache for SME during kexec in
^the
> two places:
>
> 1) the one in stop_this_cpu() for all remote CPUs when the kexec-ing CPU
> stops them;
> 2) the one in the relocate_kernel() where the kexec-ing CPU jumps to the
> new kernel.
>
> Unlike SME, TDX can only dirty cachelines when it is used (i.e., when
> SEAMCALLs are performed). Since there are no more SEAMCALLs after the
> aforementioned WBINVDs, leverage this for TDX.
>
> In order to have a generic way to cover both SME and TDX, use a percpu
> boolean to indicate that the cache may be in an incoherent state and
> thus a cache flush is needed during kexec, and turn on the boolean for
> SME. TDX can then leverage it by also turning the boolean on.
>
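For anyone skimming, the mechanism boils down to roughly the below. This is a
paraphrase condensed from the actual diff further down, not new code:

	/* percpu flag: this CPU's cache may hold incoherent aliases */
	DEFINE_PER_CPU(bool, cache_state_incoherent);

	/* SME: set at CPU bringup whenever the platform *supports* SME */
	if (c->extended_cpuid_level >= 0x8000001f &&
	    (cpuid_eax(0x8000001f) & BIT(0)))
		__this_cpu_write(cache_state_incoherent, true);

	/* kexec: each WBINVD site just tests the flag */
	if (this_cpu_read(cache_state_incoherent))
		wbinvd();

The nice property is that the kexec paths no longer need to know which
feature made the cache incoherent.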
> A percpu boolean isn't strictly needed for SME since it is activated at
> very early boot time and on all CPUs. A global flag would be
> sufficient. But using a percpu variable has two benefits. Foremost,
> the state that is being tracked here (percpu cache coherency situation
> requiring a flush) is percpu, so a percpu state is a more direct and
> natural fit.
>
> Secondly, it will fit TDX's usage better since the percpu var can be set
> when a CPU makes a SEAMCALL, and cleared when another WBINVD on the CPU
> obviates the need for a kexec-time WBINVD. Saving a kexec-time WBINVD is
> valuable, because there is an existing race[*] where kexec could proceed
> while another CPU is active. WBINVD could make this race worse, so it's
> worth skipping it when possible.
>
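FWIW, the TDX usage this sets up would look something like the below sketch.
The helper name is made up and the real TDX wiring is in a later patch of
this series, so treat this as illustration only:

	/* Hypothetical sketch: flag the cache before making a SEAMCALL */
	static u64 seamcall_flag_cache(u64 fn, struct tdx_module_args *args)
	{
		/* SEAMCALLs may leave dirty cachelines with the TDX keyID */
		this_cpu_write(cache_state_incoherent, true);
		return __seamcall(fn, args);
	}

	/* Hypothetical sketch: a full WBINVD on this CPU makes the cache
	 * coherent again, so the flag can be cleared afterwards to skip
	 * the kexec-time WBINVD */
	wbinvd();
	this_cpu_write(cache_state_incoherent, false);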
> Today the first WBINVD in the stop_this_cpu() is performed when SME is
> *supported* by the platform, and the second WBINVD is done in
> relocate_kernel() when SME is *activated* by the kernel. Make things
> simple by changing to do the second WBINVD when the platform supports
> SME. This allows the kernel to simply turn on this percpu boolean when
> bringing up a CPU by checking whether the platform supports SME.
Since the race is related to stop_this_cpu(), this change doesn't affect it. But
it does mean that a CPU being brought up may not flush the cache if a kdump
kexec happens before the per-cpu var is set? Of course the existing logic
doesn't trigger until cpuinfo_x86 is populated. What is the consequence?
Arguably the supported/enabled part could be moved to a separate earlier patch.
The code change would just get immediately replaced, but the benefit would be
that a bisect would show which part of the change was responsible.
>
> No other functional change intended.
>
> Also, currently machine_kexec() has a comment to explain why no function
Nittiest of the nits: I would move this section above "No function..." and drop
the "also". This is the same old "notes" critique I always make. Just when you
think you have got all the details and start to process, it asks you to update
your model.
> call is allowed after load_segments(). After changing to use the new
> percpu boolean to control whether to perform WBINVD when calling the
> relocate_kernel(), that comment isn't needed anymore. But it is still a
> useful comment, so expand the comment around load_segments() to mention
> that due to depth tracking no function call can be made after
> load_segments().
>
> [*] The "race" in native_stop_other_cpus()
>
> Commit
>
> 1f5e7eb7868e ("x86/smp: Make stop_other_cpus() more robust")
>
> introduced a new 'cpus_stop_mask' to resolve an "intermittent lockups on
> poweroff" issue which was caused by the WBINVD in stop_this_cpu().
> Specifically, the new cpumask resolved the below problem mentioned in
> that commit:
>
> CPU 0                                   CPU 1
>
> stop_other_cpus()
>   send_IPIs(REBOOT);                    stop_this_cpu()
>   while (num_online_cpus() > 1);        set_online(false);
>   proceed... -> hang
>                                         wbinvd()
>
> While it fixed the reported issue, that commit explained the new cpumask
> "cannot plug all holes either".
>
> This is because it doesn't address the "race" mentioned in #3 of the
> comment in native_stop_other_cpus():
>
> /*
>  * 1) Send an IPI on the reboot vector to all other CPUs.
>  *
>  *    The other CPUs should react on it after leaving critical
>  *    sections and re-enabling interrupts. They might still hold
>  *    locks, but there is nothing which can be done about that.
>  *
>  * 2) Wait for all other CPUs to report that they reached the
>  *    HLT loop in stop_this_cpu()
>  *
>  * 3) If #2 timed out send an NMI to the CPUs which did not
>  *    yet report
>  *
>  * 4) Wait for all other CPUs to report that they reached the
>  *    HLT loop in stop_this_cpu()
>  *
>  * #3 can obviously race against a CPU reaching the HLT loop late.
>  * That CPU will have reported already and the "have all CPUs
>  * reached HLT" condition will be true despite the fact that the
>  * other CPU is still handling the NMI. Again, there is no
>  * protection against that as "disabled" APICs still respond to
>  * NMIs.
>  */
>
> Consider the below case:
>
> CPU 0                                   CPU 1
>
> native_stop_other_cpus()                stop_this_cpu()
>
> // sends REBOOT IPI to stop remote CPUs
>                                         ...
>                                         wbinvd();
>
> // wait times out, try NMI
> if (!cpumask_empty(&cpus_stop_mask)) {
>     for_each_cpu(cpu, &cpus_stop_mask) {
>
>                                         ...
>                                         cpumask_clear_cpu(cpu,
>                                             &cpus_stop_mask);
>                                         hlt;
>
>         // send NMI --->
>                                         wakeup from hlt and run
>                                         stop_this_cpu():
>
>         // WAIT CPUs TO STOP
>         while (!cpumask_empty(
>             &cpus_stop_mask) &&
>             ...)
>     }
>     ...
>     proceed ...                         wbinvd();
>                                         ...
>                                         hlt;
>
> The "WAIT CPUs TO STOP" is supposed to wait until all remote CPUs are
> stopped, but actually it quits immediately because the remote CPUs have
> been cleared in cpus_stop_mask when stop_this_cpu() is called from the
> REBOOT IPI.
>
> Doing WBINVD in stop_this_cpu() could potentially increase the chance
> of triggering the above "race", although it is still rare.
>
> Signed-off-by: Kai Huang <kai.huang@...el.com>
> ---
> arch/x86/include/asm/kexec.h         |  2 +-
> arch/x86/include/asm/processor.h     |  2 ++
> arch/x86/kernel/cpu/amd.c            | 16 ++++++++++++++++
> arch/x86/kernel/machine_kexec_64.c   | 15 ++++++++++-----
> arch/x86/kernel/process.c            | 16 +++-------------
> arch/x86/kernel/relocate_kernel_64.S | 15 +++++++++++----
> 6 files changed, 43 insertions(+), 23 deletions(-)
>
> diff --git a/arch/x86/include/asm/kexec.h b/arch/x86/include/asm/kexec.h
> index f2ad77929d6e..d7e93522b93d 100644
> --- a/arch/x86/include/asm/kexec.h
> +++ b/arch/x86/include/asm/kexec.h
> @@ -122,7 +122,7 @@ relocate_kernel_fn(unsigned long indirection_page,
> unsigned long pa_control_page,
> unsigned long start_address,
> unsigned int preserve_context,
> - unsigned int host_mem_enc_active);
> + unsigned int cache_incoherent);
> #endif
> extern relocate_kernel_fn relocate_kernel;
> #define ARCH_HAS_KIMAGE_ARCH
> diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
> index bde58f6510ac..a24c7805acdb 100644
> --- a/arch/x86/include/asm/processor.h
> +++ b/arch/x86/include/asm/processor.h
> @@ -731,6 +731,8 @@ void __noreturn stop_this_cpu(void *dummy);
> void microcode_check(struct cpuinfo_x86 *prev_info);
> void store_cpu_caps(struct cpuinfo_x86 *info);
>
> +DECLARE_PER_CPU(bool, cache_state_incoherent);
> +
> enum l1tf_mitigations {
> L1TF_MITIGATION_OFF,
> L1TF_MITIGATION_AUTO,
> diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
> index f18f540db58c..4c7fde344216 100644
> --- a/arch/x86/kernel/cpu/amd.c
> +++ b/arch/x86/kernel/cpu/amd.c
> @@ -503,6 +503,22 @@ static void early_detect_mem_encrypt(struct cpuinfo_x86 *c)
> {
> u64 msr;
>
> + /*
> + * Mark that WBINVD is needed during kexec on processors that
> + * support SME. This provides support for performing a successful
> + * kexec when going from SME inactive to SME active (or vice-versa).
> + *
> + * The cache must be cleared so that if there are entries with the
> + * same physical address, both with and without the encryption bit,
> + * they don't race each other when flushed and potentially end up
> + * with the wrong entry being committed to memory.
> + *
> + * Test the CPUID bit directly because the machine might've cleared
> + * X86_FEATURE_SME due to cmdline options.
> + */
> + if (c->extended_cpuid_level >= 0x8000001f && (cpuid_eax(0x8000001f) & BIT(0)))
> + __this_cpu_write(cache_state_incoherent, true);
> +
> /*
> * BIOS support is required for SME and SEV.
> * For SME: If BIOS has enabled SME then adjust x86_phys_bits by
> diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
> index 697fb99406e6..4519c7b75c49 100644
> --- a/arch/x86/kernel/machine_kexec_64.c
> +++ b/arch/x86/kernel/machine_kexec_64.c
> @@ -29,6 +29,7 @@
> #include <asm/set_memory.h>
> #include <asm/cpu.h>
> #include <asm/efi.h>
> +#include <asm/processor.h>
>
> #ifdef CONFIG_ACPI
> /*
> @@ -384,15 +385,15 @@ void __nocfi machine_kexec(struct kimage *image)
> {
> unsigned long reloc_start = (unsigned long)__relocate_kernel_start;
> relocate_kernel_fn *relocate_kernel_ptr;
> - unsigned int host_mem_enc_active;
> + unsigned int cache_incoherent;
> int save_ftrace_enabled;
> void *control_page;
>
> /*
> - * This must be done before load_segments() since if call depth tracking
> - * is used then GS must be valid to make any function calls.
> + * This must be done before load_segments(), since it resets
> + * GS to 0 and percpu data needs the correct GS to work.
> */
> - host_mem_enc_active = cc_platform_has(CC_ATTR_HOST_MEM_ENCRYPT);
> + cache_incoherent = this_cpu_read(cache_state_incoherent);
>
> #ifdef CONFIG_KEXEC_JUMP
> if (image->preserve_context)
> @@ -436,6 +437,10 @@ void __nocfi machine_kexec(struct kimage *image)
> *
> * Take advantage of this here by force loading the segments,
> * before the GDT is zapped with an invalid value.
> + *
> + * load_segments() resets GS to 0. Don't make any function call
> + * after here since call depth tracking uses percpu variables to
> + * operate (relocate_kernel() is explicitly ignored by call depth
> + * tracking).
> */
> load_segments();
>
> @@ -444,7 +449,7 @@ void __nocfi machine_kexec(struct kimage *image)
> virt_to_phys(control_page),
> image->start,
> image->preserve_context,
> - host_mem_enc_active);
> + cache_incoherent);
>
> #ifdef CONFIG_KEXEC_JUMP
> if (image->preserve_context)
> diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
> index 7b94851bb37e..6b5edfbefa9a 100644
> --- a/arch/x86/kernel/process.c
> +++ b/arch/x86/kernel/process.c
> @@ -88,6 +88,8 @@ EXPORT_PER_CPU_SYMBOL(cpu_tss_rw);
> DEFINE_PER_CPU(bool, __tss_limit_invalid);
> EXPORT_PER_CPU_SYMBOL_GPL(__tss_limit_invalid);
>
> +DEFINE_PER_CPU(bool, cache_state_incoherent);
> +
> /*
> * this gets called so that we can store lazy state into memory and copy the
> * current task into the new thread.
> @@ -827,19 +829,7 @@ void __noreturn stop_this_cpu(void *dummy)
> disable_local_APIC();
> mcheck_cpu_clear(c);
>
> - /*
> - * Use wbinvd on processors that support SME. This provides support
> - * for performing a successful kexec when going from SME inactive
> - * to SME active (or vice-versa). The cache must be cleared so that
> - * if there are entries with the same physical address, both with and
> - * without the encryption bit, they don't race each other when flushed
> - * and potentially end up with the wrong entry being committed to
> - * memory.
> - *
> - * Test the CPUID bit directly because the machine might've cleared
> - * X86_FEATURE_SME due to cmdline options.
> - */
> - if (c->extended_cpuid_level >= 0x8000001f && (cpuid_eax(0x8000001f) & BIT(0)))
> + if (this_cpu_read(cache_state_incoherent))
> wbinvd();
>
> /*
> diff --git a/arch/x86/kernel/relocate_kernel_64.S b/arch/x86/kernel/relocate_kernel_64.S
> index ea604f4d0b52..34b3a5e4fe49 100644
> --- a/arch/x86/kernel/relocate_kernel_64.S
> +++ b/arch/x86/kernel/relocate_kernel_64.S
> @@ -67,7 +67,7 @@ SYM_CODE_START_NOALIGN(relocate_kernel)
> * %rsi pa_control_page
> * %rdx start address
> * %rcx preserve_context
> - * %r8 host_mem_enc_active
> + * %r8 cache_incoherent
> */
>
> /* Save the CPU context, used for jumping back */
> @@ -129,7 +129,7 @@ SYM_CODE_START_LOCAL_NOALIGN(identity_mapped)
> /*
> * %rdi indirection page
> * %rdx start address
> - * %r8 host_mem_enc_active
> + * %r8 cache_incoherent
> * %r9 page table page
> * %r11 preserve_context
> * %r13 original CR4 when relocate_kernel() was invoked
> @@ -200,14 +200,21 @@ SYM_CODE_START_LOCAL_NOALIGN(identity_mapped)
> movq %r9, %cr3
>
> /*
> + * If the memory cache is in an incoherent state, e.g., due to
> + * memory encryption, do WBINVD to flush the cache.
> + *
> * If SME is active, there could be old encrypted cache line
> * entries that will conflict with the now unencrypted memory
> * used by kexec. Flush the caches before copying the kernel.
> + *
> + * Note SME sets this flag to true when the platform supports
> + * SME, so the WBINVD is performed even when SME is not activated
> + * by the kernel. But this does no harm.
> */
> testq %r8, %r8
> - jz .Lsme_off
> + jz .Lnowbinvd
> wbinvd
> -.Lsme_off:
> +.Lnowbinvd:
>
> call swap_pages
>