Message-ID: <9a9380b55e1d01c650456e56be0949b531d88af5.camel@intel.com>
Date: Wed, 7 May 2025 21:03:13 +0000
From: "Edgecombe, Rick P" <rick.p.edgecombe@...el.com>
To: "tglx@...utronix.de" <tglx@...utronix.de>, "peterz@...radead.org"
	<peterz@...radead.org>, "mingo@...hat.com" <mingo@...hat.com>, "Hansen, Dave"
	<dave.hansen@...el.com>, "Huang, Kai" <kai.huang@...el.com>, "bp@...en8.de"
	<bp@...en8.de>
CC: "seanjc@...gle.com" <seanjc@...gle.com>, "x86@...nel.org"
	<x86@...nel.org>, "sagis@...gle.com" <sagis@...gle.com>, "hpa@...or.com"
	<hpa@...or.com>, "Chatre, Reinette" <reinette.chatre@...el.com>,
	"kirill.shutemov@...ux.intel.com" <kirill.shutemov@...ux.intel.com>,
	"Williams, Dan J" <dan.j.williams@...el.com>, "ashish.kalra@....com"
	<ashish.kalra@....com>, "pbonzini@...hat.com" <pbonzini@...hat.com>,
	"Yamahata, Isaku" <isaku.yamahata@...el.com>, "linux-kernel@...r.kernel.org"
	<linux-kernel@...r.kernel.org>, "thomas.lendacky@....com"
	<thomas.lendacky@....com>, "nik.borisov@...e.com" <nik.borisov@...e.com>
Subject: Re: [PATCH] x86/virt/tdx: Make TDX and kexec mutually exclusive at
 runtime

On Thu, 2025-04-17 at 11:56 -0700, Dave Hansen wrote:
> I get it. Adding WBINVD to this same path caused some pain before. But
> just turning off the feature that calls this path seems like overkill.
> 
> How about we try to push WBINVD out of this path? It should be quite
> doable for TDX, I think.
> 
> Let's say we had a percpu bool. It get set when SME is enabled on the
> system on each CPU. It also gets enabled when TDX is enabled. The kexec
> code becomes:
> 
> -	if (SME)
> +	if (per_cpu(newbool))
> 		wbinvd();
> 
> No TDX, no new wbinvd(). If SME, no change.
> 
> Now, here's where it gets fun. The bool can get _cleared_ after WBINVD
> is executed on a CPU, at least on TDX systems. It then also needs to get
> set after TDX might dirty a cacheline.
> 
> 	TDCALL(); // dirties stuff
> 	per_cpu(newbool) = 1;
> 
> Then you can also do this on_each_cpu():
> 
> 	wbinvd();
> 	per_cpu(newbool) = 0;
> 
> hopefully at point after you're sure no more TDCALLs are being made. If
> you screw it up, no biggie: the kexec-time one will make up for it,
> exposing TDX systems to the kexec timing bugs. But if the on_each_cpu()
> thing works in the common case, you get no additional bug exposure.

Kai and I have been discussing this internally. Here I'll try to move the
discussion external and continue it.

The problem with this approach turned out to be that the 'per_cpu(newbool) = 1'
part is racy with the SEAMCALL: the task could be rescheduled between the
SEAMCALL and the per-cpu store, leaving the flag set on the wrong CPU.
Disabling preemption around every SEAMCALL should be possible, but it seems a
bit heavyweight for what originally appeared to be an easy way to reduce (but
not eliminate) the chances of hitting the race.

What if instead 'newbool' was a global? For SME it can be set at boot. For TDX,
we can minimize the chances of the race in a lesser way. TDX only needs to
WBINVD in stop_this_cpu() when:
1. TDX has been late loaded by KVM (currently via module parameter, but maybe
someday when the first TD is loaded)
2. native_stop_other_cpus() is happening as part of a kexec.

There is already special-case logic for kexec in native_stop_other_cpus(), so
we just need to add a check of whether TDX is loaded, like this:

diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 962c3ce39323..147f784d6d0f 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -818,19 +818,7 @@ void __noreturn stop_this_cpu(void *dummy)
        disable_local_APIC();
        mcheck_cpu_clear(c);
 
-       /*
-        * Use wbinvd on processors that support SME. This provides support
-        * for performing a successful kexec when going from SME inactive
-        * to SME active (or vice-versa). The cache must be cleared so that
-        * if there are entries with the same physical address, both with and
-        * without the encryption bit, they don't race each other when flushed
-        * and potentially end up with the wrong entry being committed to
-        * memory.
-        *
-        * Test the CPUID bit directly because the machine might've cleared
-        * X86_FEATURE_SME due to cmdline options.
-        */
-       if (c->extended_cpuid_level >= 0x8000001f && (cpuid_eax(0x8000001f) & BIT(0)))
+       if (cache_state_incoherent)
                wbinvd();
 
        /*
diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
index 18266cc3d98c..a254f8842ca6 100644
--- a/arch/x86/kernel/smp.c
+++ b/arch/x86/kernel/smp.c
@@ -160,10 +160,19 @@ static void native_stop_other_cpus(int wait)
        if (!atomic_try_cmpxchg(&stopping_cpu, &old_cpu, this_cpu))
                return;
 
-       /* For kexec, ensure that offline CPUs are out of MWAIT and in HLT */
-       if (kexec_in_progress)
+       /*
+        * For kexec, ensure that offline CPUs are out of MWAIT and in HLT.
+        * Also, TDX needs to flush caches when TDX is loaded for the kexec
+        * case. Make sure stop_this_cpu() knows to do this. Otherwise, skip
+        * it so as not to increase the chances of the NMI-cpumask race.
+        */
+       if (kexec_in_progress) {
                smp_kick_mwait_play_dead();
 
+               if (tdx_check_load_and_block())
+                       cache_state_incoherent = true;
+       }
+
        /*
         * 1) Send an IPI on the reboot vector to all other CPUs.
         *
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index af8798bc62ed..6cca847e7a9c 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -1497,6 +1497,18 @@ const struct tdx_sys_info *tdx_get_sysinfo(void)
 }
 EXPORT_SYMBOL_GPL(tdx_get_sysinfo);
 
+/*
+ * Calling this blocks TDX from ever getting loaded; only for use
+ * during shutdown.
+ */
+bool tdx_check_load_and_block(void)
+{
+       if (!mutex_trylock(&tdx_module_lock))
+               return true;
+
+       return tdx_module_status != TDX_MODULE_UNINITIALIZED;
+}
+
 u32 tdx_get_nr_guest_keyids(void)
 {
        return tdx_nr_guest_keyids;



This would leave things such that kexec is generally expected to work whenever
TDX is enabled, but the race is reduced when it is possible to tell that TDX is
not loaded. It pollutes the layers a bit with assumptions about how things are
called, but it's sort of aligned with how kexec_in_progress already works.

Otherwise, maybe we just go with an SME-like approach that limits the WBINVD to
certain systems more statically (i.e.
boot_cpu_has(X86_FEATURE_TDX_HOST_PLATFORM)). This was pretty much the original
approach way back. It reduces race exposure too, in the scope of all kernels
out there, and has no layering violations.
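
For reference, that static alternative would look something like this in
stop_this_cpu() (a non-runnable fragment; X86_FEATURE_TDX_HOST_PLATFORM is the
existing feature bit, and whether this is the exact placement is an
assumption):

```c
	/*
	 * Flush whenever the platform could have run TDX at all,
	 * regardless of whether the TDX module was ever loaded.
	 */
	if (boot_cpu_has(X86_FEATURE_TDX_HOST_PLATFORM))
		wbinvd();
```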


