linux-kernel - [PATCH v4 7/7] KVM: TDX: Explicitly do WBINVD when no more TDX SEAMCALLs

Message-ID: <a85faae420b2a52bd8a3022900ffa8351c66d1ed.1752730040.git.kai.huang@intel.com>
Date: Fri, 18 Jul 2025 09:46:44 +1200
From: Kai Huang <kai.huang@...el.com>
To: dave.hansen@...el.com,
	bp@...en8.de,
	tglx@...utronix.de,
	peterz@...radead.org,
	mingo@...hat.com,
	hpa@...or.com,
	thomas.lendacky@....com
Cc: x86@...nel.org,
	kas@...nel.org,
	rick.p.edgecombe@...el.com,
	dwmw@...zon.co.uk,
	linux-kernel@...r.kernel.org,
	pbonzini@...hat.com,
	seanjc@...gle.com,
	kvm@...r.kernel.org,
	reinette.chatre@...el.com,
	isaku.yamahata@...el.com,
	dan.j.williams@...el.com,
	ashish.kalra@....com,
	nik.borisov@...e.com,
	chao.gao@...el.com,
	sagis@...gle.com,
	Farrah Chen <farrah.chen@...el.com>,
	Binbin Wu <binbin.wu@...ux.intel.com>
Subject: [PATCH v4 7/7] KVM: TDX: Explicitly do WBINVD when no more TDX SEAMCALLs

On TDX platforms, the kernel must make sure during kexec that no dirty
cachelines of TDX private memory remain before booting into the new
kernel; otherwise the new kernel's memory can be silently corrupted.

During kexec, the kexec-ing CPU first invokes native_stop_other_cpus()
to stop all remote CPUs before booting into the new kernel.  The remote
CPUs then execute stop_this_cpu() to stop themselves.

The kernel has a percpu boolean that indicates whether the cache of a
CPU may be in an incoherent state.  In stop_this_cpu(), the kernel does
WBINVD if that percpu boolean is true.

TDX sets that percpu boolean on a CPU when the kernel makes a SEAMCALL
on it.  This makes sure the caches will be flushed during kexec.
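
As a rough illustration of the flag mechanism described above, the
following userspace C model is a sketch only and is not kernel code.
The names model_seamcall(), model_stop_this_cpu(), cache_incoherent[]
and wbinvd_count[] are hypothetical, NR_CPUS is an arbitrary value, and
a counter stands in for the WBINVD instruction.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS 4	/* hypothetical CPU count for this model */

/* Models the percpu "cache may be incoherent" boolean. */
static bool cache_incoherent[NR_CPUS];

/* Counts how often each modelled CPU would have executed WBINVD. */
static int wbinvd_count[NR_CPUS];

/* Models a SEAMCALL on @cpu: the cache may now hold dirty private lines. */
void model_seamcall(int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	cache_incoherent[cpu] = true;
}

/* Models stop_this_cpu(): flush only if the flag is set on @cpu. */
void model_stop_this_cpu(int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	if (cache_incoherent[cpu])
		wbinvd_count[cpu]++;	/* stands in for WBINVD */
}
```

In this model only a CPU that has made a SEAMCALL pays the cost of the
flush when it is stopped.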

However, native_stop_other_cpus() and stop_this_cpu() have a "race"
that is extremely rare but could cause the system to hang.

Specifically, native_stop_other_cpus() first sends a normal reboot IPI
to the remote CPUs and waits one second for them to stop.  If that
times out, native_stop_other_cpus() then sends NMIs to the remote CPUs
to stop them.
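
The two-stage escalation described above can be sketched with a small
userspace C model.  This is an assumption-laden simplification, not the
kernel implementation: struct model_cpu, model_stop_other_cpus() and
the per-CPU stop_latency_ms field are hypothetical, and the one-second
wait is modelled as a simple comparison against each CPU's latency.

```c
#include <assert.h>
#include <stdbool.h>

#define IPI_TIMEOUT_MS 1000	/* the one-second wait from the text */

/* Hypothetical model of one remote CPU. */
struct model_cpu {
	int stop_latency_ms;	/* time needed to stop after the reboot IPI */
	bool stopped;
	bool got_nmi;
};

/*
 * Stage 1: send the reboot IPI and wait up to IPI_TIMEOUT_MS.
 * Stage 2: send an NMI to every CPU that has not stopped by then.
 * Returns the number of CPUs that needed an NMI.
 */
int model_stop_other_cpus(struct model_cpu *cpus, int nr)
{
	int i, nmis = 0;

	assert(nr >= 0);

	for (i = 0; i < nr; i++) {
		if (cpus[i].stop_latency_ms <= IPI_TIMEOUT_MS)
			cpus[i].stopped = true;
	}

	for (i = 0; i < nr; i++) {
		if (cpus[i].stopped)
			continue;
		cpus[i].got_nmi = true;
		cpus[i].stopped = true;
		nmis++;
	}

	return nmis;
}
```

In this model, anything that lengthens a CPU's stop path, such as a
WBINVD, raises stop_latency_ms and makes the NMI stage more likely.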

The aforementioned race happens when NMIs are sent.  Doing WBINVD in
stop_this_cpu() makes each CPU take longer to stop and increases the
chance of the race happening.

Explicitly flush the cache in tdx_disable_virtualization_cpu(), after
which no more TDX activity can happen on this CPU.  This moves the
WBINVD to an earlier stage than stop_this_cpu(), avoiding a possibly
lengthy operation at a time when it could cause this race.
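
The intended effect on ordering can be sketched with a single-CPU
userspace C model.  This is illustrative only: sim_seamcall(),
sim_tdx_cpu_flush_cache(), sim_stop_this_cpu() and the two counters are
hypothetical names, and the counters stand in for WBINVD.

```c
#include <assert.h>
#include <stdbool.h>

/* Models the percpu "cache may be incoherent" boolean for one CPU. */
static bool cache_incoherent;

/* Modelled WBINVDs done early vs. in the stop path. */
static int flushes_early;
static int flushes_in_stop;

/* Models a SEAMCALL: the cache may now be incoherent. */
void sim_seamcall(void)
{
	cache_incoherent = true;
}

/* Models the early flush: flush once, then clear the flag. */
void sim_tdx_cpu_flush_cache(void)
{
	if (!cache_incoherent)
		return;

	flushes_early++;		/* stands in for WBINVD */
	cache_incoherent = false;
	assert(!cache_incoherent);
}

/* Models stop_this_cpu(): flush only if the flag is still set. */
void sim_stop_this_cpu(void)
{
	if (cache_incoherent)
		flushes_in_stop++;	/* stands in for WBINVD */
}
```

Calling sim_seamcall(), then sim_tdx_cpu_flush_cache(), then
sim_stop_this_cpu() leaves the stop path without a flush in this model,
which is the behaviour the patch aims for.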

Signed-off-by: Kai Huang <kai.huang@...el.com>
Acked-by: Paolo Bonzini <pbonzini@...hat.com>
Tested-by: Farrah Chen <farrah.chen@...el.com>
Reviewed-by: Binbin Wu <binbin.wu@...ux.intel.com>
---

v3 -> v4:
 - Move the wbinvd() from the reboot notifier to
   tdx_disable_virtualization_cpu() to cover the case where more
   SEAMCALLs can be made after the cache flush, i.e., doing kexec when
   there's a TD alive. -- Chao
 - Add a check to skip wbinvd() if the boolean is false. -- Chao
 - Fix a typo in the comment. -- Binbin

---
 arch/x86/include/asm/tdx.h  |  2 ++
 arch/x86/kvm/vmx/tdx.c      | 12 ++++++++++++
 arch/x86/virt/vmx/tdx/tdx.c | 12 ++++++++++++
 3 files changed, 26 insertions(+)

diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 6865f62436ad..5f1f4e8594c0 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -221,6 +221,7 @@ u64 tdh_mem_page_remove(struct tdx_td *td, u64 gpa, u64 level, u64 *ext_err1, u6
 u64 tdh_phymem_cache_wb(bool resume);
 u64 tdh_phymem_page_wbinvd_tdr(struct tdx_td *td);
 u64 tdh_phymem_page_wbinvd_hkid(u64 hkid, struct page *page);
+void tdx_cpu_flush_cache(void);
 #else
 static inline void tdx_init(void) { }
 static inline int tdx_cpu_enable(void) { return -ENODEV; }
@@ -228,6 +229,7 @@ static inline int tdx_enable(void)  { return -ENODEV; }
 static inline u32 tdx_get_nr_guest_keyids(void) { return 0; }
 static inline const char *tdx_dump_mce_info(struct mce *m) { return NULL; }
 static inline const struct tdx_sys_info *tdx_get_sysinfo(void) { return NULL; }
+static inline void tdx_cpu_flush_cache(void) { }
 #endif	/* CONFIG_INTEL_TDX_HOST */
 
 #endif /* !__ASSEMBLER__ */
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index f31ccdeb905b..478baaa1bfb5 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -444,6 +444,18 @@ void tdx_disable_virtualization_cpu(void)
 		tdx_flush_vp(&arg);
 	}
 	local_irq_restore(flags);
+
+	/*
+	 * No more TDX activity on this CPU from here.  Flush cache to
+	 * avoid having to do WBINVD in stop_this_cpu() during kexec.
+	 *
+	 * Kexec calls native_stop_other_cpus() to stop remote CPUs
+	 * before booting to new kernel, but that code has a "race"
+	 * when the normal REBOOT IPI times out and NMIs are sent to
+	 * remote CPUs to stop them.  Doing WBINVD in stop_this_cpu()
+	 * could potentially increase the possibility of the "race".
+	 */
+	tdx_cpu_flush_cache();
 }
 
 #define TDX_SEAMCALL_RETRIES 10000
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index d6ee4e5a75d2..c098a6e0382b 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -1870,3 +1870,15 @@ u64 tdh_phymem_page_wbinvd_hkid(u64 hkid, struct page *page)
 	return seamcall(TDH_PHYMEM_PAGE_WBINVD, &args);
 }
 EXPORT_SYMBOL_GPL(tdh_phymem_page_wbinvd_hkid);
+
+void tdx_cpu_flush_cache(void)
+{
+	lockdep_assert_preemption_disabled();
+
+	if (!this_cpu_read(cache_state_incoherent))
+		return;
+
+	wbinvd();
+	this_cpu_write(cache_state_incoherent, false);
+}
+EXPORT_SYMBOL_GPL(tdx_cpu_flush_cache);
-- 
2.50.0