Message-ID: <20250703153712.155600-3-adrian.hunter@intel.com>
Date: Thu, 3 Jul 2025 18:37:12 +0300
From: Adrian Hunter <adrian.hunter@...el.com>
To: Dave Hansen <dave.hansen@...ux.intel.com>,
pbonzini@...hat.com,
seanjc@...gle.com,
vannapurve@...gle.com
Cc: Tony Luck <tony.luck@...el.com>,
Borislav Petkov <bp@...en8.de>,
Thomas Gleixner <tglx@...utronix.de>,
Ingo Molnar <mingo@...hat.com>,
x86@...nel.org,
H Peter Anvin <hpa@...or.com>,
linux-kernel@...r.kernel.org,
kvm@...r.kernel.org,
rick.p.edgecombe@...el.com,
kirill.shutemov@...ux.intel.com,
kai.huang@...el.com,
reinette.chatre@...el.com,
xiaoyao.li@...el.com,
tony.lindgren@...ux.intel.com,
binbin.wu@...ux.intel.com,
isaku.yamahata@...el.com,
yan.y.zhao@...el.com,
chao.gao@...el.com
Subject: [PATCH V2 2/2] x86/tdx: Skip clearing reclaimed pages unless X86_BUG_TDX_PW_MCE is present
Avoid clearing reclaimed TDX private pages unless the platform is affected
by the X86_BUG_TDX_PW_MCE erratum. This significantly reduces VM shutdown
time on unaffected systems.
Background
KVM currently clears reclaimed TDX private pages using MOVDIR64B, which:
- Clears the TD Owner bit (which identifies TDX private memory) and
integrity metadata without triggering integrity violations.
- Clears poison from cache lines without consuming it, avoiding MCEs on
access (see TDX Module Base spec, section 16.5, "Handling Machine Check
Events during Guest TD Operation").
The TDX module also uses MOVDIR64B to initialize private pages before use.
If cache flushing is needed, it sets TDX_FEATURES.CLFLUSH_BEFORE_ALLOC.
However, KVM currently flushes unconditionally; see commit 94c477a751c7b
("x86/virt/tdx: Add SEAMCALL wrappers to add TD private pages").
In contrast, when private pages are reclaimed, the TDX Module handles
flushing via the TDH.PHYMEM.CACHE.WB SEAMCALL.
Problem
Clearing all private pages during VM shutdown is costly. For guests
with a large amount of memory it can take minutes.
Solution
The TDX Module Base Architecture spec documents that private pages
reclaimed from a TD should be initialized using MOVDIR64B, in order to
avoid integrity violation or TD bit mismatch detection when later being
read using a shared HKID; see the April 2025 spec, "Page Initialization"
in section "8.6.2. Platforms not Using ACT: Required Cache Flush and
Initialization by the Host VMM".
That is an overstatement and will be clarified in coming versions of the
spec. In fact, as outlined in "Table 16.2: Non-ACT Platforms Checks on
Memory" and "Table 16.3: Non-ACT Platforms Checks on Memory Reads in Li
Mode" in the same spec, there is no issue accessing such reclaimed pages
using a shared key that does not have integrity enabled. Linux always uses
KeyID 0, which never has integrity enabled. KeyID 0 is also the TME KeyID,
which disallows integrity; see the "TME Policy/Encryption Algorithm" bit
description in the "Intel Architecture Memory Encryption Technologies"
spec, version 1.6, April 2025. So there is no need to clear pages to avoid
integrity violations.
There remains a risk of poison consumption. However, in the context of
TDX, it is expected that there would be a machine check associated with the
original poisoning. On some platforms that results in a panic. Other
platforms may support the "SEAM_NR" Machine Check capability, in which case
the Linux machine check handler marks the page as poisoned, which prevents
it from being allocated again; see commit 7911f145de5fe ("x86/mce:
Implement recovery for errors in TDX/SEAM non-root mode").
Improvement
By skipping the clearing step on unaffected platforms, shutdown time
can improve by up to 40%.
On platforms with the X86_BUG_TDX_PW_MCE erratum (SPR and EMR), continue
clearing, because these platforms may trigger poison on partial writes to
previously-private pages, even with KeyID 0; see commit 1e536e1068970
("x86/cpu: Detect TDX partial write machine check erratum").
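As a rough illustration of the policy described above, here is a minimal
user-space sketch. It is not kernel code and makes several assumptions:
model_has_pw_mce_bug and model_quirk_reset() are hypothetical names, the
flag stands in for boot_cpu_has_bug(X86_BUG_TDX_PW_MCE), and memset()
stands in for movdir64b(). It only shows that the clearing loop issues one
64-byte store per 64 bytes of memory and is skipped entirely when the
erratum is absent.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* The real loop advances 64 bytes per MOVDIR64B store. */
#define CLEAR_CHUNK 64

/* Hypothetical stand-in for boot_cpu_has_bug(X86_BUG_TDX_PW_MCE). */
static bool model_has_pw_mce_bug;

/*
 * User-space model of the patched clearing policy. Returns the number
 * of 64-byte stores issued. memset() is an assumption standing in for
 * movdir64b(); size is assumed to be a multiple of CLEAR_CHUNK.
 */
static size_t model_quirk_reset(uint8_t *base, size_t size)
{
	size_t off, stores = 0;

	/* Unaffected platforms: skip the clearing loop altogether. */
	if (!model_has_pw_mce_bug)
		return 0;

	/* Affected platforms: clear every 64-byte chunk explicitly. */
	for (off = 0; off < size; off += CLEAR_CHUNK) {
		memset(base + off, 0, CLEAR_CHUNK);
		stores++;
	}

	return stores;
}
```

In this model a 4096-byte page costs 64 stores when the flag is set and
none otherwise, which is where the shutdown-time saving comes from.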
Signed-off-by: Adrian Hunter <adrian.hunter@...el.com>
---
Changes in V2:
Improve the comment
arch/x86/virt/vmx/tdx/tdx.c | 8 ++++++++
1 file changed, 8 insertions(+)
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index 14d93ed05bd2..4fa86188aa40 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -642,6 +642,14 @@ void tdx_quirk_reset_paddr(unsigned long base, unsigned long size)
 	const void *zero_page = (const void *)page_address(ZERO_PAGE(0));
 	unsigned long phys, end;
 
+	/*
+	 * Typically, any write to the page will convert it from TDX
+	 * private back to normal kernel memory. Systems with the
+	 * erratum need to do the conversion explicitly.
+	 */
+	if (!boot_cpu_has_bug(X86_BUG_TDX_PW_MCE))
+		return;
+
 	end = base + size;
 	for (phys = base; phys < end; phys += 64)
 		movdir64b(__va(phys), zero_page);
--
2.48.1