Message-ID: <d1cd2b34-9a3f-4dfc-93a2-2a20e9f16e1d@intel.com>
Date: Fri, 4 Jul 2025 09:32:07 +0800
From: Xiaoyao Li <xiaoyao.li@...el.com>
To: Adrian Hunter <adrian.hunter@...el.com>,
Dave Hansen <dave.hansen@...ux.intel.com>, pbonzini@...hat.com,
seanjc@...gle.com, vannapurve@...gle.com
Cc: Tony Luck <tony.luck@...el.com>, Borislav Petkov <bp@...en8.de>,
Thomas Gleixner <tglx@...utronix.de>, Ingo Molnar <mingo@...hat.com>,
x86@...nel.org, H Peter Anvin <hpa@...or.com>, linux-kernel@...r.kernel.org,
kvm@...r.kernel.org, rick.p.edgecombe@...el.com,
kirill.shutemov@...ux.intel.com, kai.huang@...el.com,
reinette.chatre@...el.com, tony.lindgren@...ux.intel.com,
binbin.wu@...ux.intel.com, isaku.yamahata@...el.com, yan.y.zhao@...el.com,
chao.gao@...el.com
Subject: Re: [PATCH V2 2/2] x86/tdx: Skip clearing reclaimed pages unless
X86_BUG_TDX_PW_MCE is present
On 7/3/2025 11:37 PM, Adrian Hunter wrote:
> Avoid clearing reclaimed TDX private pages unless the platform is affected
> by the X86_BUG_TDX_PW_MCE erratum. This significantly reduces VM shutdown
> time on unaffected systems.
>
> Background
>
> KVM currently clears reclaimed TDX private pages using MOVDIR64B, which:
>
> - Clears the TD Owner bit (which identifies TDX private memory) and
>   integrity metadata without triggering integrity violations.
> - Clears poison from cache lines without consuming it, avoiding MCEs on
>   access (refer TDX Module Base spec. 16.5. Handling Machine Check
>   Events during Guest TD Operation).
>
> The TDX module also uses MOVDIR64B to initialize private pages before use.
> If cache flushing is needed, it sets TDX_FEATURES.CLFLUSH_BEFORE_ALLOC.
> However, KVM currently flushes unconditionally, refer commit 94c477a751c7b
> ("x86/virt/tdx: Add SEAMCALL wrappers to add TD private pages")
>
> In contrast, when private pages are reclaimed, the TDX Module handles
> flushing via the TDH.PHYMEM.CACHE.WB SEAMCALL.
>
> Problem
>
> Clearing all private pages during VM shutdown is costly. For guests
> with a large amount of memory it can take minutes.
>
> Solution
>
> TDX Module Base Architecture spec. documents that private pages reclaimed
> from a TD should be initialized using MOVDIR64B, in order to avoid
> integrity violation or TD bit mismatch detection when later being read
> using a shared HKID, refer April 2025 spec. "Page Initialization" in
> section "8.6.2. Platforms not Using ACT: Required Cache Flush and
> Initialization by the Host VMM"
>
> That is an overstatement and will be clarified in coming versions of the
> spec. In fact, as outlined in "Table 16.2: Non-ACT Platforms Checks on
> Memory" and "Table 16.3: Non-ACT Platforms Checks on Memory Reads in Li
> Mode" in the same spec, there is no issue accessing such reclaimed pages
> using a shared key that does not have integrity enabled. Linux always uses
> KeyID 0 which never has integrity enabled. KeyID 0 is also the TME KeyID
> which disallows integrity, refer "TME Policy/Encryption Algorithm" bit
> description in "Intel Architecture Memory Encryption Technologies" spec
> version 1.6 April 2025. So there is no need to clear pages to avoid
> integrity violations.
>
> There remains a risk of poison consumption. However, in the context of
> TDX, it is expected that there would be a machine check associated with the
> original poisoning. On some platforms that results in a panic. However
> platforms may support "SEAM_NR" Machine Check capability, in which case
> Linux machine check handler marks the page as poisoned, which prevents it
> from being allocated anymore, refer commit 7911f145de5fe ("x86/mce:
> Implement recovery for errors in TDX/SEAM non-root mode")
>
> Improvement
>
> By skipping the clearing step on unaffected platforms, shutdown time
> can improve by up to 40%.
>
> On platforms with the X86_BUG_TDX_PW_MCE erratum (SPR and EMR), continue
> clearing because these platforms may trigger poison on partial writes to
> previously-private pages, even with KeyID 0, refer commit 1e536e1068970
> ("x86/cpu: Detect TDX partial write machine check erratum")
>
> Signed-off-by: Adrian Hunter <adrian.hunter@...el.com>
> ---
>
>
> Changes in V2:
>
> Improve the comment
>
>
> arch/x86/virt/vmx/tdx/tdx.c | 8 ++++++++
> 1 file changed, 8 insertions(+)
>
> diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> index 14d93ed05bd2..4fa86188aa40 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.c
> +++ b/arch/x86/virt/vmx/tdx/tdx.c
> @@ -642,6 +642,14 @@ void tdx_quirk_reset_paddr(unsigned long base, unsigned long size)
> 	const void *zero_page = (const void *)page_address(ZERO_PAGE(0));
> 	unsigned long phys, end;
>
> +	/*
> +	 * Typically, any write to the page will convert it from TDX
> +	 * private back to normal kernel memory. Systems with the
> +	 * erratum need to do the conversion explicitly.
Can we call out that "systems with the erratum need to do the conversion
explicitly via MOVDIR64B"?
Without "via MOVDIR64B", the comment gives the impression that an explicit
conversion with any write is OK for systems with the erratum, and that the
following code just happens to use movdir64b().
> +	 */
> +	if (!boot_cpu_has_bug(X86_BUG_TDX_PW_MCE))
> +		return;
> +
> 	end = base + size;
> 	for (phys = base; phys < end; phys += 64)
> 		movdir64b(__va(phys), zero_page);