Message-ID: <c625a52c-f251-4540-8e8c-6cc61df8cd84@intel.com>
Date: Mon, 24 Mar 2025 12:40:12 +0200
From: Adrian Hunter <adrian.hunter@...el.com>
To: Vishal Annapurve <vannapurve@...gle.com>
CC: <pbonzini@...hat.com>, <seanjc@...gle.com>, <kvm@...r.kernel.org>,
	<rick.p.edgecombe@...el.com>, <kirill.shutemov@...ux.intel.com>,
	<kai.huang@...el.com>, <reinette.chatre@...el.com>, <xiaoyao.li@...el.com>,
	<tony.lindgren@...ux.intel.com>, <binbin.wu@...ux.intel.com>,
	<isaku.yamahata@...el.com>, <linux-kernel@...r.kernel.org>,
	<yan.y.zhao@...el.com>, <chao.gao@...el.com>, David Hildenbrand
	<david@...hat.com>, <michael.roth@....com>, <ackerleytng@...gle.com>,
	<tabba@...gle.com>
Subject: Re: [PATCH RFC] KVM: TDX: Defer guest memory removal to decrease
 shutdown time

On 20/03/25 02:42, Vishal Annapurve wrote:
> On Thu, Mar 13, 2025 at 11:17 AM Adrian Hunter <adrian.hunter@...el.com> wrote:
>>
>> Improve TDX shutdown performance by adding a more efficient shutdown
>> operation at the cost of adding separate branches for the TDX MMU
>> operations for normal runtime and shutdown.  This more efficient method was
>> previously used in earlier versions of the TDX patches, but was removed to
>> simplify the initial upstreaming.  This is an RFC, and still needs a proper
>> upstream commit log. It is intended to be an eventual follow up to base
>> support.
>>
>> == Background ==
>>
>> TDX has 2 methods for the host to reclaim guest private memory, depending
>> on whether the TD (TDX VM) is in a runnable state or not.  These are
>> called, respectively:
>>   1. Dynamic Page Removal
>>   2. Reclaiming TD Pages in TD_TEARDOWN State
>>
>> Dynamic Page Removal is much slower.  Reclaiming a 4K page in TD_TEARDOWN
>> state can be 5 times faster, although that is just one part of TD shutdown.
>>
>> == Relevant TDX Architecture ==
>>
>> Dynamic Page Removal is slow because it potentially has to deal with a
>> running TD, and so involves a number of steps, sketched below:
>>         Block further address translation
>>         Exit each VCPU
>>         Clear Secure EPT entry
>>         Flush/write-back/invalidate relevant caches
>>
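>> As a rough sketch (hypothetical helper names standing in for the real
>> SEAMCALL wrappers in tdx.c; this is not the actual code):
>>
>>         /*
>>          * Sketch of Dynamic Page Removal.  block_private_gfn(),
>>          * remove_private_page() and writeback_invalidate_page() are
>>          * made-up names for illustration only.
>>          */
>>         static int dynamic_removal_sketch(struct kvm *kvm, gfn_t gfn,
>>                                           enum pg_level level,
>>                                           struct page *page)
>>         {
>>                 /* Block further address translation for this GFN */
>>                 block_private_gfn(kvm, gfn, level);
>>
>>                 /* Exit each VCPU so no stale translations remain live */
>>                 kvm_make_all_cpus_request(kvm, KVM_REQ_OUTSIDE_GUEST_MODE);
>>
>>                 /* Clear the Secure EPT entry */
>>                 remove_private_page(kvm, gfn, level, page);
>>
>>                 /* Flush/write-back/invalidate relevant caches */
>>                 writeback_invalidate_page(page);
>>
>>                 return 0;
>>         }
>>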
>> Reclaiming TD Pages in TD_TEARDOWN State is fast because the shutdown
>> procedure (refer tdx_mmu_release_hkid()) has already performed the relevant
>> flushing.  For details, see TDX Module Base Spec October 2024 sections:
>>
>>         7.5.   TD Teardown Sequence
>>         5.6.3. TD Keys Reclamation, TLB and Cache Flush
>>
>> Essentially all that remains then is to take each page away from the
>> TDX Module and return it to the kernel.
>>
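>> For illustration (a sketch only; refer tdx_reclaim_page() for the real
>> logic), reclaiming one page then reduces to roughly:
>>
>>         /* Sketch: reclaim one page once the TD is in TD_TEARDOWN state */
>>         static int teardown_reclaim_sketch(struct page *page)
>>         {
>>                 int ret;
>>
>>                 /* Take the page away from the TDX Module */
>>                 ret = tdx_reclaim_page(page);
>>                 if (ret)
>>                         return ret;
>>
>>                 /*
>>                  * Return it to the kernel.  No cache flushing is needed
>>                  * here because tdx_mmu_release_hkid() already did it.
>>                  */
>>                 put_page(page);
>>
>>                 return 0;
>>         }
>>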
>> == Problem ==
>>
>> Currently, Dynamic Page Removal is being used when the TD is being
>> shutdown for the sake of having simpler initial code.
>>
>> This happens when guest_memfds are closed, refer kvm_gmem_release().
>> guest_memfds hold a reference to struct kvm, so that VM destruction
>> cannot happen until after they are released.
>>
>> Reclaiming TD Pages in TD_TEARDOWN State was seen to decrease the total
>> reclaim time.  For example:
>>
>>         VCPUs   Size (GB)       Before (secs)   After (secs)
>>          4       18              72              24
>>         32      107             517             134
>>
>> Note, the V19 patch set:
>>
>>         https://lore.kernel.org/all/cover.1708933498.git.isaku.yamahata@intel.com/
>>
>> did not have this issue because the HKID was released early, something that
>> Sean effectively NAK'ed:
>>
>>         "No, the right answer is to not release the HKID until the VM is
>>         destroyed."
>>
>>         https://lore.kernel.org/all/ZN+1QHGa6ltpQxZn@google.com/
>>
>> That was taken on board in the "TDX MMU Part 2" patch set.  Refer
>> "Moving of tdx_mmu_release_hkid()" in:
>>
>>         https://lore.kernel.org/kvm/20240904030751.117579-1-rick.p.edgecombe@intel.com/
>>
>> == Options ==
>>
>>   1. Start TD teardown earlier so that when pages are removed, they
>>      can be reclaimed faster.
>>   2. Defer page removal until after TD teardown has started.
>>   3. A combination of 1 and 2.
>>
>> Option 1 is problematic because it means putting the TD into a non-runnable
>> state while it is potentially still active. Also, as mentioned above, Sean
>> effectively NAK'ed it.
>>
>> Option 2 is possible because the lifetime of guest memory pages is separate
>> from guest_memfd (struct kvm_gmem) lifetime.
>>
>> A reference is taken to pages when they are faulted in, refer
>> kvm_gmem_get_pfn().  That reference is not released until the pages are
>> removed from the mirror SEPT, refer tdx_unpin().
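>>
>> As a sketch of that reference pairing (illustrative *_sketch names,
>> not the upstream code):
>>
>>         /* kvm_gmem_get_pfn(): the page is handed out with a ref held */
>>         static kvm_pfn_t gmem_get_pfn_sketch(struct folio *folio)
>>         {
>>                 folio_get(folio);        /* pin taken at fault-in */
>>                 return folio_pfn(folio);
>>         }
>>
>>         /*
>>          * tdx_unpin(): that ref is dropped when the page leaves the
>>          * mirror SEPT.
>>          */
>>         static void tdx_unpin_sketch(struct kvm *kvm, struct page *page)
>>         {
>>                 put_page(page);          /* pin released at removal */
>>         }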
> 
> Note that this logic to pin guest memory pages should go away to
> support in-place conversion for huge pages [1], though you can still
> pin inodes IIUC.
> 
> [1] https://lore.kernel.org/all/CAGtprH8akKUF=8+RkX_QMjp35C0bU1zxGi4v1Zm5AWCw=8V8AQ@mail.gmail.com/

Then the virt code should handle the pinning, since it becomes
essential.

inode->i_mapping->i_private_list can be used because we only need to
pin inodes that have no remaining gmem references.  Note that this
uses i_private_list as a list entry here, whereas elsewhere it is
used as a list head.

So perhaps like this:

diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index 0d445a317f61..90e7354d6f40 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -96,6 +96,7 @@ config KVM_INTEL
 	depends on KVM && IA32_FEAT_CTL
 	select KVM_GENERIC_PRIVATE_MEM if INTEL_TDX_HOST
 	select KVM_GENERIC_MEMORY_ATTRIBUTES if INTEL_TDX_HOST
+	select HAVE_KVM_GMEM_DEFER_REMOVAL if INTEL_TDX_HOST
 	help
 	  Provides support for KVM on processors equipped with Intel's VT
 	  extensions, a.k.a. Virtual Machine Extensions (VMX).
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 1f354b3dbc28..acb022291aec 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -5,6 +5,7 @@
 #include <asm/fpu/xcr.h>
 #include <linux/misc_cgroup.h>
 #include <linux/mmu_context.h>
+#include <linux/fs.h>
 #include <asm/tdx.h>
 #include "capabilities.h"
 #include "mmu.h"
@@ -626,6 +627,7 @@ int tdx_vm_init(struct kvm *kvm)
 	kvm->arch.has_protected_state = true;
 	kvm->arch.has_private_mem = true;
 	kvm->arch.disabled_quirks |= KVM_X86_QUIRK_IGNORE_GUEST_PAT;
+	kvm->gmem_defer_removal = true;
 
 	/*
 	 * Because guest TD is protected, VMM can't parse the instruction in TD.
@@ -1839,19 +1841,28 @@ int tdx_sept_free_private_spt(struct kvm *kvm, gfn_t gfn,
 	return tdx_reclaim_page(virt_to_page(private_spt));
 }
 
+static int tdx_sept_teardown_private_spte(struct kvm *kvm, enum pg_level level, struct page *page)
+{
+	int ret;
+
+	if (level != PG_LEVEL_4K)
+		return -EINVAL;
+
+	ret = tdx_reclaim_page(page);
+	if (!ret)
+		tdx_unpin(kvm, page);
+
+	return ret;
+}
+
 int tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn,
 				 enum pg_level level, kvm_pfn_t pfn)
 {
 	struct page *page = pfn_to_page(pfn);
 	int ret;
 
-	/*
-	 * HKID is released after all private pages have been removed, and set
-	 * before any might be populated. Warn if zapping is attempted when
-	 * there can't be anything populated in the private EPT.
-	 */
-	if (KVM_BUG_ON(!is_hkid_assigned(to_kvm_tdx(kvm)), kvm))
-		return -EINVAL;
+	if (!is_hkid_assigned(to_kvm_tdx(kvm)))
+		return tdx_sept_teardown_private_spte(kvm, level, page);
 
 	ret = tdx_sept_zap_private_spte(kvm, gfn, level, page);
 	if (ret <= 0)
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index ed1968f6f841..2630b88b0983 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -860,6 +860,10 @@ struct kvm {
 #ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
 	/* Protected by slots_locks (for writes) and RCU (for reads) */
 	struct xarray mem_attr_array;
+#endif
+#ifdef CONFIG_HAVE_KVM_GMEM_DEFER_REMOVAL
+	struct list_head gmem_pinned_inodes;
+	bool gmem_defer_removal;
 #endif
 	char stats_id[KVM_STATS_NAME_SIZE];
 };
diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
index 54e959e7d68f..f56a7e89116d 100644
--- a/virt/kvm/Kconfig
+++ b/virt/kvm/Kconfig
@@ -124,3 +124,7 @@ config HAVE_KVM_ARCH_GMEM_PREPARE
 config HAVE_KVM_ARCH_GMEM_INVALIDATE
        bool
        depends on KVM_PRIVATE_MEM
+
+config HAVE_KVM_GMEM_DEFER_REMOVAL
+       bool
+       depends on KVM_PRIVATE_MEM
diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index b2aa6bf24d3a..a673da75a499 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -248,6 +248,36 @@ static long kvm_gmem_fallocate(struct file *file, int mode, loff_t offset,
 	return ret;
 }
 
+static bool kvm_gmem_defer_removal(struct kvm *kvm, struct kvm_gmem *gmem,
+				   struct inode *inode)
+{
+#ifdef CONFIG_HAVE_KVM_GMEM_DEFER_REMOVAL
+	if (kvm->gmem_defer_removal) {
+		list_del(&gmem->entry);
+		if (list_empty(&inode->i_mapping->i_private_list)) {
+			list_add_tail(&inode->i_mapping->i_private_list,
+				      &kvm->gmem_pinned_inodes);
+			ihold(inode);
+		}
+		return true;
+	}
+#endif
+	return false;
+}
+
+#ifdef CONFIG_HAVE_KVM_GMEM_DEFER_REMOVAL
+void kvm_gmem_unpin_inodes(struct kvm *kvm)
+{
+	struct address_space *mapping, *n;
+
+	list_for_each_entry_safe(mapping, n, &kvm->gmem_pinned_inodes,
+				 i_private_list) {
+		list_del_init(&mapping->i_private_list);
+		iput(mapping->host);
+	}
+}
+#endif
+
 static int kvm_gmem_release(struct inode *inode, struct file *file)
 {
 	struct kvm_gmem *gmem = file->private_data;
@@ -275,15 +305,18 @@ static int kvm_gmem_release(struct inode *inode, struct file *file)
 	xa_for_each(&gmem->bindings, index, slot)
 		WRITE_ONCE(slot->gmem.file, NULL);
 
-	/*
-	 * All in-flight operations are gone and new bindings can be created.
-	 * Zap all SPTEs pointed at by this file.  Do not free the backing
-	 * memory, as its lifetime is associated with the inode, not the file.
-	 */
-	kvm_gmem_invalidate_begin(gmem, 0, -1ul);
-	kvm_gmem_invalidate_end(gmem, 0, -1ul);
+	if (!kvm_gmem_defer_removal(kvm, gmem, inode)) {
+		/*
+		 * All in-flight operations are gone and new bindings can be
+		 * created.  Zap all SPTEs pointed at by this file.  Do not free
+		 * the backing memory, as its lifetime is associated with the
+		 * inode, not the file.
+		 */
+		kvm_gmem_invalidate_begin(gmem, 0, -1ul);
+		kvm_gmem_invalidate_end(gmem, 0, -1ul);
 
-	list_del(&gmem->entry);
+		list_del(&gmem->entry);
+	}
 
 	filemap_invalidate_unlock(inode->i_mapping);
 
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 08f237bf4107..5c15828b86fb 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1195,6 +1195,10 @@ static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)
 	preempt_notifier_inc();
 	kvm_init_pm_notifier(kvm);
 
+#ifdef CONFIG_HAVE_KVM_GMEM_DEFER_REMOVAL
+	INIT_LIST_HEAD(&kvm->gmem_pinned_inodes);
+#endif
+
 	return kvm;
 
 out_err_no_debugfs:
@@ -1299,6 +1303,9 @@ static void kvm_destroy_vm(struct kvm *kvm)
 	cleanup_srcu_struct(&kvm->srcu);
 #ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
 	xa_destroy(&kvm->mem_attr_array);
+#endif
+#ifdef CONFIG_HAVE_KVM_GMEM_DEFER_REMOVAL
+	kvm_gmem_unpin_inodes(kvm);
 #endif
 	kvm_arch_free_vm(kvm);
 	preempt_notifier_dec();
diff --git a/virt/kvm/kvm_mm.h b/virt/kvm/kvm_mm.h
index acef3f5c582a..59562a355488 100644
--- a/virt/kvm/kvm_mm.h
+++ b/virt/kvm/kvm_mm.h
@@ -73,6 +73,9 @@ int kvm_gmem_create(struct kvm *kvm, struct kvm_create_guest_memfd *args);
 int kvm_gmem_bind(struct kvm *kvm, struct kvm_memory_slot *slot,
 		  unsigned int fd, loff_t offset);
 void kvm_gmem_unbind(struct kvm_memory_slot *slot);
+#ifdef CONFIG_HAVE_KVM_GMEM_DEFER_REMOVAL
+void kvm_gmem_unpin_inodes(struct kvm *kvm);
+#endif /* CONFIG_HAVE_KVM_GMEM_DEFER_REMOVAL */
 #else
 static inline void kvm_gmem_init(struct module *module)
 {


