Message-ID: <c625a52c-f251-4540-8e8c-6cc61df8cd84@intel.com>
Date: Mon, 24 Mar 2025 12:40:12 +0200
From: Adrian Hunter <adrian.hunter@...el.com>
To: Vishal Annapurve <vannapurve@...gle.com>
CC: <pbonzini@...hat.com>, <seanjc@...gle.com>, <kvm@...r.kernel.org>,
<rick.p.edgecombe@...el.com>, <kirill.shutemov@...ux.intel.com>,
<kai.huang@...el.com>, <reinette.chatre@...el.com>, <xiaoyao.li@...el.com>,
<tony.lindgren@...ux.intel.com>, <binbin.wu@...ux.intel.com>,
<isaku.yamahata@...el.com>, <linux-kernel@...r.kernel.org>,
<yan.y.zhao@...el.com>, <chao.gao@...el.com>, David Hildenbrand
<david@...hat.com>, <michael.roth@....com>, <ackerleytng@...gle.com>,
<tabba@...gle.com>
Subject: Re: [PATCH RFC] KVM: TDX: Defer guest memory removal to decrease
shutdown time
On 20/03/25 02:42, Vishal Annapurve wrote:
> On Thu, Mar 13, 2025 at 11:17 AM Adrian Hunter <adrian.hunter@...el.com> wrote:
>>
>> Improve TDX shutdown performance by adding a more efficient shutdown
>> operation at the cost of adding separate branches for the TDX MMU
>> operations for normal runtime and shutdown. This more efficient method was
>> previously used in earlier versions of the TDX patches, but was removed to
>> simplify the initial upstreaming. This is an RFC, and still needs a proper
>> upstream commit log. It is intended to be an eventual follow up to base
>> support.
>>
>> == Background ==
>>
>> TDX has 2 methods for the host to reclaim guest private memory, depending
>> on whether the TD (TDX VM) is in a runnable state or not. These are
>> called, respectively:
>> 1. Dynamic Page Removal
>> 2. Reclaiming TD Pages in TD_TEARDOWN State
>>
>> Dynamic Page Removal is much slower. Reclaiming a 4K page in TD_TEARDOWN
>> state can be 5 times faster, although that is just one part of TD shutdown.
>>
>> == Relevant TDX Architecture ==
>>
>> Dynamic Page Removal is slow because it has to potentially deal with a
>> running TD, and so involves a number of steps:
>> Block further address translation
>> Exit each VCPU
>> Clear Secure EPT entry
>> Flush/write-back/invalidate relevant caches
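
As a rough illustration of how those four steps map onto SEAMCALLs in
the current KVM TDX code (refer tdx_sept_zap_private_spte() and
tdx_sept_remove_private_spte() below); sketch only, with simplified
signatures and no error handling:

	/* Block further address translation: TDH.MEM.RANGE.BLOCK */
	tdh_mem_range_block(kvm, gfn, level);

	/*
	 * Exit each VCPU: bump the TD's TLB epoch (TDH.MEM.TRACK), then
	 * kick every vCPU out of the TD so stale translations are gone
	 * before re-entry.
	 */
	tdh_mem_track(kvm);
	kvm_make_all_cpus_request(kvm, KVM_REQ_OUTSIDE_GUEST_MODE);

	/* Clear the Secure EPT entry and remove the page: TDH.MEM.PAGE.REMOVE */
	tdh_mem_page_remove(kvm, gfn, level);

	/* Flush/write-back/invalidate caches: TDH.PHYMEM.PAGE.WBINVD */
	tdh_phymem_page_wbinvd(page);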
>>
>> Reclaiming TD Pages in TD_TEARDOWN State is fast because the shutdown
>> procedure (refer tdx_mmu_release_hkid()) has already performed the relevant
>> flushing. For details, see TDX Module Base Spec October 2024 sections:
>>
>> 7.5. TD Teardown Sequence
>> 5.6.3. TD Keys Reclamation, TLB and Cache Flush
>>
>> Essentially all that remains then is to take each page away from the
>> TDX Module and return it to the kernel.
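
In other words, per-page reclaim in TD_TEARDOWN state reduces to
something like this (sketch only, simplified signatures, retries and
error handling omitted; refer tdx_reclaim_page() in the diff below):

	/* Take the page away from the TDX Module: TDH.PHYMEM.PAGE.RECLAIM */
	ret = tdh_phymem_page_reclaim(page);
	if (ret)
		return ret;

	/* Scrub the page before handing it back to the kernel */
	tdx_clear_page(page);
	return 0;

No blocking, no vCPU kicks, and no per-page cache flushing are needed.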
>>
>> == Problem ==
>>
>> Currently, Dynamic Page Removal is being used when the TD is being
>> shut down, for the sake of having simpler initial code.
>>
>> This happens when guest_memfds are closed, refer kvm_gmem_release().
>> guest_memfds hold a reference to struct kvm, so VM destruction cannot
>> happen until after they are released (also in kvm_gmem_release()).
>>
>> Reclaiming TD Pages in TD_TEARDOWN State was seen to decrease the total
>> reclaim time. For example:
>>
>>   VCPUs    Size (GB)    Before (secs)    After (secs)
>>       4           18               72              24
>>      32          107              517             134
>>
>> Note, the V19 patch set:
>>
>> https://lore.kernel.org/all/cover.1708933498.git.isaku.yamahata@intel.com/
>>
>> did not have this issue because the HKID was released early, something that
>> Sean effectively NAK'ed:
>>
>> "No, the right answer is to not release the HKID until the VM is
>> destroyed."
>>
>> https://lore.kernel.org/all/ZN+1QHGa6ltpQxZn@google.com/
>>
>> That was taken on board in the "TDX MMU Part 2" patch set. Refer
>> "Moving of tdx_mmu_release_hkid()" in:
>>
>> https://lore.kernel.org/kvm/20240904030751.117579-1-rick.p.edgecombe@intel.com/
>>
>> == Options ==
>>
>> 1. Start TD teardown earlier so that when pages are removed,
>> they can be reclaimed faster.
>> 2. Defer page removal until after TD teardown has started.
>> 3. A combination of 1 and 2.
>>
>> Option 1 is problematic because it means putting the TD into a non-runnable
>> state while it is potentially still active. Also, as mentioned above, Sean
>> effectively NAK'ed it.
>>
>> Option 2 is possible because the lifetime of guest memory pages is separate
>> from guest_memfd (struct kvm_gmem) lifetime.
>>
>> A reference is taken to pages when they are faulted in, refer
>> kvm_gmem_get_pfn(). That reference is not released until the pages are
>> removed from the mirror SEPT, refer tdx_unpin().
>
> Note that this logic to pin guest memory pages should go away to
> support in-place conversion for huge pages [1], though you can still
> pin inodes IIUC.
>
> [1] https://lore.kernel.org/all/CAGtprH8akKUF=8+RkX_QMjp35C0bU1zxGi4v1Zm5AWCw=8V8AQ@mail.gmail.com/
Then the virt code should handle the pinning, since it becomes
essential.

inode->i_mapping->i_private_list can be used, because we only need to
pin inodes that have no remaining gmem references. Note, though, that
this uses i_private_list as a list entry, whereas elsewhere it is used
as a list head.
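That is (illustrative comment only):

	/*
	 * Normally: inode->i_mapping->i_private_list is the *head* of the
	 *           inode's list of struct kvm_gmem bindings (gmem->entry).
	 * Here:     once the last gmem binding is gone, the (now empty)
	 *           head is itself chained as an *entry* on
	 *           kvm->gmem_pinned_inodes.
	 */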
So perhaps like this:
diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index 0d445a317f61..90e7354d6f40 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -96,6 +96,7 @@ config KVM_INTEL
depends on KVM && IA32_FEAT_CTL
select KVM_GENERIC_PRIVATE_MEM if INTEL_TDX_HOST
select KVM_GENERIC_MEMORY_ATTRIBUTES if INTEL_TDX_HOST
+ select HAVE_KVM_GMEM_DEFER_REMOVAL if INTEL_TDX_HOST
help
Provides support for KVM on processors equipped with Intel's VT
extensions, a.k.a. Virtual Machine Extensions (VMX).
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 1f354b3dbc28..acb022291aec 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -5,6 +5,7 @@
#include <asm/fpu/xcr.h>
#include <linux/misc_cgroup.h>
#include <linux/mmu_context.h>
+#include <linux/fs.h>
#include <asm/tdx.h>
#include "capabilities.h"
#include "mmu.h"
@@ -626,6 +627,7 @@ int tdx_vm_init(struct kvm *kvm)
kvm->arch.has_protected_state = true;
kvm->arch.has_private_mem = true;
kvm->arch.disabled_quirks |= KVM_X86_QUIRK_IGNORE_GUEST_PAT;
+ kvm->gmem_defer_removal = true;
/*
* Because guest TD is protected, VMM can't parse the instruction in TD.
@@ -1839,19 +1841,28 @@ int tdx_sept_free_private_spt(struct kvm *kvm, gfn_t gfn,
return tdx_reclaim_page(virt_to_page(private_spt));
}
+static int tdx_sept_teardown_private_spte(struct kvm *kvm, enum pg_level level, struct page *page)
+{
+ int ret;
+
+ if (level != PG_LEVEL_4K)
+ return -EINVAL;
+
+ ret = tdx_reclaim_page(page);
+ if (!ret)
+ tdx_unpin(kvm, page);
+
+ return ret;
+}
+
int tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn,
enum pg_level level, kvm_pfn_t pfn)
{
struct page *page = pfn_to_page(pfn);
int ret;
- /*
- * HKID is released after all private pages have been removed, and set
- * before any might be populated. Warn if zapping is attempted when
- * there can't be anything populated in the private EPT.
- */
- if (KVM_BUG_ON(!is_hkid_assigned(to_kvm_tdx(kvm)), kvm))
- return -EINVAL;
+ if (!is_hkid_assigned(to_kvm_tdx(kvm)))
+ return tdx_sept_teardown_private_spte(kvm, level, page);
ret = tdx_sept_zap_private_spte(kvm, gfn, level, page);
if (ret <= 0)
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index ed1968f6f841..2630b88b0983 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -860,6 +860,10 @@ struct kvm {
#ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
/* Protected by slots_locks (for writes) and RCU (for reads) */
struct xarray mem_attr_array;
+#endif
+#ifdef CONFIG_HAVE_KVM_GMEM_DEFER_REMOVAL
+ struct list_head gmem_pinned_inodes;
+ bool gmem_defer_removal;
#endif
char stats_id[KVM_STATS_NAME_SIZE];
};
diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
index 54e959e7d68f..f56a7e89116d 100644
--- a/virt/kvm/Kconfig
+++ b/virt/kvm/Kconfig
@@ -124,3 +124,7 @@ config HAVE_KVM_ARCH_GMEM_PREPARE
config HAVE_KVM_ARCH_GMEM_INVALIDATE
bool
depends on KVM_PRIVATE_MEM
+
+config HAVE_KVM_GMEM_DEFER_REMOVAL
+ bool
+ depends on KVM_PRIVATE_MEM
diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index b2aa6bf24d3a..a673da75a499 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -248,6 +248,36 @@ static long kvm_gmem_fallocate(struct file *file, int mode, loff_t offset,
return ret;
}
+static bool kvm_gmem_defer_removal(struct kvm *kvm, struct kvm_gmem *gmem,
+ struct inode *inode)
+{
+#ifdef CONFIG_HAVE_KVM_GMEM_DEFER_REMOVAL
+ if (kvm->gmem_defer_removal) {
+ list_del(&gmem->entry);
+ if (list_empty(&inode->i_mapping->i_private_list)) {
+ list_add_tail(&inode->i_mapping->i_private_list,
+ &kvm->gmem_pinned_inodes);
+ ihold(inode);
+ }
+ return true;
+ }
+#endif
+ return false;
+}
+
+#ifdef CONFIG_HAVE_KVM_GMEM_DEFER_REMOVAL
+void kvm_gmem_unpin_inodes(struct kvm *kvm)
+{
+ struct address_space *mapping, *n;
+
+ list_for_each_entry_safe(mapping, n, &kvm->gmem_pinned_inodes,
+ i_private_list) {
+ list_del_init(&mapping->i_private_list);
+ iput(mapping->host);
+ }
+}
+#endif
+
static int kvm_gmem_release(struct inode *inode, struct file *file)
{
struct kvm_gmem *gmem = file->private_data;
@@ -275,15 +305,18 @@ static int kvm_gmem_release(struct inode *inode, struct file *file)
xa_for_each(&gmem->bindings, index, slot)
WRITE_ONCE(slot->gmem.file, NULL);
- /*
- * All in-flight operations are gone and new bindings can be created.
- * Zap all SPTEs pointed at by this file. Do not free the backing
- * memory, as its lifetime is associated with the inode, not the file.
- */
- kvm_gmem_invalidate_begin(gmem, 0, -1ul);
- kvm_gmem_invalidate_end(gmem, 0, -1ul);
+ if (!kvm_gmem_defer_removal(kvm, gmem, inode)) {
+ /*
+ * All in-flight operations are gone and new bindings can be
+ * created. Zap all SPTEs pointed at by this file. Do not free
+ * the backing memory, as its lifetime is associated with the
+ * inode, not the file.
+ */
+ kvm_gmem_invalidate_begin(gmem, 0, -1ul);
+ kvm_gmem_invalidate_end(gmem, 0, -1ul);
- list_del(&gmem->entry);
+ list_del(&gmem->entry);
+ }
filemap_invalidate_unlock(inode->i_mapping);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 08f237bf4107..5c15828b86fb 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1195,6 +1195,10 @@ static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)
preempt_notifier_inc();
kvm_init_pm_notifier(kvm);
+#ifdef CONFIG_HAVE_KVM_GMEM_DEFER_REMOVAL
+ INIT_LIST_HEAD(&kvm->gmem_pinned_inodes);
+#endif
+
return kvm;
out_err_no_debugfs:
@@ -1299,6 +1303,9 @@ static void kvm_destroy_vm(struct kvm *kvm)
cleanup_srcu_struct(&kvm->srcu);
#ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
xa_destroy(&kvm->mem_attr_array);
+#endif
+#ifdef CONFIG_HAVE_KVM_GMEM_DEFER_REMOVAL
+ kvm_gmem_unpin_inodes(kvm);
#endif
kvm_arch_free_vm(kvm);
preempt_notifier_dec();
diff --git a/virt/kvm/kvm_mm.h b/virt/kvm/kvm_mm.h
index acef3f5c582a..59562a355488 100644
--- a/virt/kvm/kvm_mm.h
+++ b/virt/kvm/kvm_mm.h
@@ -73,6 +73,9 @@ int kvm_gmem_create(struct kvm *kvm, struct kvm_create_guest_memfd *args);
int kvm_gmem_bind(struct kvm *kvm, struct kvm_memory_slot *slot,
unsigned int fd, loff_t offset);
void kvm_gmem_unbind(struct kvm_memory_slot *slot);
+#ifdef CONFIG_HAVE_KVM_GMEM_DEFER_REMOVAL
+void kvm_gmem_unpin_inodes(struct kvm *kvm);
+#endif /* CONFIG_HAVE_KVM_GMEM_DEFER_REMOVAL */
#else
static inline void kvm_gmem_init(struct module *module)
{