Message-ID: <CABgObfZYjdW_Mp3JB0z38-RoAdhr4rwjzb_AOYfrmwZZ0ERBsw@mail.gmail.com>
Date: Thu, 13 Mar 2025 19:39:07 +0100
From: Paolo Bonzini <pbonzini@...hat.com>
To: Adrian Hunter <adrian.hunter@...el.com>
Cc: seanjc@...gle.com, kvm@...r.kernel.org, rick.p.edgecombe@...el.com,
kirill.shutemov@...ux.intel.com, kai.huang@...el.com,
reinette.chatre@...el.com, xiaoyao.li@...el.com,
tony.lindgren@...ux.intel.com, binbin.wu@...ux.intel.com,
isaku.yamahata@...el.com, linux-kernel@...r.kernel.org, yan.y.zhao@...el.com,
chao.gao@...el.com
Subject: Re: [PATCH RFC] KVM: TDX: Defer guest memory removal to decrease
shutdown time

On Thu, Mar 13, 2025 at 7:16 PM Adrian Hunter <adrian.hunter@...el.com> wrote:
> Improve TDX shutdown performance by adding a more efficient shutdown
> operation, at the cost of separate branches in the TDX MMU operations
> for normal runtime and shutdown. This more efficient method was used in
> earlier versions of the TDX patches, but was removed to simplify the
> initial upstreaming. This is an RFC, and still needs a proper upstream
> commit log. It is intended as an eventual follow-up to base support.

In the latest code the HKID is released in kvm_arch_pre_destroy_vm().
That is before kvm_free_memslot() calls kvm_gmem_unbind(), which
results in fput() and hence kvm_gmem_release().

So, as long as userspace doesn't remove the memslots and close the
guest_memfds, shouldn't the TD_TEARDOWN method be usable?
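
Roughly, the ordering being described (a sketch based on my reading of
the current code; it assumes the fput() from kvm_gmem_unbind() drops
the last reference to the guest_memfd file):

  kvm_destroy_vm()
    kvm_arch_pre_destroy_vm()
      tdx_mmu_release_hkid()      <- HKID released, TD_TEARDOWN entered
    ...
    kvm_free_memslot()
      kvm_gmem_unbind()
        fput()
          kvm_gmem_release()      <- pages eligible for fast reclaim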

Paolo

> == Background ==
>
> TDX has 2 methods for the host to reclaim guest private memory, depending
> on whether the TD (TDX VM) is in a runnable state or not. These are
> called, respectively:
> 1. Dynamic Page Removal
> 2. Reclaiming TD Pages in TD_TEARDOWN State
>
> Dynamic Page Removal is much slower. Reclaiming a 4K page in TD_TEARDOWN
> state can be 5 times faster, although that is just one part of TD shutdown.
>
> == Relevant TDX Architecture ==
>
> Dynamic Page Removal is slow because it potentially has to deal with a
> running TD, and so involves a number of steps (sketched below in
> SEAMCALL terms):
> Block further address translation
> Exit each VCPU
> Clear Secure EPT entry
> Flush/write-back/invalidate relevant caches
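>
> In terms of the TDX Module interface, these steps map roughly onto the
> following SEAMCALL sequence (a hedged sketch; the exact flow depends on
> the TDX Module version and the KVM implementation):
>
>     TDH.MEM.RANGE.BLOCK      block further address translation
>     (kick each VCPU out of the TD)
>     TDH.MEM.TRACK            advance the TD's TLB tracking epoch
>     TDH.MEM.PAGE.REMOVE      clear the Secure EPT entry
>     TDH.PHYMEM.PAGE.WBINVD   flush/write-back/invalidate caches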
>
> Reclaiming TD Pages in TD_TEARDOWN State is fast because the shutdown
> procedure (refer tdx_mmu_release_hkid()) has already performed the relevant
> flushing. For details, see TDX Module Base Spec October 2024 sections:
>
> 7.5. TD Teardown Sequence
> 5.6.3. TD Keys Reclamation, TLB and Cache Flush
>
> Essentially all that remains then is to take each page away from the
> TDX Module and return it to the kernel.
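>
> Per page, that amounts to roughly the following (a sketch based on
> tdx_reclaim_page(); the clearing step is an assumption about how
> integrity-poisoned cache lines are avoided):
>
>     TDH.PHYMEM.PAGE.RECLAIM   take the page back from the TDX Module
>     clear the page            e.g. MOVDIR64B direct stores
>     free/unpin the page       return it to the kernel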
>
> == Problem ==
>
> Currently, Dynamic Page Removal is being used when the TD is being shut
> down, for the sake of having simpler initial code.
>
> This happens when guest_memfds are closed, refer kvm_gmem_release().
> guest_memfds hold a reference to struct kvm, so VM destruction cannot
> happen until after they are released.
>
> Reclaiming TD Pages in TD_TEARDOWN State was seen to decrease the total
> reclaim time. For example:
>
>  VCPUs   Size (GB)   Before (secs)   After (secs)
>      4          18              72             24
>     32         107             517            134
>
> Note, the V19 patch set:
>
> https://lore.kernel.org/all/cover.1708933498.git.isaku.yamahata@intel.com/
>
> did not have this issue because the HKID was released early, something that
> Sean effectively NAK'ed:
>
> "No, the right answer is to not release the HKID until the VM is
> destroyed."
>
> https://lore.kernel.org/all/ZN+1QHGa6ltpQxZn@google.com/
>
> That was taken on board in the "TDX MMU Part 2" patch set. Refer
> "Moving of tdx_mmu_release_hkid()" in:
>
> https://lore.kernel.org/kvm/20240904030751.117579-1-rick.p.edgecombe@intel.com/
>
> == Options ==
>
> 1. Start TD teardown earlier so that when pages are removed,
> they can be reclaimed faster.
> 2. Defer page removal until after TD teardown has started.
> 3. A combination of 1 and 2.
>
> Option 1 is problematic because it means putting the TD into a non-runnable
> state while it is potentially still active. Also, as mentioned above, Sean
> effectively NAK'ed it.
>
> Option 2 is possible because the lifetime of guest memory pages is separate
> from guest_memfd (struct kvm_gmem) lifetime.
>
> A reference is taken on each page when it is faulted in, refer
> kvm_gmem_get_pfn(). That reference is not released until the page is
> removed from the mirror SEPT, refer tdx_unpin().
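>
> Schematically (names as in the current code):
>
>     kvm_gmem_get_pfn()   ->  page reference taken at fault-in
>     tdx_unpin()          ->  reference dropped when the mirror SEPT
>                              entry is removed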
>
> Option 3 is not needed because TD teardown begins during VM destruction
> before pages are reclaimed. TD_TEARDOWN state is entered by
> tdx_mmu_release_hkid(), whereas pages are reclaimed by tdp_mmu_zap_root(),
> as follows:
>
> kvm_arch_destroy_vm()
> ...
> vt_vm_pre_destroy()
> tdx_mmu_release_hkid()
> ...
> kvm_mmu_uninit_vm()
> kvm_mmu_uninit_tdp_mmu()
> kvm_tdp_mmu_invalidate_roots()
> kvm_tdp_mmu_zap_invalidated_roots()
> tdp_mmu_zap_root()
>
> == Proof of Concept for option 2 ==
>
> Assume user space never needs to close a guest_memfd except as part of VM
> shutdown.
>
> Add a callback from kvm_gmem_release() to decide whether to defer removal.
> For TDX, record the inode (up to a max. of 64 inodes) and pin it.
>
> Amend the release of guest_memfds to skip removing pages from the MMU
> in that case.
>
> Amend TDX private memory page removal to detect TD_TEARDOWN state, and
> reclaim the page accordingly.
>
> For TDX, finally unpin any pinned inodes.
>
> This hopefully illustrates what needs to be done, but guidance is sought
> for the best way to do it.
>
> Signed-off-by: Adrian Hunter <adrian.hunter@...el.com>
> ---
> arch/x86/include/asm/kvm-x86-ops.h | 1 +
> arch/x86/include/asm/kvm_host.h | 3 ++
> arch/x86/kvm/Kconfig | 1 +
> arch/x86/kvm/vmx/main.c | 12 +++++++-
> arch/x86/kvm/vmx/tdx.c | 47 +++++++++++++++++++++++++-----
> arch/x86/kvm/vmx/tdx.h | 14 +++++++++
> arch/x86/kvm/vmx/x86_ops.h | 2 ++
> arch/x86/kvm/x86.c | 7 +++++
> include/linux/kvm_host.h | 5 ++++
> virt/kvm/Kconfig | 4 +++
> virt/kvm/guest_memfd.c | 26 ++++++++++++-----
> 11 files changed, 107 insertions(+), 15 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
> index 79406bf07a1c..e4728f1fe646 100644
> --- a/arch/x86/include/asm/kvm-x86-ops.h
> +++ b/arch/x86/include/asm/kvm-x86-ops.h
> @@ -148,6 +148,7 @@ KVM_X86_OP_OPTIONAL(alloc_apic_backing_page)
> KVM_X86_OP_OPTIONAL_RET0(gmem_prepare)
> KVM_X86_OP_OPTIONAL_RET0(private_max_mapping_level)
> KVM_X86_OP_OPTIONAL(gmem_invalidate)
> +KVM_X86_OP_OPTIONAL_RET0(gmem_defer_removal)
>
> #undef KVM_X86_OP
> #undef KVM_X86_OP_OPTIONAL
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 9b9dde476f3c..d1afb4e1c2ee 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1661,6 +1661,8 @@ static inline u16 kvm_lapic_irq_dest_mode(bool dest_mode_logical)
> return dest_mode_logical ? APIC_DEST_LOGICAL : APIC_DEST_PHYSICAL;
> }
>
> +struct inode;
> +
> struct kvm_x86_ops {
> const char *name;
>
> @@ -1888,6 +1890,7 @@ struct kvm_x86_ops {
> int (*gmem_prepare)(struct kvm *kvm, kvm_pfn_t pfn, gfn_t gfn, int max_order);
> void (*gmem_invalidate)(kvm_pfn_t start, kvm_pfn_t end);
> int (*private_max_mapping_level)(struct kvm *kvm, kvm_pfn_t pfn);
> + int (*gmem_defer_removal)(struct kvm *kvm, struct inode *inode);
> };
>
> struct kvm_x86_nested_ops {
> diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
> index 0d445a317f61..32c4b9922e7b 100644
> --- a/arch/x86/kvm/Kconfig
> +++ b/arch/x86/kvm/Kconfig
> @@ -96,6 +96,7 @@ config KVM_INTEL
> depends on KVM && IA32_FEAT_CTL
> select KVM_GENERIC_PRIVATE_MEM if INTEL_TDX_HOST
> select KVM_GENERIC_MEMORY_ATTRIBUTES if INTEL_TDX_HOST
> + select HAVE_KVM_ARCH_GMEM_DEFER_REMOVAL if INTEL_TDX_HOST
> help
> Provides support for KVM on processors equipped with Intel's VT
> extensions, a.k.a. Virtual Machine Extensions (VMX).
> diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
> index 94d5d907d37b..b835006e1282 100644
> --- a/arch/x86/kvm/vmx/main.c
> +++ b/arch/x86/kvm/vmx/main.c
> @@ -888,6 +888,14 @@ static int vt_gmem_private_max_mapping_level(struct kvm *kvm, kvm_pfn_t pfn)
> return 0;
> }
>
> +static int vt_gmem_defer_removal(struct kvm *kvm, struct inode *inode)
> +{
> + if (is_td(kvm))
> + return tdx_gmem_defer_removal(kvm, inode);
> +
> + return 0;
> +}
> +
> #define VMX_REQUIRED_APICV_INHIBITS \
> (BIT(APICV_INHIBIT_REASON_DISABLED) | \
> BIT(APICV_INHIBIT_REASON_ABSENT) | \
> @@ -1046,7 +1054,9 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
> .mem_enc_ioctl = vt_mem_enc_ioctl,
> .vcpu_mem_enc_ioctl = vt_vcpu_mem_enc_ioctl,
>
> - .private_max_mapping_level = vt_gmem_private_max_mapping_level
> + .private_max_mapping_level = vt_gmem_private_max_mapping_level,
> +
> + .gmem_defer_removal = vt_gmem_defer_removal,
> };
>
> struct kvm_x86_init_ops vt_init_ops __initdata = {
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index d9eb20516c71..51bbb44ac1bd 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -5,6 +5,7 @@
> #include <asm/fpu/xcr.h>
> #include <linux/misc_cgroup.h>
> #include <linux/mmu_context.h>
> +#include <linux/fs.h>
> #include <asm/tdx.h>
> #include "capabilities.h"
> #include "mmu.h"
> @@ -594,10 +595,20 @@ static void tdx_reclaim_td_control_pages(struct kvm *kvm)
> kvm_tdx->td.tdr_page = NULL;
> }
>
> +static void tdx_unpin_inodes(struct kvm *kvm)
> +{
> + struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> +
> + for (int i = 0; i < kvm_tdx->nr_gmem_inodes; i++)
> + iput(kvm_tdx->gmem_inodes[i]);
> +}
> +
> void tdx_vm_destroy(struct kvm *kvm)
> {
> struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
>
> + tdx_unpin_inodes(kvm);
> +
> tdx_reclaim_td_control_pages(kvm);
>
> kvm_tdx->state = TD_STATE_UNINITIALIZED;
> @@ -1778,19 +1789,28 @@ int tdx_sept_free_private_spt(struct kvm *kvm, gfn_t gfn,
> return tdx_reclaim_page(virt_to_page(private_spt));
> }
>
> +static int tdx_sept_teardown_private_spte(struct kvm *kvm, enum pg_level level, struct page *page)
> +{
> + int ret;
> +
> + if (level != PG_LEVEL_4K)
> + return -EINVAL;
> +
> + ret = tdx_reclaim_page(page);
> + if (!ret)
> + tdx_unpin(kvm, page);
> +
> + return ret;
> +}
> +
> int tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn,
> enum pg_level level, kvm_pfn_t pfn)
> {
> struct page *page = pfn_to_page(pfn);
> int ret;
>
> - /*
> - * HKID is released after all private pages have been removed, and set
> - * before any might be populated. Warn if zapping is attempted when
> - * there can't be anything populated in the private EPT.
> - */
> - if (KVM_BUG_ON(!is_hkid_assigned(to_kvm_tdx(kvm)), kvm))
> - return -EINVAL;
> + if (!is_hkid_assigned(to_kvm_tdx(kvm)))
> +		return tdx_sept_teardown_private_spte(kvm, level, page);
>
> ret = tdx_sept_zap_private_spte(kvm, gfn, level, page);
> if (ret <= 0)
> @@ -3221,6 +3241,19 @@ int tdx_gmem_private_max_mapping_level(struct kvm *kvm, kvm_pfn_t pfn)
> return PG_LEVEL_4K;
> }
>
> +int tdx_gmem_defer_removal(struct kvm *kvm, struct inode *inode)
> +{
> + struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> +
> + if (kvm_tdx->nr_gmem_inodes >= TDX_MAX_GMEM_INODES)
> + return 0;
> +
> + kvm_tdx->gmem_inodes[kvm_tdx->nr_gmem_inodes++] = inode;
> + ihold(inode);
> +
> + return 1;
> +}
> +
> static int tdx_online_cpu(unsigned int cpu)
> {
> unsigned long flags;
> diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
> index 6b3bebebabfa..fb5c4face131 100644
> --- a/arch/x86/kvm/vmx/tdx.h
> +++ b/arch/x86/kvm/vmx/tdx.h
> @@ -20,6 +20,10 @@ enum kvm_tdx_state {
> TD_STATE_RUNNABLE,
> };
>
> +struct inode;
> +
> +#define TDX_MAX_GMEM_INODES 64
> +
> struct kvm_tdx {
> struct kvm kvm;
>
> @@ -43,6 +47,16 @@ struct kvm_tdx {
> * Set/unset is protected with kvm->mmu_lock.
> */
> bool wait_for_sept_zap;
> +
> + /*
> + * For pages that will not be removed until TD shutdown, the associated
> + * guest_memfd inode is pinned. Allow for a fixed number of pinned
> + * inodes. If there are more, then when the guest_memfd is closed,
> + * their pages will be removed safely but inefficiently prior to
> + * shutdown.
> + */
> + struct inode *gmem_inodes[TDX_MAX_GMEM_INODES];
> + int nr_gmem_inodes;
> };
>
> /* TDX module vCPU states */
> diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
> index 4704bed033b1..4ee123289d85 100644
> --- a/arch/x86/kvm/vmx/x86_ops.h
> +++ b/arch/x86/kvm/vmx/x86_ops.h
> @@ -164,6 +164,7 @@ void tdx_flush_tlb_current(struct kvm_vcpu *vcpu);
> void tdx_flush_tlb_all(struct kvm_vcpu *vcpu);
> void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int root_level);
> int tdx_gmem_private_max_mapping_level(struct kvm *kvm, kvm_pfn_t pfn);
> +int tdx_gmem_defer_removal(struct kvm *kvm, struct inode *inode);
> #else
> static inline void tdx_disable_virtualization_cpu(void) {}
> static inline int tdx_vm_init(struct kvm *kvm) { return -EOPNOTSUPP; }
> @@ -229,6 +230,7 @@ static inline void tdx_flush_tlb_current(struct kvm_vcpu *vcpu) {}
> static inline void tdx_flush_tlb_all(struct kvm_vcpu *vcpu) {}
> static inline void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int root_level) {}
> static inline int tdx_gmem_private_max_mapping_level(struct kvm *kvm, kvm_pfn_t pfn) { return 0; }
> +static inline int tdx_gmem_defer_removal(struct kvm *kvm, struct inode *inode) { return 0; }
> #endif
>
> #endif /* __KVM_X86_VMX_X86_OPS_H */
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 03db366e794a..96ebf0303223 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -13659,6 +13659,13 @@ void kvm_arch_gmem_invalidate(kvm_pfn_t start, kvm_pfn_t end)
> }
> #endif
>
> +#ifdef CONFIG_HAVE_KVM_ARCH_GMEM_DEFER_REMOVAL
> +bool kvm_arch_gmem_defer_removal(struct kvm *kvm, struct inode *inode)
> +{
> + return kvm_x86_call(gmem_defer_removal)(kvm, inode);
> +}
> +#endif
> +
> int kvm_spec_ctrl_test_value(u64 value)
> {
> /*
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index d72ba0a4ca0e..f5c8b1923c24 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -2534,6 +2534,11 @@ static inline int kvm_gmem_get_pfn(struct kvm *kvm,
> int kvm_arch_gmem_prepare(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn, int max_order);
> #endif
>
> +#ifdef CONFIG_HAVE_KVM_ARCH_GMEM_DEFER_REMOVAL
> +struct inode;
> +bool kvm_arch_gmem_defer_removal(struct kvm *kvm, struct inode *inode);
> +#endif
> +
> #ifdef CONFIG_KVM_GENERIC_PRIVATE_MEM
> /**
> * kvm_gmem_populate() - Populate/prepare a GPA range with guest data
> diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
> index 54e959e7d68f..589111505ad0 100644
> --- a/virt/kvm/Kconfig
> +++ b/virt/kvm/Kconfig
> @@ -124,3 +124,7 @@ config HAVE_KVM_ARCH_GMEM_PREPARE
> config HAVE_KVM_ARCH_GMEM_INVALIDATE
> bool
> depends on KVM_PRIVATE_MEM
> +
> +config HAVE_KVM_ARCH_GMEM_DEFER_REMOVAL
> + bool
> + depends on KVM_PRIVATE_MEM
> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> index b2aa6bf24d3a..cd485f45fdaf 100644
> --- a/virt/kvm/guest_memfd.c
> +++ b/virt/kvm/guest_memfd.c
> @@ -248,6 +248,15 @@ static long kvm_gmem_fallocate(struct file *file, int mode, loff_t offset,
> return ret;
> }
>
> +static inline bool kvm_gmem_defer_removal(struct kvm *kvm, struct inode *inode)
> +{
> +#ifdef CONFIG_HAVE_KVM_ARCH_GMEM_DEFER_REMOVAL
> + return kvm_arch_gmem_defer_removal(kvm, inode);
> +#else
> + return false;
> +#endif
> +}
> +
> static int kvm_gmem_release(struct inode *inode, struct file *file)
> {
> struct kvm_gmem *gmem = file->private_data;
> @@ -275,13 +284,16 @@ static int kvm_gmem_release(struct inode *inode, struct file *file)
> xa_for_each(&gmem->bindings, index, slot)
> WRITE_ONCE(slot->gmem.file, NULL);
>
> - /*
> - * All in-flight operations are gone and new bindings can be created.
> - * Zap all SPTEs pointed at by this file. Do not free the backing
> - * memory, as its lifetime is associated with the inode, not the file.
> - */
> - kvm_gmem_invalidate_begin(gmem, 0, -1ul);
> - kvm_gmem_invalidate_end(gmem, 0, -1ul);
> + if (!kvm_gmem_defer_removal(kvm, inode)) {
> + /*
> + * All in-flight operations are gone and new bindings can be
> + * created. Zap all SPTEs pointed at by this file. Do not free
> + * the backing memory, as its lifetime is associated with the
> + * inode, not the file.
> + */
> + kvm_gmem_invalidate_begin(gmem, 0, -1ul);
> + kvm_gmem_invalidate_end(gmem, 0, -1ul);
> + }
>
> list_del(&gmem->entry);
>
> --
> 2.43.0
>