linux-kernel - Re: [PATCH RFC] KVM: TDX: Defer guest memory removal to decrease shutdown time

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <d7e220ab-3000-408b-9dd6-0e7ee06d79ec@intel.com>
Date: Fri, 28 Mar 2025 12:26:19 +0200
From: Adrian Hunter <adrian.hunter@...el.com>
To: Sean Christopherson <seanjc@...gle.com>
CC: <pbonzini@...hat.com>, <kvm@...r.kernel.org>,
	<rick.p.edgecombe@...el.com>, <kirill.shutemov@...ux.intel.com>,
	<kai.huang@...el.com>, <reinette.chatre@...el.com>, <xiaoyao.li@...el.com>,
	<tony.lindgren@...ux.intel.com>, <binbin.wu@...ux.intel.com>,
	<isaku.yamahata@...el.com>, <linux-kernel@...r.kernel.org>,
	<yan.y.zhao@...el.com>, <chao.gao@...el.com>
Subject: Re: [PATCH RFC] KVM: TDX: Defer guest memory removal to decrease
 shutdown time

On 27/03/25 17:54, Sean Christopherson wrote:
> On Thu, Mar 13, 2025, Adrian Hunter wrote:
>> Improve TDX shutdown performance by adding a more efficient shutdown
>> operation at the cost of adding separate branches for the TDX MMU
>> operations for normal runtime and shutdown.  This more efficient method was
>> previously used in earlier versions of the TDX patches, but was removed to
>> simplify the initial upstreaming.  This is an RFC, and still needs a proper
>> upstream commit log. It is intended to be an eventual follow up to base
>> support.
> 
> ...
> 
>> == Options ==
>>
>>   1. Start TD teardown earlier so that when pages are removed,
>>   they can be reclaimed faster.
>>   2. Defer page removal until after TD teardown has started.
>>   3. A combination of 1 and 2.
>>
>> Option 1 is problematic because it means putting the TD into a non-runnable
>> state while it is potentially still active. Also, as mentioned above, Sean
>> effectively NAK'ed it.
> 
> Option 2 is just as gross, arguably even worse.  I NAK'd a flavor of option 1,
> not the base concept of initiating teardown before all references to the VM are
> put.
> 
> AFAICT, nothing outright prevents adding a TDX sub-ioctl to terminate the VM.
> The locking is a bit heinous, but I would prefer heavy locking to deferring
> reclaim and pinning inodes.
> 
> Oh FFS.  This is also an opportunity to cleanup RISC-V's insidious copy-paste of
> ARM.  Because extracting (un)lock_all_vcpus() to common code would have been sooo
> hard.  *sigh*
> 
> Very roughly, something like the below (*completely* untested).

Thank you!  I will give it a test.

> 
> An alternative to taking mmu_lock would be to lock all bound guest_memfds, but I
> think I prefer taking mmu_lock is it's easier to reason about the safety of freeing
> the HKID.  Note, the truncation phase of a PUNCH_HOLE could still run in parallel,
> but that's a-ok.  The only part of PUNCH_HOLE that needs to be blocked is the call
> to kvm_mmu_unmap_gfn_range().
> 
> ---
>  arch/x86/kvm/vmx/tdx.c | 61 ++++++++++++++++++++++++++++++------------
>  1 file changed, 44 insertions(+), 17 deletions(-)
> 
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index 87f188021cbd..6fb595c272ab 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -472,7 +472,7 @@ static void smp_func_do_phymem_cache_wb(void *unused)
>  		pr_tdx_error(TDH_PHYMEM_CACHE_WB, err);
>  }
>  
> -void tdx_mmu_release_hkid(struct kvm *kvm)
> +static void __tdx_release_hkid(struct kvm *kvm, bool terminate)
>  {
>  	bool packages_allocated, targets_allocated;
>  	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> @@ -485,10 +485,11 @@ void tdx_mmu_release_hkid(struct kvm *kvm)
>  	if (!is_hkid_assigned(kvm_tdx))
>  		return;
>  
> +	if (KVM_BUG_ON(refcount_read(&kvm->users_count) && !terminate))
> +		return;
> +
>  	packages_allocated = zalloc_cpumask_var(&packages, GFP_KERNEL);
>  	targets_allocated = zalloc_cpumask_var(&targets, GFP_KERNEL);
> -	cpus_read_lock();
> -
>  	kvm_for_each_vcpu(j, vcpu, kvm)
>  		tdx_flush_vp_on_cpu(vcpu);
>  
> @@ -500,12 +501,8 @@ void tdx_mmu_release_hkid(struct kvm *kvm)
>  	 */
>  	mutex_lock(&tdx_lock);
>  
> -	/*
> -	 * Releasing HKID is in vm_destroy().
> -	 * After the above flushing vps, there should be no more vCPU
> -	 * associations, as all vCPU fds have been released at this stage.
> -	 */
>  	err = tdh_mng_vpflushdone(&kvm_tdx->td);
> +	/* Uh, what's going on here? */
>  	if (err == TDX_FLUSHVP_NOT_DONE)
>  		goto out;

Looks like it should not be needed.

First introduced here:

	[PATCH v16 071/116] KVM: TDX: handle vcpu migration over logical processor
	https://lore.kernel.org/af6cb5e6abae5a19d6d2eeb452d29a96233e5a44.1697471314.git.isaku.yamahata@intel.com

Explained a bit

	Re: [PATCH v17 071/116] KVM: TDX: handle vcpu migration over logical processor
	https://lore.kernel.org/20231117080804.GF1277973@ls.amr.corp.intel.com

And explained a bit more:

	Re: [PATCH v19 087/130] KVM: TDX: handle vcpu migration over logical processor
	https://lore.kernel.org/all/20240415224828.GS3039520@ls.amr.corp.intel.com/

>  	if (KVM_BUG_ON(err, kvm)) {
> @@ -515,6 +512,7 @@ void tdx_mmu_release_hkid(struct kvm *kvm)
>  		goto out;
>  	}
>  
> +	write_lock(&kvm->mmu_lock);
>  	for_each_online_cpu(i) {
>  		if (packages_allocated &&
>  		    cpumask_test_and_set_cpu(topology_physical_package_id(i),
> @@ -539,14 +537,21 @@ void tdx_mmu_release_hkid(struct kvm *kvm)
>  	} else {
>  		tdx_hkid_free(kvm_tdx);
>  	}
> -
> +	write_unlock(&kvm->mmu_lock);
>  out:
>  	mutex_unlock(&tdx_lock);
> -	cpus_read_unlock();
>  	free_cpumask_var(targets);
>  	free_cpumask_var(packages);
>  }
>  
> +void tdx_mmu_release_hkid(struct kvm *kvm)
> +{
> +	cpus_read_lock();
> +	__tdx_release_hkid(kvm, false);
> +	cpus_read_unlock();
> +}
> +
> +
>  static void tdx_reclaim_td_control_pages(struct kvm *kvm)
>  {
>  	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> @@ -1789,13 +1794,10 @@ int tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn,
>  	struct page *page = pfn_to_page(pfn);
>  	int ret;
>  
> -	/*
> -	 * HKID is released after all private pages have been removed, and set
> -	 * before any might be populated. Warn if zapping is attempted when
> -	 * there can't be anything populated in the private EPT.
> -	 */
> -	if (KVM_BUG_ON(!is_hkid_assigned(to_kvm_tdx(kvm)), kvm))
> -		return -EINVAL;
> +	if (!is_hkid_assigned(to_kvm_tdx(kvm)), kvm) {
> +		WARN_ON_ONCE(!kvm->vm_dead);
> +		return tdx_reclaim_page(pfn_to_page(pfn));
> +	}
>  
>  	ret = tdx_sept_zap_private_spte(kvm, gfn, level, page);
>  	if (ret <= 0)
> @@ -2790,6 +2792,28 @@ static int tdx_td_finalize(struct kvm *kvm, struct kvm_tdx_cmd *cmd)
>  	return 0;
>  }
>  
> +static int tdx_td_terminate(struct kvm *kvm)
> +{
> +	struct kvm_memory_slot *slot;
> +	struct kvm_memslots *slots;
> +	int bkt;
> +
> +	cpus_read_lock();
> +	guard(mutex)(&kvm->lock);
> +
> +	r = kvm_lock_all_vcpus();
> +	if (r)
> +		goto out;
> +
> +	kvm_vm_dead(kvm);
> +	kvm_unlock_all_vcpus();
> +
> +	__tdx_release_hkid(kvm);
> +out:
> +	cpus_read_unlock();
> +	return r;
> +}
> +
>  int tdx_vm_ioctl(struct kvm *kvm, void __user *argp)
>  {
>  	struct kvm_tdx_cmd tdx_cmd;
> @@ -2805,6 +2829,9 @@ int tdx_vm_ioctl(struct kvm *kvm, void __user *argp)
>  	if (tdx_cmd.hw_error)
>  		return -EINVAL;
>  
> +	if (tdx_cmd.id == KVM_TDX_TERMINATE_VM)
> +		return tdx_td_terminate(kvm);
> +
>  	mutex_lock(&kvm->lock);
>  
>  	switch (tdx_cmd.id) {
> 
> base-commit: 2156c3c7d60c5be9c0d9ab1fedccffe3c55a2ca0