Message-ID: <910152f4-22b4-4b8d-b3e4-8e044a4d73c9@intel.com>
Date: Tue, 22 Apr 2025 11:13:33 +0300
From: Adrian Hunter <adrian.hunter@...el.com>
To: Sean Christopherson <seanjc@...gle.com>
CC: <pbonzini@...hat.com>, <mlevitsk@...hat.com>, <kvm@...r.kernel.org>,
<rick.p.edgecombe@...el.com>, <kirill.shutemov@...ux.intel.com>,
<kai.huang@...el.com>, <reinette.chatre@...el.com>, <xiaoyao.li@...el.com>,
<tony.lindgren@...ux.intel.com>, <binbin.wu@...ux.intel.com>,
<isaku.yamahata@...el.com>, <linux-kernel@...r.kernel.org>,
<yan.y.zhao@...el.com>, <chao.gao@...el.com>
Subject: Re: [PATCH V2 1/1] KVM: TDX: Add sub-ioctl KVM_TDX_TERMINATE_VM
On 19/04/25 04:12, Sean Christopherson wrote:
> On Thu, Apr 17, 2025, Adrian Hunter wrote:
>> From: Sean Christopherson <seanjc@...gle.com>
>>
>> Add sub-ioctl KVM_TDX_TERMINATE_VM to release the HKID prior to shutdown,
>> which enables more efficient reclaim of private memory.
>>
>> Private memory is removed from MMU/TDP when guest_memfds are closed. If
>> the HKID has not been released, the TDX VM is still in RUNNABLE state,
>> so pages must be removed using "Dynamic Page Removal" procedure (refer
>> TDX Module Base spec) which involves a number of steps:
>> Block further address translation
>> Exit each VCPU
>> Clear Secure EPT entry
>> Flush/write-back/invalidate relevant caches
>>
>> However, when the HKID is released, the TDX VM moves to TD_TEARDOWN state
>> where all TDX VM pages are effectively unmapped, so pages can be reclaimed
>> directly.
>>
>> Reclaiming TD Pages in TD_TEARDOWN State was seen to decrease the total
>> reclaim time. For example:
>>
>>  VCPUs  Size (GB)  Before (secs)  After (secs)
>>      4         18             72            24
>>     32        107            517           134
>>     64        400           5539           467
>>
>> [Adrian: wrote commit message, added KVM_TDX_TERMINATE_VM documentation,
>> and moved cpus_read_lock() inside kvm->lock for consistency as reported
>> by lockdep]
>
> /facepalm
>
> I over-thought this. We've had a long-standing battle with kvm_lock vs.
> cpus_read_lock(), but this is kvm->lock, not kvm_lock. /sigh
>
>> +static int tdx_terminate_vm(struct kvm *kvm)
>> +{
>> + int r = 0;
>> +
>> + guard(mutex)(&kvm->lock);
>
> With kvm->lock taken outside cpus_read_lock(), just handle KVM_TDX_TERMINATE_VM
> in the switch statement, i.e. let tdx_vm_ioctl() deal with kvm->lock.
Ok, also cpus_read_lock() can go back where it was in __tdx_release_hkid().
But also in __tdx_release_hkid() there is:

	if (KVM_BUG_ON(refcount_read(&kvm->users_count) && !terminate, kvm))
		return;

However, __tdx_td_init() calls tdx_mmu_release_hkid() on its error path
while users_count is still elevated, so that check is not correct.
>
>> + cpus_read_lock();
>> +
>> + if (!kvm_trylock_all_vcpus(kvm)) {
>> + r = -EBUSY;
>> + goto out;
>> + }
>> +
>> + kvm_vm_dead(kvm);
>> + kvm_unlock_all_vcpus(kvm);
>> +
>> + __tdx_release_hkid(kvm, true);
>> +out:
>> + cpus_read_unlock();
>> + return r;
>> +}
>> +
>> int tdx_vm_ioctl(struct kvm *kvm, void __user *argp)
>> {
>> struct kvm_tdx_cmd tdx_cmd;
>> @@ -2805,6 +2827,9 @@ int tdx_vm_ioctl(struct kvm *kvm, void __user *argp)
>> if (tdx_cmd.hw_error)
>> return -EINVAL;
>>
>> + if (tdx_cmd.id == KVM_TDX_TERMINATE_VM)
>> + return tdx_terminate_vm(kvm);
>> +
>> mutex_lock(&kvm->lock);
>>
>> switch (tdx_cmd.id) {
>> --
>> 2.43.0
>>