linux-kernel - Re: [PATCH V4 1/1] KVM: TDX: Add sub-ioctl KVM_TDX_TERMINATE

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <4b6918e4-adba-48b2-931c-4d428a2775fc@intel.com>
Date: Thu, 19 Jun 2025 14:12:37 +0300
From: Adrian Hunter <adrian.hunter@...el.com>
To: Sean Christopherson <seanjc@...gle.com>
CC: Vishal Annapurve <vannapurve@...gle.com>, <pbonzini@...hat.com>,
	<kvm@...r.kernel.org>, <rick.p.edgecombe@...el.com>,
	<kirill.shutemov@...ux.intel.com>, <kai.huang@...el.com>,
	<reinette.chatre@...el.com>, <xiaoyao.li@...el.com>,
	<tony.lindgren@...ux.intel.com>, <binbin.wu@...ux.intel.com>,
	<isaku.yamahata@...el.com>, <linux-kernel@...r.kernel.org>,
	<yan.y.zhao@...el.com>, <chao.gao@...el.com>
Subject: Re: [PATCH V4 1/1] KVM: TDX: Add sub-ioctl KVM_TDX_TERMINATE_VM

On 19/06/2025 03:33, Sean Christopherson wrote:
> On Wed, Jun 18, 2025, Adrian Hunter wrote:
>> On 18/06/2025 09:00, Vishal Annapurve wrote:
>>> On Tue, Jun 17, 2025 at 10:50 PM Adrian Hunter <adrian.hunter@...el.com> wrote:
>>>>> Ability to clean up memslots from userspace without closing
>>>>> VM/guest_memfd handles is useful to keep reusing the same guest_memfds
>>>>> for the next boot iteration of the VM in case of reboot.
>>>>
>>>> TD lifecycle does not include reboot.  In other words, reboot is
>>>> done by shutting down the TD and then starting again with a new TD.
>>>>
>>>> AFAIK it is not currently possible to shut down without closing
>>>> guest_memfds since the guest_memfd holds a reference (users_count)
>>>> to struct kvm, and destruction begins when users_count hits zero.
>>>>
>>>
>>> gmem link support[1] allows associating existing guest_memfds with new
>>> VM instances.
>>>
>>> Breakdown of the userspace VMM flow:
>>> 1) Create a new VM instance before closing guest_memfd files.
>>> 2) Link existing guest_memfd files with the new VM instance. -> This
>>> creates new set of files backed by the same inode but associated with
>>> the new VM instance.
>>
>> So what about:
>>
>> 2.5) Call KVM_TDX_TERMINATE_VM IOCTL
>>
>> Memory reclaimed after KVM_TDX_TERMINATE_VM will be done efficiently,
>> so avoid causing it to be reclaimed earlier.
> 
> The problem is that setting kvm->vm_dead will prevent (3) from succeeding.  If
> kvm->vm_dead is set, KVM will reject all vCPU, VM, and device (not /dev/kvm the
> device, but rather devices bound to the VM) ioctls.

(3) is "Close the older guest memfd handles -> results in older VM instance cleanup."

close() is not an IOCTL, so I do not understand.

> 
> I intended that behavior, e.g. to guard against userspace blowing up KVM because
> the hkid was released, I just didn't consider the memslots angle.

The patch was tested with QEMU which AFAICT does not touch  memslots when
shutting down.  Is there a reason to?

Obviously memslots still need to be freed which is done by kvm_destroy_vm().

> 
> The other thing I didn't consider at the time, is that vm_dead doesn't fully
> protect against ioctls that are already in flight.  E.g. see commit
> 17c80d19df6f ("KVM: SVM: Reject SEV{-ES} intra host migration if vCPU creation
> is in-flight").  Though without a failure/splat of some kind, it's impossible to
> to know if this is actually a problem.  I.e. I don't think we should *entirely*
> scrap blocking ioctls just because it *might* not be perfect (we can always make
> KVM better).
> 
> I can think of a few options:
> 
>  1. Skip the vm_dead check if userspace is deleting a memslot.
>  2. Provide a way for userspace to delete all memslots, and have that bypass
>     vm_dead.
>  3. Delete all memslots on  KVM_TDX_TERMINATE_VM.
>  4. Remove vm_dead and instead reject ioctls based on vm_bugged, and simply rely
>     on KVM_REQ_VM_DEAD to prevent running the guest.  I.e. tweak kvm_vm_dead()
>     to be that it only prevents running the VM, and have kvm_vm_bugged() be the
>     "something is badly broken, try to limit the damage".
> 
> I'm heavily leaning toward #4.  #1 is doable, but painful.  #2 is basically #1,
> but with new uAPI.  #3 is all kinds of gross, e.g. userspace might want to simply
> kill the VM and move on.  KVM would still block ioctls, but only if a bug was
> detected.  And the few use cases where KVM just wants to prevent entering the
> guest won't prevent gracefully tearing down the VM.
> 
> Hah!  And there's actually a TDX bug fix here, because "checking" for KVM_REQ_VM_DEAD
> in kvm_tdp_map_page() and tdx_handle_ept_violation() will *clear* the request,
> which isn't what we want, e.g. a vCPU could actually re-enter the guest at that
> point.
> 
> This is what I'm thinking.  If I don't hate it come Friday (or Monday), I'll turn
> this patch into a mini-series and post v5.
> 
> Adrian, does that work for you?

Might need some more checks - I will have to look more closely.

> 
> ---
>  arch/arm64/kvm/arm.c            |  2 +-
>  arch/arm64/kvm/vgic/vgic-init.c |  2 +-
>  arch/x86/kvm/x86.c              |  2 +-
>  include/linux/kvm_host.h        |  2 --
>  virt/kvm/kvm_main.c             | 10 +++++-----
>  5 files changed, 8 insertions(+), 10 deletions(-)
> 
> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> index de2b4e9c9f9f..18bd80388b59 100644
> --- a/arch/arm64/kvm/arm.c
> +++ b/arch/arm64/kvm/arm.c
> @@ -1017,7 +1017,7 @@ static int kvm_vcpu_suspend(struct kvm_vcpu *vcpu)
>  static int check_vcpu_requests(struct kvm_vcpu *vcpu)
>  {
>  	if (kvm_request_pending(vcpu)) {
> -		if (kvm_check_request(KVM_REQ_VM_DEAD, vcpu))
> +		if (kvm_test_request(KVM_REQ_VM_DEAD, vcpu))
>  			return -EIO;
>  
>  		if (kvm_check_request(KVM_REQ_SLEEP, vcpu))
> diff --git a/arch/arm64/kvm/vgic/vgic-init.c b/arch/arm64/kvm/vgic/vgic-init.c
> index eb1205654ac8..c2033bae73b2 100644
> --- a/arch/arm64/kvm/vgic/vgic-init.c
> +++ b/arch/arm64/kvm/vgic/vgic-init.c
> @@ -612,7 +612,7 @@ int kvm_vgic_map_resources(struct kvm *kvm)
>  	mutex_unlock(&kvm->arch.config_lock);
>  out_slots:
>  	if (ret)
> -		kvm_vm_dead(kvm);
> +		kvm_vm_bugged(kvm);
>  
>  	mutex_unlock(&kvm->slots_lock);
>  
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index b58a74c1722d..37f835d77b65 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -10783,7 +10783,7 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
>  	bool req_immediate_exit = false;
>  
>  	if (kvm_request_pending(vcpu)) {
> -		if (kvm_check_request(KVM_REQ_VM_DEAD, vcpu)) {
> +		if (kvm_test_request(KVM_REQ_VM_DEAD, vcpu)) {
>  			r = -EIO;
>  			goto out;
>  		}
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 3bde4fb5c6aa..56898e4ab524 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -853,7 +853,6 @@ struct kvm {
>  	u32 dirty_ring_size;
>  	bool dirty_ring_with_bitmap;
>  	bool vm_bugged;
> -	bool vm_dead;
>  
>  #ifdef CONFIG_HAVE_KVM_PM_NOTIFIER
>  	struct notifier_block pm_notifier;
> @@ -893,7 +892,6 @@ struct kvm {
>  
>  static inline void kvm_vm_dead(struct kvm *kvm)
>  {
> -	kvm->vm_dead = true;
>  	kvm_make_all_cpus_request(kvm, KVM_REQ_VM_DEAD);
>  }
>  
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index eec82775c5bf..4220579a9a74 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -4403,7 +4403,7 @@ static long kvm_vcpu_ioctl(struct file *filp,
>  	struct kvm_fpu *fpu = NULL;
>  	struct kvm_sregs *kvm_sregs = NULL;
>  
> -	if (vcpu->kvm->mm != current->mm || vcpu->kvm->vm_dead)
> +	if (vcpu->kvm->mm != current->mm || vcpu->kvm->vm_bugged)
>  		return -EIO;
>  
>  	if (unlikely(_IOC_TYPE(ioctl) != KVMIO))
> @@ -4646,7 +4646,7 @@ static long kvm_vcpu_compat_ioctl(struct file *filp,
>  	void __user *argp = compat_ptr(arg);
>  	int r;
>  
> -	if (vcpu->kvm->mm != current->mm || vcpu->kvm->vm_dead)
> +	if (vcpu->kvm->mm != current->mm || vcpu->kvm->vm_bugged)
>  		return -EIO;
>  
>  	switch (ioctl) {
> @@ -4712,7 +4712,7 @@ static long kvm_device_ioctl(struct file *filp, unsigned int ioctl,
>  {
>  	struct kvm_device *dev = filp->private_data;
>  
> -	if (dev->kvm->mm != current->mm || dev->kvm->vm_dead)
> +	if (dev->kvm->mm != current->mm || dev->kvm->vm_bugged)
>  		return -EIO;
>  
>  	switch (ioctl) {
> @@ -5131,7 +5131,7 @@ static long kvm_vm_ioctl(struct file *filp,
>  	void __user *argp = (void __user *)arg;
>  	int r;
>  
> -	if (kvm->mm != current->mm || kvm->vm_dead)
> +	if (kvm->mm != current->mm || kvm->vm_bugged)
>  		return -EIO;
>  	switch (ioctl) {
>  	case KVM_CREATE_VCPU:
> @@ -5395,7 +5395,7 @@ static long kvm_vm_compat_ioctl(struct file *filp,
>  	struct kvm *kvm = filp->private_data;
>  	int r;
>  
> -	if (kvm->mm != current->mm || kvm->vm_dead)
> +	if (kvm->mm != current->mm || kvm->vm_bugged)
>  		return -EIO;
>  
>  	r = kvm_arch_vm_compat_ioctl(filp, ioctl, arg);
> 
> base-commit: 8046d29dde17002523f94d3e6e0ebe486ce52166
> --