Message-ID: <Z7hdqCqOLJWcR71K@yzhao56-desk.sh.intel.com>
Date: Fri, 21 Feb 2025 19:04:08 +0800
From: Yan Zhao <yan.y.zhao@...el.com>
To: Sean Christopherson <seanjc@...gle.com>
CC: Paolo Bonzini <pbonzini@...hat.com>, <linux-kernel@...r.kernel.org>,
	<kvm@...r.kernel.org>, Rick Edgecombe <rick.p.edgecombe@...el.com>,
	Isaku Yamahata <isaku.yamahata@...el.com>,
	Tony Lindgren <tony.lindgren@...ux.intel.com>,
	Sean Christopherson <sean.j.christopherson@...el.com>,
	Kai Huang <kai.huang@...el.com>
Subject: Re: [PATCH 20/30] KVM: TDX: create/destroy VM structure
On Thu, Feb 20, 2025 at 04:55:18PM -0800, Sean Christopherson wrote:
> TL;DR: Please don't merge this patch to kvm/next or kvm/queue.
>
> On Thu, Feb 20, 2025, Paolo Bonzini wrote:
> > @@ -72,8 +94,10 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
> > .has_emulated_msr = vmx_has_emulated_msr,
> >
> > .vm_size = sizeof(struct kvm_vmx),
> > - .vm_init = vmx_vm_init,
> > - .vm_destroy = vmx_vm_destroy,
> > +
> > + .vm_init = vt_vm_init,
> > + .vm_destroy = vt_vm_destroy,
> > + .vm_free = vt_vm_free,
> >
> > .vcpu_precreate = vmx_vcpu_precreate,
> > .vcpu_create = vmx_vcpu_create,
>
> ...
>
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index 374d89e6663f..e0b9b845df58 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -12884,6 +12884,7 @@ void kvm_arch_destroy_vm(struct kvm *kvm)
> > kvm_page_track_cleanup(kvm);
> > kvm_xen_destroy_vm(kvm);
> > kvm_hv_destroy_vm(kvm);
> > + static_call_cond(kvm_x86_vm_free)(kvm);
> > }
>
> Sorry to throw a wrench in things, but I have a fix that I want to send for 6.14[1],
> i.e. before this code, and to land that fix I need/want to destroy vCPUs before
> calling kvm_x86_ops.vm_destroy(). *sigh*
>
> The underlying issue is that both nVMX and nSVM suck and access all manner of VM-wide
> state when destroying a vCPU that is currently in nested guest mode, and I want
> to fix the most pressing issue of destroying vCPUs at a random time once and for
> all. nVMX and nSVM also need to be cleaned up to not access so much darn state,
> but I'm worried that "fixing" the nested cases will only whack the biggest mole.
>
> Commit 6fcee03df6a1 ("KVM: x86: avoid loading a vCPU after .vm_destroy was called")
> papered over an AVIC case, but there are issues, e.g. with the MSR filters[2],
> and the NULL pointer deref that's blocking the aforementioned fix is a nVMX access
> to the PIC.
>
> I haven't fully tested destroying vCPUs before calling vm_destroy(), but I can't
> see anything in vmx_vm_destroy() or svm_vm_destroy() that expects to run while
> vCPUs are still alive. If anything, it's opposite, e.g. freeing VMX's IPIv PID
> table before vCPUs are destroyed is blatantly unsafe.
Is it possible to simply move the code that frees the PID table from
vmx_vm_destroy() to static_call_cond(kvm_x86_vm_free)(kvm)?
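Something like the sketch below, assuming vmx_vm_free() would be wired
up as VMX's .vm_free implementation, and reusing the existing
vmx_get_pid_table_order() helper in vmx.c:

static void vmx_vm_free(struct kvm *kvm)
{
	struct kvm_vmx *kvm_vmx = to_kvm_vmx(kvm);

	/*
	 * Freeing the IPIv PID table here is safe because .vm_free runs
	 * at the end of kvm_arch_destroy_vm(), i.e. after all vCPUs have
	 * been destroyed and nothing can send IPIs through the table.
	 */
	free_pages((unsigned long)kvm_vmx->pid_table,
		   vmx_get_pid_table_order(kvm));
}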
>
> The good news is, I think it'll lead to a better approach (and naming). KVM already
> frees MMU state before vCPU state, because while MMUs are largely VM-scoped, all
> of the common MMU state needs to be freed before any one vCPU is freed.
>
> And so my plan is to carve out a kvm_destroy_mmus() helper, which can then call
> the TDX hook to release/reclaim the HKID, which I assume needs to be done after
> KVM's general MMU destruction, but before vCPUs are freed.
>
> I'll make sure to Cc y'all on the series (typing and testing furiously to try and
> get it out asap). But to try and avoid posting code that's not usable for TDX,
> will this work?
>
> static void kvm_destroy_mmus(struct kvm *kvm)
> {
> struct kvm_vcpu *vcpu;
> unsigned long i;
>
> if (current->mm == kvm->mm) {
> /*
> * Free memory regions allocated on behalf of userspace,
> * unless the memory map has changed due to process exit
> * or fd copying.
> */
> mutex_lock(&kvm->slots_lock);
> __x86_set_memory_region(kvm, APIC_ACCESS_PAGE_PRIVATE_MEMSLOT,
> 0, 0);
> __x86_set_memory_region(kvm, IDENTITY_PAGETABLE_PRIVATE_MEMSLOT,
> 0, 0);
> __x86_set_memory_region(kvm, TSS_PRIVATE_MEMSLOT, 0, 0);
> mutex_unlock(&kvm->slots_lock);
> }
>
> kvm_for_each_vcpu(i, vcpu, kvm) {
> kvm_clear_async_pf_completion_queue(vcpu);
> kvm_unload_vcpu_mmu(vcpu);
> }
>
> kvm_x86_call(mmu_destroy)(kvm);
I suppose this will hook tdx_mmu_release_hkid()? (a sketch of what I
imagine follows the quoted function below)
> }
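i.e., roughly this (vt_mmu_destroy() is a name I made up here; is_td()
and tdx_mmu_release_hkid() are from this series):

static void vt_mmu_destroy(struct kvm *kvm)
{
	/* VMX has no VM-wide MMU state that needs explicit teardown here. */
	if (is_td(kvm))
		tdx_mmu_release_hkid(kvm);
}

plus a ".mmu_destroy = vt_mmu_destroy," entry in vt_x86_ops.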
>
> void kvm_arch_pre_destroy_vm(struct kvm *kvm)
> {
> kvm_mmu_pre_destroy_vm(kvm);
> }
>
> void kvm_arch_destroy_vm(struct kvm *kvm)
> {
> /*
> * WARNING! MMUs must be destroyed before vCPUs, and vCPUs must be
> * destroyed before any VM state. Most MMU state is VM-wide, but is
> * tracked per-vCPU, and so must be unloaded/freed in its entirety
> * before any one vCPU is destroyed. For all other VM state, vCPUs
> * expect to be able to access VM state until the vCPU is freed.
> */
> kvm_destroy_mmus(kvm);
> kvm_destroy_vcpus(kvm);
> kvm_x86_call(vm_destroy)(kvm);
>
> ...
> }
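If vCPUs are then guaranteed to be gone before vm_destroy() runs, it
also looks like the separate .vm_free hook could be dropped entirely,
e.g. (sketch only, reusing the naming from this series):

static void vt_vm_destroy(struct kvm *kvm)
{
	if (is_td(kvm)) {
		/* The HKID was already released via kvm_destroy_mmus(). */
		tdx_vm_free(kvm);
		return;
	}

	vmx_vm_destroy(kvm);
}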
>
> [1] https://lore.kernel.org/all/Z66RC673dzlq2YuA@google.com
> [2] https://lore.kernel.org/all/20240703175618.2304869-2-aaronlewis@google.com