Message-ID: <aC31EXLNzCVGT0EP@google.com>
Date: Wed, 21 May 2025 08:45:21 -0700
From: Sean Christopherson <seanjc@...gle.com>
To: Yan Zhao <yan.y.zhao@...el.com>
Cc: Rick P Edgecombe <rick.p.edgecombe@...el.com>, "kvm@...r.kernel.org" <kvm@...r.kernel.org>, 
	"pbonzini@...hat.com" <pbonzini@...hat.com>, Reinette Chatre <reinette.chatre@...el.com>, 
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH 1/2] KVM: x86/mmu: Add RET_PF_RETRY_INVALID_SLOT for fault
 retry on invalid slot

On Wed, May 21, 2025, Yan Zhao wrote:
> On Tue, May 20, 2025 at 09:13:25AM -0700, Sean Christopherson wrote:
> > > > @@ -4891,6 +4884,28 @@ int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code, u8 *level
> > > >  }
> > > >  EXPORT_SYMBOL_GPL(kvm_tdp_map_page);
> > > >  
> > > > +int kvm_tdp_prefault_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code, u8 *level)
> > > > +{
> > > > +	int r;
> > > > +
> > > > +	/*
> > > > +	 * Restrict to TDP page fault, since that's the only case where the MMU
> > > > +	 * is indexed by GPA.
> > > > +	 */
> > > > +	if (vcpu->arch.mmu->page_fault != kvm_tdp_page_fault)
> > > > +		return -EOPNOTSUPP;
> > > > +
> > > > +	for (;;) {
> > > > +		r = kvm_tdp_map_page(vcpu, gpa, error_code, level);
> > > > +		if (r != -EAGAIN)
> > > > +			break;
> > > > +
> > > > +		/* Let an in-flight memslot update complete before retrying. */
> > > > +		kvm_vcpu_srcu_read_unlock(vcpu);
> > > > +		kvm_vcpu_srcu_read_lock(vcpu);
> > > For the hang in the pre_fault_memory_test reported by Reinette [1]: the memslot
> > > removal succeeds after the SRCU read lock is released, leaving the old root
> > > stale. So kvm_mmu_reload() is required here to prevent is_page_fault_stale()
> > > from always returning true.
> > 
> > That wouldn't suffice; KVM would also need to process KVM_REQ_MMU_FREE_OBSOLETE_ROOTS,
> > otherwise kvm_mmu_reload() will do nothing.
> In commit 20a6cff3b283 ("KVM: x86/mmu: Check and free obsolete roots in
> kvm_mmu_reload()"), KVM_REQ_MMU_FREE_OBSOLETE_ROOTS is processed in
> kvm_mmu_reload().

Oh, right!  I completely forgot about that.  Hmm, that reduces the complexity a
little bit, but I'm still leaning towards punting -EAGAIN to userspace.
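
For reference, a rough sketch of what the kvm_mmu_reload() approach would look
like (not what I'm proposing; it assumes kvm_mmu_reload() frees obsolete roots,
per commit 20a6cff3b283):

	for (;;) {
		r = kvm_tdp_map_page(vcpu, gpa, error_code, level);
		if (r != -EAGAIN)
			break;

		/*
		 * Drop and re-acquire SRCU so an in-flight memslot update
		 * can complete, then reload the MMU to replace a root that
		 * the update made obsolete.
		 */
		kvm_vcpu_srcu_read_unlock(vcpu);
		kvm_vcpu_srcu_read_lock(vcpu);

		r = kvm_mmu_reload(vcpu);
		if (r)
			break;
	}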

> > Thinking about this scenario more, I don't mind punting this problem to userspace
> > for KVM_PRE_FAULT_MEMORY because there's no existing behavior/ABI to uphold, and
> > because the complexity vs. ABI tradeoffs are heavily weighted in favor of punting
> > to userspace.  Whereas for KVM_RUN, KVM can't change existing behavior without
> > breaking userspace, should provide consistent behavior regardless of VM type, and
> > KVM needs the "complex" code irrespective of this particular scenario.
> > 
> > I especially like punting to userspace if KVM returns -EAGAIN, not -ENOENT,
> > because then KVM is effectively providing the same overall behavior as KVM_RUN,
> > just with slightly different roles and responsibilities between KVM and
> > userspace.  And -ENOENT is also flat out wrong for the case where a memslot is
> > being moved, but the new base+size still contains the to-be-faulted GPA.
> > 
> > I still don't like RET_PF_RETRY_INVALID_SLOT, because that bleeds gory MMU details
> > into the rest of KVM, but KVM can simply return -EAGAIN if an invalid memslot is
> > encountered during prefault (as identified by fault->prefetch).
> >
> > For TDX though, tdx_handle_ept_violation() needs to play nice with the scenario,
> > i.e. punting to userspace is not a viable option.  But that path also has options
> > that aren't available to prefaulting.  E.g. it could (and probably should) break
> > early if a request is pending instead of special casing KVM_REQ_VM_DEAD, which
> Hmm, for TDX, there's no KVM_REQ_MMU_FREE_OBSOLETE_ROOTS request on slot
> removal (see commit aa8d1f48d353 ("KVM: x86/mmu: Introduce a quirk to control
> memslot zap behavior")).
> 
> > would take care of the KVM_REQ_MMU_FREE_OBSOLETE_ROOTS scenario.  And as Rick
> > called out, the zero-step mess really needs to be solved in a more robust fashion.
> > 
> > So this?
> Looks good to me for non-TDX side.
> 
> For TDX, could we provide below fix based on your change?

Hmm, I'd prefer not to, mainly because I don't want to special case things even
more in the common MMU code, e.g. I don't want to bleed the "no memslot == exit"
logic into multiple locations.  And very strictly speaking, a memory fault exit
isn't guaranteed, as userspace could set a new memory region before the vCPU
retries the fault.

Returning -EAGAIN isn't an option because that would break userspace (e.g. our
VMM doesn't handle EAGAIN and supports SNP), and documenting the behavior would
be weird.  For KVM_PRE_FAULT_MEMORY, KVM's documentation can simply state that
EAGAIN is returned when KVM encounters temporary resource contention and that
userspace should simply try again.  It's an ABI change, but it's for a nascent
ioctl() and a scenario that won't be hit in practice, so I'm confident we can
make the change without breaking userspace.

And again, this is an unfortunate side effect of zero-step; there's no such
restriction for SNP, and ideally the TDX zero-step pain will be solved and this
would go away for TDX too, so I'm hesitant to bake this behavior into KVM's
ABI.
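
To illustrate what that ABI looks like from the other side, a minimal userspace
retry loop might be (a sketch; vcpu_fd/gpa/size are hypothetical placeholders,
and it assumes the kernel updates the struct as it makes progress):

	struct kvm_pre_fault_memory range = {
		.gpa  = gpa,	/* hypothetical: GPA to prefault */
		.size = size,	/* hypothetical: size in bytes */
	};
	int r;

	/*
	 * KVM advances range.gpa and shrinks range.size as it goes, so
	 * retrying with the same struct picks up where KVM left off.
	 */
	do {
		r = ioctl(vcpu_fd, KVM_PRE_FAULT_MEMORY, &range);
	} while (r && (errno == EAGAIN || errno == EINTR));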

My best idea is to special case this in tdx_handle_ept_violation().  It's also
very gross, but at least the nastiness is limited to the zero-step mitigation
mess, and is co-located with the code that doesn't actually play nice with
RET_PF_RETRY.  E.g.

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index b952bc673271..ca47d08ae112 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1907,6 +1907,8 @@ static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu)
         * handle retries locally in their EPT violation handlers.
         */
        while (1) {
+               struct kvm_memory_slot *slot;
+
                ret = __vmx_handle_ept_violation(vcpu, gpa, exit_qual);
 
                if (ret != RET_PF_RETRY || !local_retry)
@@ -1920,6 +1922,10 @@ static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu)
                        break;
                }
 
+               slot = kvm_vcpu_gfn_to_memslot(vcpu, gpa_to_gfn(gpa));
+               if (slot && slot->flags & KVM_MEMSLOT_INVALID)
+                       break;
+
                cond_resched();
        }
        return ret;

> For a private fault, -EFAULT will be returned to userspace on retry anyway once
> the slot is completely removed, unlike non-private faults, which go down the
> emulation path after the retry.
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -4602,6 +4602,11 @@ static int kvm_mmu_faultin_pfn(struct kvm_vcpu *vcpu,
>                 if (fault->prefetch)
>                         return -EAGAIN;
> 
> +               if (fault->is_private) {
> +                       kvm_mmu_prepare_memory_fault_exit(vcpu, fault);
> +                       return -EFAULT;
> +               }
> +
>                 return RET_PF_RETRY;
>         }
> 
> 
> And would you mind if I included your patch in my next version? I can update the
> related selftests as well.

Yes, please do!
