linux-kernel - Re: [PATCH 1/2] KVM: x86/mmu: Add RET_PF_RETRY_INVALID

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <aC5yboPeBkVKR3jo@yzhao56-desk.sh.intel.com>
Date: Thu, 22 May 2025 08:40:14 +0800
From: Yan Zhao <yan.y.zhao@...el.com>
To: Sean Christopherson <seanjc@...gle.com>
CC: Rick P Edgecombe <rick.p.edgecombe@...el.com>, "kvm@...r.kernel.org"
	<kvm@...r.kernel.org>, "pbonzini@...hat.com" <pbonzini@...hat.com>, "Reinette
 Chatre" <reinette.chatre@...el.com>, "linux-kernel@...r.kernel.org"
	<linux-kernel@...r.kernel.org>
Subject: Re: [PATCH 1/2] KVM: x86/mmu: Add RET_PF_RETRY_INVALID_SLOT for
 fault retry on invalid slot

On Wed, May 21, 2025 at 08:45:21AM -0700, Sean Christopherson wrote:
> On Wed, May 21, 2025, Yan Zhao wrote:
> > On Tue, May 20, 2025 at 09:13:25AM -0700, Sean Christopherson wrote:
> > > > > @@ -4891,6 +4884,28 @@ int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code, u8 *level
> > > > >  }
> > > > >  EXPORT_SYMBOL_GPL(kvm_tdp_map_page);
> > > > >  
> > > > > +int kvm_tdp_prefault_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code, u8 *level)
> > > > > +{
> > > > > +	int r;
> > > > > +
> > > > > +	/*
> > > > > +	 * Restrict to TDP page fault, since that's the only case where the MMU
> > > > > +	 * is indexed by GPA.
> > > > > +	 */
> > > > > +	if (vcpu->arch.mmu->page_fault != kvm_tdp_page_fault)
> > > > > +		return -EOPNOTSUPP;
> > > > > +
> > > > > +	for (;;) {
> > > > > +		r = kvm_tdp_map_page(vcpu, gpa, error_code, level);
> > > > > +		if (r != -EAGAIN)
> > > > > +			break;
> > > > > +
> > > > > +		/* Comment goes here. */
> > > > > +		kvm_vcpu_srcu_read_unlock(vcpu);
> > > > > +		kvm_vcpu_srcu_read_lock(vcpu);
> > > > For the hang in the pre_fault_memory_test reported by Reinette [1], it's because
> > > > the memslot removal succeeds after releasing the SRCU, then the old root is
> > > > stale. So kvm_mmu_reload() is required here to prevent is_page_fault_stale()
> > > > from being always true.
> > > 
> > > That wouldn't suffice, KVM would also need to process KVM_REQ_MMU_FREE_OBSOLETE_ROOTS,
> > > otherwise kvm_mmu_reload() will do nothing.
> > In commit 20a6cff3b283 ("KVM: x86/mmu: Check and free obsolete roots in
> > kvm_mmu_reload()"), KVM_REQ_MMU_FREE_OBSOLETE_ROOTS is processed in
> > kvm_mmu_reload().
> 
> Oh, right!  I completely forgot about that.  Hmm, that reduces the complexity a
> little bit, but I'm still leaning towards punting -EAGAIN to userspace.
> 
> > > Thinking about this scenario more, I don't mind punting this problem to userspace
> > > for KVM_PRE_FAULT_MEMORY because there's no existing behavior/ABI to uphold, and
> > > because the complexity vs. ABI tradeoffs are heavily weighted in favor of punting
> > > to userspace.  Whereas for KVM_RUN, KVM can't change existing behavior without
> > > breaking userspace, should provide consistent behavior regardless of VM type, and
> > > KVM needs the "complex" code irrespective of this particular scenario.
> > > 
> > > I especially like punting to userspace if KVM returns -EAGAIN, not -ENOENT,
> > > because then KVM is effectively providing the same overall behavior as KVM_RUN,
> > > just without slightly different roles and responsibilities between KVM and
> > > userspace.  And -ENOENT is also flat out wrong for the case where a memslot is
> > > being moved, but the new base+size still contains the to-be-faulted GPA.
> > > 
> > > I still don't like RET_PF_RETRY_INVALID_SLOT, because that bleeds gory MMU details
> > > into the rest of KVM, but KVM can simply return -EAGAIN if an invalid memslot is
> > > encountered during prefault (as identified by fault->prefetch).
> > >
> > > For TDX though, tdx_handle_ept_violation() needs to play nice with the scenario,
> > > i.e. punting to userspace is not a viable option.  But that path also has options
> > > that aren't available to prefaulting.  E.g. it could (and probably should) break
> > > early if a request is pending instead of special casing KVM_REQ_VM_DEAD, which
> > Hmm, for TDX, there's no request KVM_REQ_MMU_FREE_OBSOLETE_ROOTS for slot
> > removal. (see commit aa8d1f48d353 ("KVM: x86/mmu: Introduce a quirk to control
> > memslot zap behavior").
> > 
> > > would take care of the KVM_REQ_MMU_FREE_OBSOLETE_ROOTS scenario.  And as Rick
> > > called out, the zero-step mess really needs to be solved in a more robust fashion.
> > > 
> > > So this?
> > Looks good to me for non-TDX side.
> > 
> > For TDX, could we provide below fix based on your change?
> 
> Hmm, I'd prefer not to, mainly because I don't want to special case things even
> more in the common MMU code, e.g. I don't want to bleed the "no memslot == exit"
> logic into multiple locations.  And very strictly speaking, a memory fault exit
> isn't guaranteed, as userspace could set a new memory region before the vCPU
> retries the fault.
> 
> Returning -EAGAIN isn't an option because that would break userspace (e.g. our
> VMM doesn't handle EAGAIN and supports SNP), and documenting the behavior would
> be weird.  For KVM_PRE_FAULT_MEMORY, KVM's documentation can simply state that
> EAGAIN is returned KVM encounters temporary resource contention and that userspace
> should simply try again.  It's an ABI change, but for a nascent ioctl() and a
> scenario that won't be hit in practice, so I'm confident we can make the change
> without breaking userspace.
> 
> And again, this is an unfortunate side effect of zero-step; there's no such
> restriction for SNP, and ideally the TDX zero-step pain will be solved and this
> would also go away for TDX too, so I'm hesitant to bake this behavior into KVM's
> ABI.
> 
> My best idea is to special case this in tdx_handle_ept_violation().  It's also
> very gross, but at least the nastiness is limited to the zero-step mitigation
> mess, and is co-located with the code that doesn't actually play nice with
> RET_PF_RETRY.  E.g.
Thank you, Sean.
We'll go down this path.

> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index b952bc673271..ca47d08ae112 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -1907,6 +1907,8 @@ static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu)
>          * handle retries locally in their EPT violation handlers.
>          */
>         while (1) {
> +               struct kvm_memory_slot *slot;
> +
>                 ret = __vmx_handle_ept_violation(vcpu, gpa, exit_qual);
>  
>                 if (ret != RET_PF_RETRY || !local_retry)
> @@ -1920,6 +1922,10 @@ static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu)
>                         break;
>                 }
>  
> +               slot = kvm_vcpu_gfn_to_memslot(vcpu, fault.gfn);
> +               if (slot && slot->flags & KVM_MEMSLOT_INVALID)
> +                       break;
> +
>                 cond_resched();
>         }
>         return ret;
> 
...
> > And would you mind if I included your patch in my next version? I can update the
> > related selftests as well.
> 
> Yes, please do!
Thanks.