Message-ID: <aCyqJTSDTKt1xiKr@google.com>
Date: Tue, 20 May 2025 09:13:25 -0700
From: Sean Christopherson <seanjc@...gle.com>
To: Yan Zhao <yan.y.zhao@...el.com>
Cc: Rick P Edgecombe <rick.p.edgecombe@...el.com>, "kvm@...r.kernel.org" <kvm@...r.kernel.org>,
"pbonzini@...hat.com" <pbonzini@...hat.com>, Reinette Chatre <reinette.chatre@...el.com>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH 1/2] KVM: x86/mmu: Add RET_PF_RETRY_INVALID_SLOT for fault retry on invalid slot
On Tue, May 20, 2025, Yan Zhao wrote:
> On Mon, May 19, 2025 at 10:06:22AM -0700, Sean Christopherson wrote:
> > On Mon, May 19, 2025, Rick P Edgecombe wrote:
> > > On Mon, 2025-05-19 at 06:33 -0700, Sean Christopherson wrote:
> > > > Was this hit by a real VMM? If so, why is a TDX VMM removing a memslot without
> > > > kicking vCPUs out of KVM?
> > > >
> > > > Regardless, I would prefer not to add a new RET_PF_* flag for this. At a glance,
> > > > KVM can simply drop and reacquire SRCU in the relevant paths.
> > >
> > > During the initial debugging and kicking around stage, this was the first
> > > direction we looked. But kvm_gmem_populate() doesn't have SRCU locked, so then
> > > kvm_tdp_map_page() tries to unlock without it being held (although that version
> > > didn't check r == RET_PF_RETRY like you had). Yan had the following concerns and
> > > came up with the version in this series, which we held review on for the list:
> >
> > Ah, I missed the kvm_gmem_populate() => kvm_tdp_map_page() chain.
> >
> > > > However, upon further consideration, I am reluctant to implement this fix for
> >
> > Which fix?
> >
> > > > the following reasons:
> > > > - kvm_gmem_populate() already holds the kvm->slots_lock.
> > > > - While retrying with SRCU unlock and lock can work around the
> > > > KVM_MEMSLOT_INVALID deadlock, it results in each kvm_vcpu_pre_fault_memory()
> > > > and tdx_handle_ept_violation() faulting with different memslot layouts.
> >
> > This behavior has existed since pretty much the beginning of KVM time. TDX is the
> > oddball that doesn't re-enter the guest. All other flavors re-enter the guest on
> > RET_PF_RETRY, which means dropping and reacquiring SRCU. Which is why I don't like
> > RET_PF_RETRY_INVALID_SLOT; it's simply handling the case we know about.
> >
> > Arguably, _TDX_ is buggy by not providing this behavior.
> >
> > > I'm not sure why the second one is really a problem. For the first one I think
> > > that path could just take the SRCU lock in the proper order with
> > > kvm->slots_lock?
> >
> > Acquiring SRCU inside slots_lock should be fine. The reverse order would be
> > problematic, as KVM synchronizes SRCU while holding slots_lock.
> >
> > That said, I don't love the idea of grabbing SRCU, because it's so obviously a
> > hack. What about something like this?
> So you want to avoid acquiring SRCU in the kvm_gmem_populate() path?
Yes, ideally. Acquiring SRCU wouldn't be the end of the world, but I don't love
the idea of taking a lock just so that the lock can be conditionally dropped in
a common flow. It's not a deal breaker (I wouldn't be surprised if there's at
least one path in KVM that acquires SRCU purely because of such behavior), but
separating kvm_tdp_prefault_page() from kvm_tdp_map_page() keeps that juggling
out of the kvm_gmem_populate() path entirely.
> Generally I think it's good, except that it missed a kvm_mmu_reload() (please
> refer to my comment below) and the kvm_vcpu_srcu_read_{un}lock() pair in the
> tdx_handle_ept_violation() path (so Reinette reported that it failed the TDX
> stress tests at [1]).
> > @@ -4891,6 +4884,28 @@ int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code, u8 *level
> >  }
> >  EXPORT_SYMBOL_GPL(kvm_tdp_map_page);
> >
> > +int kvm_tdp_prefault_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code, u8 *level)
> > +{
> > +        int r;
> > +
> > +        /*
> > +         * Restrict to TDP page fault, since that's the only case where the MMU
> > +         * is indexed by GPA.
> > +         */
> > +        if (vcpu->arch.mmu->page_fault != kvm_tdp_page_fault)
> > +                return -EOPNOTSUPP;
> > +
> > +        for (;;) {
> > +                r = kvm_tdp_map_page(vcpu, gpa, error_code, level);
> > +                if (r != -EAGAIN)
> > +                        break;
> > +
> > +                /* Comment goes here. */
> > +                kvm_vcpu_srcu_read_unlock(vcpu);
> > +                kvm_vcpu_srcu_read_lock(vcpu);
> For the hang in the pre_fault_memory_test reported by Reinette [1], it's because
> the memslot removal succeeds after releasing the SRCU, then the old root is
> stale. So kvm_mmu_reload() is required here to prevent is_page_fault_stale()
> from being always true.
That wouldn't suffice; KVM would also need to process
KVM_REQ_MMU_FREE_OBSOLETE_ROOTS, otherwise kvm_mmu_reload() will do nothing.
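Just to spell out how much machinery that pulls in: a prefault loop that drops
SRCU itself would need to look something like this (untested sketch, not a
proposal; assumes the kvm_tdp_prefault_page() split from the snippet above):

        for (;;) {
                r = kvm_tdp_map_page(vcpu, gpa, error_code, level);
                if (r != -EAGAIN)
                        break;

                /* Drop SRCU so that an in-flight memslot update can complete. */
                kvm_vcpu_srcu_read_unlock(vcpu);
                kvm_vcpu_srcu_read_lock(vcpu);

                /*
                 * Completing the memslot update bumps the memslot generation
                 * and invalidates the old root, so the obsolete root needs to
                 * be freed and a new root loaded before retrying, otherwise
                 * is_page_fault_stale() will always be true.
                 */
                if (kvm_check_request(KVM_REQ_MMU_FREE_OBSOLETE_ROOTS, vcpu))
                        kvm_mmu_free_obsolete_roots(vcpu);

                r = kvm_mmu_reload(vcpu);
                if (r)
                        break;
        }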
Thinking about this scenario more, I don't mind punting this problem to userspace
for KVM_PRE_FAULT_MEMORY because there's no existing behavior/ABI to uphold, and
because the complexity vs. ABI tradeoffs are heavily weighted in favor of punting
to userspace. Whereas for KVM_RUN, KVM can't change existing behavior without
breaking userspace, should provide consistent behavior regardless of VM type, and
KVM needs the "complex" code irrespective of this particular scenario.
I especially like punting to userspace if KVM returns -EAGAIN, not -ENOENT,
because then KVM is effectively providing the same overall behavior as KVM_RUN,
just with slightly different roles and responsibilities between KVM and
userspace. And -ENOENT is also flat out wrong for the case where a memslot is
being moved, but the new base+size still contains the to-be-faulted GPA.
I still don't like RET_PF_RETRY_INVALID_SLOT, because that bleeds gory MMU details
into the rest of KVM, but KVM can simply return -EAGAIN if an invalid memslot is
encountered during prefault (as identified by fault->prefetch).
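Userspace's side of that contract is then trivial, e.g. something like this in
the VMM (rough sketch; vcpu_fd/gpa/size are placeholders, and it relies on KVM
writing the leftover range back to the struct):

        struct kvm_pre_fault_memory range = {
                .gpa  = gpa,
                .size = size,
        };
        int ret;

        do {
                /* On partial progress, KVM updates range.gpa/range.size. */
                ret = ioctl(vcpu_fd, KVM_PRE_FAULT_MEMORY, &range);
        } while (ret < 0 && (errno == EAGAIN || errno == EINTR));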
For TDX though, tdx_handle_ept_violation() needs to play nice with the scenario,
i.e. punting to userspace is not a viable option. But that path also has options
that aren't available to prefaulting. E.g. it could (and probably should) break
early if a request is pending instead of special casing KVM_REQ_VM_DEAD, which
would take care of the KVM_REQ_MMU_FREE_OBSOLETE_ROOTS scenario. And as Rick
called out, the zero-step mess really needs to be solved in a more robust fashion.
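E.g. something along these lines in the local retry loop (very rough sketch,
function names and structure approximate, and it ignores the operand-busy and
zero-step details):

        while (1) {
                ret = __vmx_handle_ept_violation(vcpu, gpa, exit_qual);
                if (ret != RET_PF_RETRY)
                        break;

                /*
                 * Bail on _any_ pending request, not just KVM_REQ_VM_DEAD.
                 * Returning to vcpu_enter_guest() services the requests,
                 * e.g. frees obsolete roots after a memslot update, before
                 * the fault is retried.
                 */
                if (kvm_request_pending(vcpu) || signal_pending(current))
                        break;

                cond_resched();
        }

But that's a separate cleanup.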
So this?
---
From: Sean Christopherson <seanjc@...gle.com>
Date: Tue, 20 May 2025 07:55:32 -0700
Subject: [PATCH] KVM: x86/mmu: Return -EAGAIN if userspace deletes/moves memslot during prefault
Return -EAGAIN if userspace attempts to delete or move a memslot while also
prefaulting memory for that same memslot, i.e. force userspace to retry
instead of trying to handle the scenario entirely within KVM. Unlike
KVM_RUN, which needs to handle the scenario entirely within KVM because
userspace has come to depend on such behavior, KVM_PRE_FAULT_MEMORY can
return -EAGAIN without breaking userspace as this scenario can't have ever
worked (and there's no sane use case for prefaulting to a memslot that's
being deleted/moved).
And also unlike KVM_RUN, the prefault path doesn't naturally guarantee
forward progress. E.g. to handle such a scenario, KVM would need to drop
and reacquire SRCU to break the deadlock between the memslot update
(synchronizes SRCU) and the prefault (waits for the memslot update to
complete).
However, dropping SRCU creates more problems, as completing the memslot
update will bump the memslot generation, which in turn will invalidate the
MMU root. To handle that, prefaulting would need to handle pending
KVM_REQ_MMU_FREE_OBSOLETE_ROOTS requests and do kvm_mmu_reload() prior to
mapping each individual page.
I.e. to fully handle this scenario, prefaulting would eventually need to
look a lot like vcpu_enter_guest(). Given that there's no reasonable use
case and practically zero risk of breaking userspace, punt the problem to
userspace and avoid adding unnecessary complexity to the prefault path.
Note, TDX's guest_memfd post-populate path is unaffected as slots_lock is
held for the entire duration of populate(), i.e. any memslot modifications
will be fully serialized against TDX's flavor of prefaulting.
Reported-by: Reinette Chatre <reinette.chatre@...el.com>
Closes: https://lore.kernel.org/all/20250519023737.30360-1-yan.y.zhao@intel.com
Debugged-by: Yan Zhao <yan.y.zhao@...el.com>
Signed-off-by: Sean Christopherson <seanjc@...gle.com>
---
arch/x86/kvm/mmu/mmu.c | 10 ++++++++--
1 file changed, 8 insertions(+), 2 deletions(-)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index a284dce227a0..7ae56a3c7607 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4595,10 +4595,16 @@ static int kvm_mmu_faultin_pfn(struct kvm_vcpu *vcpu,
         /*
          * Retry the page fault if the gfn hit a memslot that is being deleted
          * or moved.  This ensures any existing SPTEs for the old memslot will
-         * be zapped before KVM inserts a new MMIO SPTE for the gfn.
+         * be zapped before KVM inserts a new MMIO SPTE for the gfn.  Punt the
+         * error to userspace if this is a prefault, as KVM's prefaulting ABI
+         * doesn't need to provide the same forward progress guarantees as KVM_RUN.
          */
-        if (slot->flags & KVM_MEMSLOT_INVALID)
+        if (slot->flags & KVM_MEMSLOT_INVALID) {
+                if (fault->prefetch)
+                        return -EAGAIN;
+
                 return RET_PF_RETRY;
+        }
 
         if (slot->id == APIC_ACCESS_PAGE_PRIVATE_MEMSLOT) {
                 /*
base-commit: 45eb29140e68ffe8e93a5471006858a018480a45
--