Message-ID: <20250519023737.30360-1-yan.y.zhao@intel.com>
Date: Mon, 19 May 2025 10:37:37 +0800
From: Yan Zhao <yan.y.zhao@...el.com>
To: pbonzini@...hat.com,
	seanjc@...gle.com
Cc: reinette.chatre@...el.com,
	rick.p.edgecombe@...el.com,
	linux-kernel@...r.kernel.org,
	kvm@...r.kernel.org,
	Yan Zhao <yan.y.zhao@...el.com>
Subject: [PATCH 1/2] KVM: x86/mmu: Add RET_PF_RETRY_INVALID_SLOT for fault retry on invalid slot

Introduce a new return value, RET_PF_RETRY_INVALID_SLOT, to inform callers
of kvm_mmu_do_page_fault() that a fault retry is due to an invalid memslot.
This helps prevent deadlocks when a memslot is removed while KVM is
pre-faulting GPAs in that memslot, or while TDX is locally retrying the
fault of a private page.

Take pre-faulting as an example.

During the KVM_PRE_FAULT_MEMORY ioctl, kvm->srcu is held across the
pre-faulting of the entire range. On x86, kvm_arch_vcpu_pre_fault_memory()
invokes kvm_tdp_map_page(), which retries kvm_mmu_do_page_fault() as long
as it returns RET_PF_RETRY.
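
For reference, the retry loop has the following shape (an abridged sketch
of kvm_tdp_map_page(); the signal-pending and resched checks between
iterations are elided):

int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code,
		     u8 *level)
{
	int r;

	do {
		/*
		 * kvm->srcu is held by the caller; nothing here drops
		 * the SRCU read lock between iterations.
		 */
		r = kvm_mmu_do_page_fault(vcpu, gpa, error_code, true,
					  NULL, level);
	} while (r == RET_PF_RETRY);

	if (r < 0)
		return r;

	switch (r) {
	case RET_PF_FIXED:
	case RET_PF_SPURIOUS:
		return 0;
	case RET_PF_EMULATE:
		return -ENOENT;
	default:
		return -EIO;
	}
}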

If a memslot is deleted during the KVM_PRE_FAULT_MEMORY ioctl, then after
kvm_invalidate_memslot() marks the slot as invalid and makes it visible
via rcu_assign_pointer() in kvm_swap_active_memslots(),
kvm_mmu_do_page_fault() may encounter the invalid slot and return
RET_PF_RETRY. kvm_tdp_map_page() then retries without releasing the SRCU
lock. Meanwhile, synchronize_srcu_expedited() in kvm_swap_active_memslots()
blocks, waiting for kvm_vcpu_pre_fault_memory() to release the SRCU lock,
leading to a deadlock.

"slot deleting" thread                   "prefault" thread
-----------------------------            ----------------------
                                         srcu_read_lock();
(A)
invalid_slot->flags |= KVM_MEMSLOT_INVALID;
rcu_assign_pointer();

                                         kvm_tdp_map_page();
                                         (B)
                                            do {
                                               r = kvm_mmu_do_page_fault();

(C) synchronize_srcu_expedited();

                                            } while (r == RET_PF_RETRY);

                                         (D) srcu_read_unlock();

As shown in the diagram, (C) waits for (D). However, (B) keeps finding an
invalid slot before (C) completes, so (B) retries indefinitely and (D) is
never reached.
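
For completeness, the slot-deletion side looks roughly as follows
(simplified from kvm_invalidate_memslot() in virt/kvm/kvm_main.c, with
unrelated bookkeeping omitted):

static void kvm_invalidate_memslot(struct kvm *kvm,
				   struct kvm_memory_slot *old,
				   struct kvm_memory_slot *invalid_slot)
{
	/* (A) Mark the replacement copy of the slot invalid ... */
	kvm_copy_memslot(invalid_slot, old);
	invalid_slot->flags |= KVM_MEMSLOT_INVALID;
	kvm_replace_memslot(kvm, old, invalid_slot);

	/*
	 * ... and activate it.  kvm_swap_active_memslots() publishes
	 * the new memslots via rcu_assign_pointer() and then (C) blocks
	 * in synchronize_srcu_expedited() until every SRCU reader,
	 * including the prefault thread, drops kvm->srcu.
	 */
	kvm_swap_active_memslots(kvm, old->as_id);
}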

The local retry code in TDX's EPT violation handler faces a similar issue,
where a deadlock can occur when faulting a private GFN in a slot that is
concurrently being removed.
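
Its shape is roughly the following (a simplified sketch, not the literal
upstream code; "fault_is_private" stands in for the real condition for
retrying locally):

	do {
		ret = __vmx_handle_ept_violation(vcpu, gpa, exit_qual);
		/*
		 * Faults on private GPAs are retried here, with
		 * kvm->srcu still held, instead of by re-entering the
		 * guest; an invalid slot that keeps producing
		 * RET_PF_RETRY therefore spins forever.
		 */
	} while (ret == RET_PF_RETRY && fault_is_private);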

To resolve the deadlock, introduce a new return value
RET_PF_RETRY_INVALID_SLOT and modify kvm_mmu_do_page_fault() to return
RET_PF_RETRY_INVALID_SLOT instead of RET_PF_RETRY when encountering an
invalid memslot. This prevents endless retries in kvm_tdp_map_page() or
tdx_handle_ept_violation(), allowing the srcu to be released and enabling
slot removal to proceed.

As all callers of kvm_tdp_map_page(), i.e.,
kvm_arch_vcpu_pre_fault_memory() and tdx_gmem_post_populate(), are in the
pre-fault path, treat RET_PF_RETRY_INVALID_SLOT the same as RET_PF_EMULATE
and return -ENOENT from kvm_tdp_map_page(), so that userspace is made
aware of the slot removal.

Returning RET_PF_RETRY_INVALID_SLOT from kvm_mmu_do_page_fault() does not
affect kvm_mmu_page_fault() or kvm_arch_async_page_ready(), as their
callers either only check whether the return value is > 0 (to re-enter the
vCPU for a retry) or ignore the return value entirely.
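
Those callers follow KVM's usual exit-handler convention, illustrated
below (a hypothetical caller shape, not a specific upstream function):

	/*
	 * r < 0 is an error, r == 0 exits to userspace, and any r > 0
	 * re-enters the guest.  Like RET_PF_RETRY, the new
	 * RET_PF_RETRY_INVALID_SLOT is translated to a positive value
	 * by kvm_mmu_page_fault(), so such callers are unaffected.
	 */
	r = kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0);
	if (r < 0)
		return r;	/* propagate the error */
	return r > 0;		/* 1: resume the guest, 0: exit to userspace */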

Reported-by: Reinette Chatre <reinette.chatre@...el.com>
Signed-off-by: Yan Zhao <yan.y.zhao@...el.com>
---
 arch/x86/kvm/mmu/mmu.c          | 3 ++-
 arch/x86/kvm/mmu/mmu_internal.h | 3 +++
 2 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index cbc84c6abc2e..3331e1e1aa69 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4599,7 +4599,7 @@ static int kvm_mmu_faultin_pfn(struct kvm_vcpu *vcpu,
 	 * be zapped before KVM inserts a new MMIO SPTE for the gfn.
 	 */
 	if (slot->flags & KVM_MEMSLOT_INVALID)
-		return RET_PF_RETRY;
+		return RET_PF_RETRY_INVALID_SLOT;
 
 	if (slot->id == APIC_ACCESS_PAGE_PRIVATE_MEMSLOT) {
 		/*
@@ -4879,6 +4879,7 @@ int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code, u8 *level
 		return 0;
 
 	case RET_PF_EMULATE:
+	case RET_PF_RETRY_INVALID_SLOT:
 		return -ENOENT;
 
 	case RET_PF_RETRY:
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index db8f33e4de62..1aa14a32225e 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -311,6 +311,8 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
  * RET_PF_INVALID: the spte is invalid, let the real page fault path update it.
  * RET_PF_FIXED: The faulting entry has been fixed.
  * RET_PF_SPURIOUS: The faulting entry was already fixed, e.g. by another vCPU.
+ * RET_PF_RETRY_INVALID_SLOT: Let the CPU fault again on the address; the
+ *                            memslot is flagged KVM_MEMSLOT_INVALID.
  *
  * Any names added to this enum should be exported to userspace for use in
  * tracepoints via TRACE_DEFINE_ENUM() in mmutrace.h
@@ -326,6 +328,7 @@ enum {
 	RET_PF_INVALID,
 	RET_PF_FIXED,
 	RET_PF_SPURIOUS,
+	RET_PF_RETRY_INVALID_SLOT,
 };
 
 /*
-- 
2.43.2

