lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-ID: <20250519023613.30329-1-yan.y.zhao@intel.com>
Date: Mon, 19 May 2025 10:36:13 +0800
From: Yan Zhao <yan.y.zhao@...el.com>
To: pbonzini@...hat.com,
	seanjc@...gle.com
Cc: reinette.chatre@...el.com,
	rick.p.edgecombe@...el.com,
	linux-kernel@...r.kernel.org,
	kvm@...r.kernel.org,
	Yan Zhao <yan.y.zhao@...el.com>
Subject: [PATCH 0/2] Introduce RET_PF_RETRY_INVALID_SLOT

Introduce a new return value RET_PF_RETRY_INVALID_SLOT to inform callers of
kvm_mmu_do_page_fault() that a fault retry is caused by an invalid memslot.
This helps prevent deadlock when a memslot is removed during pre-faulting
or local retry of faulting private pages in TDX.

Take pre-faulting as an example. Removing a memslot during the ioctl
KVM_PRE_FAULT_MEMORY on x86 can lead to a kernel deadlock.

"slot deleting" thread                    "pre-fault" thread
-----------------------------             ----------------------
                                          srcu_read_lock();
(A)
invalid_slot->flags |= KVM_MEMSLOT_INVALID
rcu_assign_pointer();

                                          kvm_tdp_map_page();
                                          (B)
                                            do {
                                               r = kvm_mmu_do_page_fault();

(C) synchronize_srcu_expedited();

                                            } while (r == RET_PF_RETRY);

                                         (D) srcu_read_unlock();
    
 
The deadlock occurs because (C) is waiting for (D) to complete, but (B)
repeatedly encounters an invalid slot and retries before (C) can finish,
thus preventing (D) from being executed.
(If the srcu_read_lock() in the pre-fault thread is invoked after (A)
and before (C), deadlock can also occur).

The local retry code in TDX's EPT violation handler faces a similar issue,
where deadlock can occur if an invalid slot is found when faulting a
private GFN.

To resolve the deadlock, modify kvm_mmu_do_page_fault() to return
RET_PF_RETRY_INVALID_SLOT instead of RET_PF_RETRY when encountering an
invalid memslot. This change enables the pre-fault thread or TDX's EPT
violation handler to exit the loop and release the srcu lock.

There're some alternative solutions to address the deadlock.
1) Return -EAGAIN for an invalid slot.
   Cons: This approach involves an uAPI change, requiring userspace to
         ignore error -EAGAIN and re-run the vCPU. While QEMU already
         ignores -EAGAIN, selftests may need updates.

2) Call kvm_mmu_prepare_memory_fault_exit() and return -EFAULT for an
   invalid slot.
   Cons: - kvm_mmu_prepare_memory_fault_exit() is not recognized by
           userspace for normal VMs.
         - Returning -EFAULT when a memslot is still invalid (i.e., not
           completely removed) may confuse userspace.

3) Release the srcu lock before cond_resched() and re-acquire it afterward
   in kvm_tdp_map_page() and tdx_handle_ept_violation():
   Cons: - It allows each kvm_vcpu_pre_fault_memory() to pre-fault memory
           with different memslot layouts.
         - It requires tdx_gmem_post_populate() to acquire the srcu lock
           before invoking kvm_tdp_map_page(), which is redundant since
           tdx_gmem_post_populate() already holds the kvm->slots_lock.


Patch 1 opts to introduce RET_PF_RETRY_INVALID_SLOT, which prevents endless
retries in kvm_tdp_map_page() and tdx_handle_ept_violation() while avoiding
exiting to userspace.

Patch 2 provides a selftest to reproduce the deadlock when patch 1 is
absent and verifies the pre-fault can correctly completes when the faulting
range is in a removing memslot.

Note: The selftest pre_fault_memory_test relies on commit 20a6cff3b283
("KVM: x86/mmu: Check and free obsolete roots in kvm_mmu_reload()") to run
the second ioctl KVM_PRE_FAULT_MEMORY successfully after a memory slot is
removed during the first ioctl KVM_PRE_FAULT_MEMORY.
Base commit:
kvm-x86/next or
kvm/next + commit 20a6cff3b283 ("KVM: x86/mmu: Check and free obsolete
roots in kvm_mmu_reload()") 

Yan Zhao (2):
  KVM: x86/mmu: Add RET_PF_RETRY_INVALID_SLOT for fault retry on invalid
    slot
  KVM: selftests: Test prefault memory with concurrent memslot removal

 arch/x86/kvm/mmu/mmu.c                        |  3 +-
 arch/x86/kvm/mmu/mmu_internal.h               |  3 +
 .../selftests/kvm/pre_fault_memory_test.c     | 82 +++++++++++++++----
 3 files changed, 72 insertions(+), 16 deletions(-)


base-commit: e59ae112a7c2e0f58d9f5905de299fe206f84af0
-- 
2.43.2


Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ