linux-kernel - [PATCH v2 1/4] KVM: x86/mmu: Use vCPU's APICv status when handling APIC

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-Id: <20211022004927.1448382-2-seanjc@google.com>
Date:   Thu, 21 Oct 2021 17:49:24 -0700
From:   Sean Christopherson <seanjc@...gle.com>
To:     Paolo Bonzini <pbonzini@...hat.com>
Cc:     Sean Christopherson <seanjc@...gle.com>,
        Vitaly Kuznetsov <vkuznets@...hat.com>,
        Wanpeng Li <wanpengli@...cent.com>,
        Jim Mattson <jmattson@...gle.com>,
        Joerg Roedel <joro@...tes.org>, kvm@...r.kernel.org,
        linux-kernel@...r.kernel.org, Maxim Levitsky <mlevitsk@...hat.com>
Subject: [PATCH v2 1/4] KVM: x86/mmu: Use vCPU's APICv status when handling
 APIC_ACCESS memslot

Query the vCPU's APICv state, not the overall VM's state, when handling
a page fault that hit the APIC Access Page memslot.  While the vCPU
state can be stale with respect to the overall VM APICv state, it's at
least coherent with respect to the page fault being handled, i.e. is the
value of the vCPU's APICv enabling at the time of the fault.

Using the VM's state is not functionally broken, but neither does it
provide any meaningful advantages, and arguably retrying the faulting
instruction, as will happen if the vCPU state is stale (see below), is
semantically the desired behavior as it aligns with existing KVM MMU
behavior where a change in VM state invalidates the context for handling
a page fault.

The vCPU state can be stale if a different vCPU (or other actor) toggles
APICv state after the page fault exit from the guest, in which case KVM
will attempt to handle the page fault prior to servicing the pending
KVM_REQ_APICV_UPDATE.  However, by design such a race is benign in all
scenarios.

If APICv is globally enabled but locally disabled, the page fault handler
will go straight to emulation (emulation of the local APIC is ok even if
APIC virtualization is enabled while the guest is running).

If APICv is globally disabled but locally enabled, there are two possible
scenarios, and in each scenario the final outcome is correct: KVM blocks
all vCPUs until SPTEs for the APIC Access Page have been zapped.

  (1) the page fault acquires mmu_lock before the APICv update calls
      kvm_zap_gfn_range().  The page fault will install a "bad" SPTE, but
      that SPTE will never be consumed as (a) no vCPUs can be running in
      the guest due to the kick from KVM_REQ_APICV_UPDATE, and (b) the
      "bad" SPTE is guaranteed to be zapped by kvm_zap_gfn_range() before
      any vCPU re-enters the guest.  Because KVM_REQ_APICV_UPDATE is
      raised before the VM state is update, all vCPUs will block on
      apicv_update_lock when attempting to service the request until the
      lock is dropped by the update flow, _after_ the SPTE is zapped.

  (2) the APICv update kvm_zap_gfn_range() acquires mmu_lock before the
      page fault.  The page fault handler will bail due to the change in
      mmu_notifier_seq made by kvm_zap_gfn_range().

Arguably, KVM should not (attempt to) install a SPTE based on state that
is knowingly outdated, but using the per-VM state suffers a similar flaw
as not checking for a pending KVM_REQ_APICV_UPDATE means that KVM will
either knowingly install a SPTE that will get zapped (page fault wins the
race) or will get rejected (kvm_zap_gfn_range() wins the race).  However,
adding a check for KVM_REQ_APICV_UPDATE is extremely undesirable as it
implies the check is meaningful/sufficient, which is not the case since
the page fault handler doesn't hold mmu_lock when it checks APICv status,
i.e. the mmu_notifier_seq mechanism is required for correctness.

Cc: Maxim Levitsky <mlevitsk@...hat.com>
Signed-off-by: Sean Christopherson <seanjc@...gle.com>
---
 arch/x86/kvm/mmu/mmu.c | 29 +++++++++++++++++++++++++++--
 1 file changed, 27 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index f9f228963088..6530d874f5b7 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3911,10 +3911,35 @@ static bool kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
 		 * If the APIC access page exists but is disabled, go directly
 		 * to emulation without caching the MMIO access or creating a
 		 * MMIO SPTE.  That way the cache doesn't need to be purged
-		 * when the AVIC is re-enabled.
+		 * when the AVIC is re-enabled.  Note, the vCPU's APICv state
+		 * may be stale if an APICv update is in-progress, but KVM is
+		 * guaranteed to operate correctly in all scenarios:
+		 *
+		 * 1.  APICv is globally enabled but locally disabled.  This
+		 *     vCPU will go straight to emulation without touching the
+		 *     MMIO cache or installing a SPTE.
+		 *
+		 * 2a. APICv is globally disabled but locally enabled, and this
+		 *     vCPU acquires mmu_lock before the APICv update calls
+		 *     kvm_zap_gfn_range().  This vCPU will install a bad SPTE,
+		 *     but the SPTE will never be consumed as (a) no vCPUs can
+		 *     be running due to the kick from KVM_REQ_APICV_UPDATE,
+		 *     and (b) the "bad" SPTE is guaranteed to be zapped by
+		 *     kvm_zap_gfn_range() before any vCPU re-enters the guest.
+		 *     Because KVM_REQ_APICV_UPDATE is raised before the VM
+		 *     state is update, vCPUs will block on apicv_update_lock
+		 *     when attempting to service the request until the lock is
+		 *     dropped by the update flow, _after_ the SPTE is zapped.
+		 *
+		 * 2b. APICv is globally disabled but locally enabled, and the
+		 *     APICv update kvm_zap_gfn_range() acquires mmu_lock before
+		 *     this vCPU.  This vCPU will bail from the page fault
+		 *     handler due kvm_zap_gfn_range() bumping mmu_notifier_seq,
+		 *     and will retry the faulting instruction after servicing
+		 *     the KVM_REQ_APICV_UPDATE request.
 		 */
 		if (slot && slot->id == APIC_ACCESS_PAGE_PRIVATE_MEMSLOT &&
-		    !kvm_apicv_activated(vcpu->kvm)) {
+		    !kvm_vcpu_apicv_active(vcpu)) {
 			*r = RET_PF_EMULATE;
 			return true;
 		}
-- 
2.33.0.1079.g6e70778dc9-goog