lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-Id: <20220423034752.1161007-1-seanjc@google.com>
Date:   Sat, 23 Apr 2022 03:47:40 +0000
From:   Sean Christopherson <seanjc@...gle.com>
To:     Paolo Bonzini <pbonzini@...hat.com>
Cc:     Sean Christopherson <seanjc@...gle.com>,
        Vitaly Kuznetsov <vkuznets@...hat.com>,
        Wanpeng Li <wanpengli@...cent.com>,
        Jim Mattson <jmattson@...gle.com>,
        Joerg Roedel <joro@...tes.org>, kvm@...r.kernel.org,
        linux-kernel@...r.kernel.org, Ben Gardon <bgardon@...gle.com>,
        David Matlack <dmatlack@...gle.com>,
        Venkatesh Srinivas <venkateshs@...gle.com>,
        Chao Peng <chao.p.peng@...ux.intel.com>
Subject: [PATCH 00/12] KVM: x86/mmu: Bug fixes and cleanups

This all started from an offhand comment wondering if we might be dropping
A/D bits.  Turns out, yes, we are dropping A/D bits (and W bits) in the
TDP MMU.

Attempting to prove that bug exposed two other notable goofs, the first of
which has been lurking for a decade, give or take:

  Patch 1 - KVM treats _all_ MMU-writable SPTEs as volatile, even though
            KVM never clears WRITABLE outside of MMU lock.  As a result,
            the legacy MMU (and the TDP MMU if not fixed) uses XCHG to
            update writable SPTEs.

  Patch 4 - KVM sends literally all EPT violations down the fast page
            fault handler.  The bug was introduced when KVM started
            honoring EPTP12's A/D enabling, circa 2016.

Neither fix seems to have an easily-measurable affect on performance;
page faults are so slow that wasting even a few hundred cycles is dwarfed
by the base cost.

The rest of the series is cleanups, including a v2 of my RET_PF_CONTINUE,
which would conflict in mildly annoying ways.

The first two "DO NOT MERGE" patches are effectively a PoC showing that
sending indirect MMUs down the fast page fault handler can work, it's
just ineffective.  I went down this rabbit hole after I saw a massive
performance drop in the legacy MMU from patch 3 and thought it might be due
to detecting spurious faults out of mmu_lock.  The TDP MMU didn't exhibit
the same problem, so it just _had_ to be write-contention on mmu_lock.
Right!?!?

Wrong. Turns out that spurious faults are a non-issue, but inverting the
"ad_disabled" flag in the shadow page role and preventing reuse?  Yep,
that's a wee bit of an issue.  I included the patches here to publicly
document that this has been tried, it works, it's just likely not a win.

The "DO NOT MERGE" selftest is my attempt to prove that the TDP MMU can
drop W/A/D bits.  It does work in the sense that it can trigger dropped
bits, but it's fairly worthless as a selftest because it requires an
explicit WARN in KVM :-/  Posted mostly as a reminder to myself that
the primary MMU write-protects on writeback (this isn't the first time
I've gone spelunking in that code...).

E.g. it can trigger the WARN on WRITABLE and DIRTY bits this patch on
top of the series:

diff --git a/arch/x86/kvm/mmu/tdp_iter.h b/arch/x86/kvm/mmu/tdp_iter.h
index f0af385c56e0..4f1ec56460ad 100644
--- a/arch/x86/kvm/mmu/tdp_iter.h
+++ b/arch/x86/kvm/mmu/tdp_iter.h
@@ -32,6 +32,8 @@ static inline void __kvm_tdp_mmu_write_spte(tdp_ptep_t sptep, u64 new_spte)
 static inline u64 kvm_tdp_mmu_write_spte(tdp_ptep_t sptep, u64 old_spte,
                                         u64 new_spte, int level)
 {
+       u64 current_spte;
+
        /*
         * Atomically write the SPTE if it is a shadow-present, leaf SPTE with
         * volatile bits, i.e. has bits that can be set outside of mmu_lock.
@@ -46,7 +48,13 @@ static inline u64 kvm_tdp_mmu_write_spte(tdp_ptep_t sptep, u64 old_spte,
         */
        if (is_shadow_present_pte(old_spte) && is_last_spte(old_spte, level) &&
            spte_has_volatile_bits(old_spte))
-               return kvm_tdp_mmu_write_spte_atomic(sptep, new_spte);
+               ;
+
+       current_spte = kvm_tdp_mmu_read_spte(sptep);
+       WARN_ONCE((old_spte ^ current_spte) & current_spte &
+                 (PT_WRITABLE_MASK | PT_ACCESSED_MASK | PT_DIRTY_MASK),
+                 "KVM: Dropped SPTE W/A/D bits: 0x%llx\n",
+                 (old_spte ^ current_spte) & current_spte);
 
        __kvm_tdp_mmu_write_spte(sptep, new_spte);
        return old_spte;

  ------------[ cut here ]------------
  KVM: Dropped SPTE W/A/D bits: 0x2
  WARNING: CPU: 5 PID: 874 at arch/x86/kvm/mmu/tdp_iter.h:56 __tdp_mmu_set_spte+0x18b/0x1a0 [kvm]
  Modules linked in: kvm_intel kvm irqbypass
  CPU: 5 PID: 874 Comm: volatile_spte_t Not tainted 5.18.0-rc1+ #830
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
  RIP: 0010:__tdp_mmu_set_spte+0x18b/0x1a0 [kvm]
  Call Trace:
   <TASK>
   tdp_mmu_zap_leafs+0xbc/0x180 [kvm]
   kvm_tdp_mmu_zap_leafs+0x74/0x90 [kvm]
   kvm_unmap_gfn_range+0xe8/0x100 [kvm]
   kvm_mmu_notifier_invalidate_range_start+0x11c/0x2c0 [kvm]
   __mmu_notifier_invalidate_range_start+0x7e/0x190
   change_protection+0x657/0xb30
   mprotect_fixup+0x1a1/0x310
   do_mprotect_pkey+0x1f9/0x3a0
   __x64_sys_mprotect+0x1b/0x20
   do_syscall_64+0x31/0x50
   entry_SYSCALL_64_after_hwframe+0x44/0xae
   </TASK>
  ---[ end trace 0000000000000000 ]---


Sean Christopherson (12):
  KVM: x86/mmu: Don't treat fully writable SPTEs as volatile (modulo
    A/D)
  KVM: x86/mmu: Move shadow-present check out of
    spte_has_volatile_bits()
  KVM: x86/mmu: Use atomic XCHG to write TDP MMU SPTEs with volatile
    bits
  KVM: x86/mmu: Don't attempt fast page fault just because EPT is in use
  KVM: x86/mmu: Drop exec/NX check from "page fault can be fast"
  KVM: x86/mmu: Add RET_PF_CONTINUE to eliminate bool+int* "returns"
  KVM: x86/mmu: Make all page fault handlers internal to the MMU
  KVM: x86/mmu: Use IS_ENABLED() to avoid RETPOLINE for TDP page faults
  KVM: x86/mmu: Expand and clean up page fault stats
  DO NOT MERGE: KVM: x86/mmu: Always send !PRESENT faults down the fast
    path
  DO NOT MERGE: KVM: x86/mmu: Use fast_page_fault() to detect spurious
    shadow MMU faults
  DO NOT MERGE: KVM: selftests: Attempt to detect lost dirty bits

 arch/x86/include/asm/kvm_host.h               |   5 +
 arch/x86/kvm/mmu.h                            |  87 --------
 arch/x86/kvm/mmu/mmu.c                        | 211 ++++++++++--------
 arch/x86/kvm/mmu/mmu_internal.h               | 123 +++++++++-
 arch/x86/kvm/mmu/mmutrace.h                   |   1 +
 arch/x86/kvm/mmu/paging_tmpl.h                |  15 +-
 arch/x86/kvm/mmu/spte.c                       |  28 +++
 arch/x86/kvm/mmu/spte.h                       |  15 +-
 arch/x86/kvm/mmu/tdp_iter.h                   |  34 ++-
 arch/x86/kvm/mmu/tdp_mmu.c                    |  92 +++++---
 arch/x86/kvm/x86.c                            |  24 +-
 tools/testing/selftests/kvm/.gitignore        |   1 +
 tools/testing/selftests/kvm/Makefile          |   4 +
 .../selftests/kvm/volatile_spte_test.c        | 208 +++++++++++++++++
 14 files changed, 600 insertions(+), 248 deletions(-)
 create mode 100644 tools/testing/selftests/kvm/volatile_spte_test.c


base-commit: 2a39d8b39bffdaf1a4223d0d22f07baee154c8f3
-- 
2.36.0.rc2.479.g8af0fa9b8e-goog

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ