[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <Zva4aORxE9ljlMNe@google.com>
Date: Fri, 27 Sep 2024 06:51:58 -0700
From: Sean Christopherson <seanjc@...gle.com>
To: Yan Zhao <yan.y.zhao@...el.com>
Cc: Isaku Yamahata <isaku.yamahata@...el.com>, kvm@...r.kernel.org, sagis@...gle.com,
chao.gao@...el.com, pbonzini@...hat.com, rick.p.edgecombe@...el.com,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH] KVM: x86/tdp_mmu: Trigger the callback only when an
interesting change
On Thu, Sep 26, 2024, Yan Zhao wrote:
> On Thu, Sep 12, 2024 at 05:07:57PM -0700, Sean Christopherson wrote:
> > On Thu, Sep 12, 2024, Isaku Yamahata wrote:
> > > KVM MMU behavior
> > > ================
> > > The leaf SPTE state machine is coded in make_spte(). Consider AD bits and
> > > the present bits for simplicity. The two functionalities and AD bits
> > > support are related in this context. unsync (manipulate D bit and W bit,
> > > and handle write protect fault) and access tracking (manipulate A bit and
> > > present bit, and hand handle page fault). (We don't consider dirty page
> > > tracking for now as it's future work of TDX KVM.)
> > >
> > > * If AD bit is enabled:
> > > D bit state change for dirty page tracking
> > > On the first EPT violation without prefetch,
> > > - D bits are set.
> > > - Make SPTE writable as TDX supports only RXW (or if write fault).
> > > (TDX KVM doesn't support write protection at this state. It's future work.)
> > >
> > > On the second EPT violation.
> > > - clear D bits if write fault.
> >
> > Heh, I was literally just writing changelogs for patches to address this (I told
> > Sagi I would do it "this week"; that was four weeks ago).
> >
> > This is a bug in make_spte(). Replacing a W=1,D=1 SPTE with a W=1,D=0 SPTE is
> > nonsensical. And I'm pretty sure it's an outright but for the TDP MMU (see below).
> >
> > Right now, the fixes for make_spte() are sitting toward the end of the massive
> > kvm_follow_pfn() rework (80+ patches and counting), but despite the size, I am
> > fairly confident that series can land in 6.13 (lots and lots of small patches).
> >
> > ---
> > Author: Sean Christopherson <seanjc@...gle.com>
> > AuthorDate: Thu Sep 12 16:23:21 2024 -0700
> > Commit: Sean Christopherson <seanjc@...gle.com>
> > CommitDate: Thu Sep 12 16:35:06 2024 -0700
> >
> > KVM: x86/mmu: Flush TLBs if resolving a TDP MMU fault clears W or D bits
> >
> > Do a remote TLB flush if installing a leaf SPTE overwrites an existing
> > leaf SPTE (with the same target pfn) and clears the Writable bit or the
> > Dirty bit. KVM isn't _supposed_ to clear Writable or Dirty bits in such
> > a scenario, but make_spte() has a flaw where it will fail to set the Dirty
> > if the existing SPTE is writable.
> >
> > E.g. if two vCPUs race to handle faults, the KVM will install a W=1,D=1
> > SPTE for the first vCPU, and then overwrite it with a W=1,D=0 SPTE for the
> > second vCPU. If the first vCPU (or another vCPU) accesses memory using
> > the W=1,D=1 SPTE, i.e. creates a writable, dirty TLB entry, and that is
> > the only SPTE that is dirty at the time of the next relevant clearing of
> > the dirty logs, then clear_dirty_gfn_range() will not modify any SPTEs
> > because it sees the D=0 SPTE, and thus will complete the clearing of the
> > dirty logs without performing a TLB flush.
> But it looks that kvm_flush_remote_tlbs_memslot() will always be invoked no
> matter clear_dirty_gfn_range() finds a D bit or not.
Oh, right, I forgot about that. I'll tweak the changelog to call that out before
posting. Hmm, and I'll drop the Cc: stable@ too, as commit b64d740ea7dd ("kvm:
x86: mmu: Always flush TLBs when enabling dirty logging") was a bug fix, i.e. if
anything should be backported it's that commit.
> kvm_mmu_slot_apply_flags
> |kvm_mmu_slot_leaf_clear_dirty
> | |kvm_tdp_mmu_clear_dirty_slot
> | |clear_dirty_gfn_range
> |kvm_flush_remote_tlbs_memslot
>
> > Opportunistically harden the TDP MMU against clearing the Writable bit as
> > well, both to prevent similar bugs for write-protection, but also so that
> > the logic can eventually be deduplicated into spte.c (mmu_spte_update() in
> > the shadow MMU has similar logic).
> >
> > Fix the bug in the TDP MMU's page fault handler even though make_spte() is
> > clearly doing odd things, e.g. it marks the page dirty in its slot but
> > doesn't set the Dirty bit. Precisely because replacing a leaf SPTE with
> > another leaf SPTE is so rare, the cost of hardening KVM against such bugs
> > is negligible. The make_spte() will be addressed in a future commit.
> >
> > Fixes: bb18842e2111 ("kvm: x86/mmu: Add TDP MMU PF handler")
> > Cc: David Matlack <dmatlack@...gle.com>
> > Cc: stable@...r.kernel.org
> > Signed-off-by: Sean Christopherson <seanjc@...gle.com>
> >
> > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> > index 3b996c1fdaab..7c6d1c610f0e 100644
> > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > @@ -1038,7 +1038,9 @@ static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu,
> > else if (tdp_mmu_set_spte_atomic(vcpu->kvm, iter, new_spte))
> > return RET_PF_RETRY;
> > else if (is_shadow_present_pte(iter->old_spte) &&
> > - !is_last_spte(iter->old_spte, iter->level))
> > + (!is_last_spte(iter->old_spte, iter->level) ||
> > + (is_mmu_writable_spte(old_spte) && !is_writable_pte(new_spte)) ||
> > + (is_dirty_spte(old_spte) && !is_dirty_spte(new_spte))))
> Should we also check is_accessed_spte() as what's done in mmu_spte_update()?
Heh, it's impossible to see in this standalone "posting", but earlier in the
massive series is a patch that removes that code. Aptly titled "KVM: x86/mmu:
Don't force flush if SPTE update clears Accessed bit".
> It is possible for make_spte() to make the second spte !is_dirty_spte(), e.g.
> the second one is caused by a KVM_PRE_FAULT_MEMORY ioctl.
Ya, the mega-series also has a fix for that: "KVM: x86/mmu: Always set SPTE's
dirty bit if it's created as writable".
> > kvm_flush_remote_tlbs_gfn(vcpu->kvm, iter->gfn, iter->level);
> >
> > /*
> > ---
> >
> > > arch/x86/kvm/mmu/spte.h | 6 ++++++
> > > arch/x86/kvm/mmu/tdp_mmu.c | 28 +++++++++++++++++++++++++---
> > > 2 files changed, 31 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
> > > index a72f0e3bde17..1726f8ec5a50 100644
> > > --- a/arch/x86/kvm/mmu/spte.h
> > > +++ b/arch/x86/kvm/mmu/spte.h
> > > @@ -214,6 +214,12 @@ extern u64 __read_mostly shadow_nonpresent_or_rsvd_mask;
> > > */
> > > #define FROZEN_SPTE (SHADOW_NONPRESENT_VALUE | 0x5a0ULL)
> > >
> > > +#define EXTERNAL_SPTE_IGNORE_CHANGE_MASK \
> > > + (shadow_acc_track_mask | \
> > > + (SHADOW_ACC_TRACK_SAVED_BITS_MASK << \
> > > + SHADOW_ACC_TRACK_SAVED_BITS_SHIFT) | \
> >
> > Just make TDX require A/D bits, there's no reason to care about access tracking.
> If KVM_PRE_FAULT_MEMORY is allowed for TDX and if
> cpu_has_vmx_ept_ad_bits() is false in TDX's hardware (not sure if it's possible),
Make it a requirement in KVM that TDX hardware supports A/D bits and that KVM's
module param is enabled. EPT A/D bits have been supported in all CPUs since
Haswell, I don't expect them to ever go away.
> access tracking bit is possbile to be changed, as in below scenario:
>
> 1. KVM_PRE_FAULT_MEMORY ioctl calls kvm_arch_vcpu_pre_fault_memory() to map
> a GFN, and make_spte() will call mark_spte_for_access_track() to
> remove shadow_acc_track_mask (i.e. RWX bits) and move R+X left to
> SHADOW_ACC_TRACK_SAVED_BITS_SHIFT.
> 2. If a concurrent page fault occurs on the same GFN on another vCPU, then
> make_spte() in that vCPU will not see prefetch and the new_spte is
> with RWX bits and with no bits set in SHADOW_ACC_TRACK_SAVED_MASK.
This should be fixed by the mega-series. I'll make sure to Cc you on that series.
Thanks much for the input and feedback!
Powered by blists - more mailing lists