lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <ZvrJvucBw1iIwEG6@google.com>
Date: Mon, 30 Sep 2024 08:54:38 -0700
From: Sean Christopherson <seanjc@...gle.com>
To: Yan Zhao <yan.y.zhao@...el.com>
Cc: Isaku Yamahata <isaku.yamahata@...el.com>, kvm@...r.kernel.org, sagis@...gle.com, 
	chao.gao@...el.com, pbonzini@...hat.com, rick.p.edgecombe@...el.com, 
	linux-kernel@...r.kernel.org
Subject: Re: [PATCH] KVM: x86/tdp_mmu: Trigger the callback only when an
 interesting change

On Sun, Sep 29, 2024, Yan Zhao wrote:
> On Fri, Sep 27, 2024 at 07:32:10AM -0700, Sean Christopherson wrote:
> > On Fri, Sep 27, 2024, Sean Christopherson wrote:
> > > On Thu, Sep 26, 2024, Yan Zhao wrote:
> > > > On Thu, Sep 12, 2024 at 05:07:57PM -0700, Sean Christopherson wrote:
> > > > > On Thu, Sep 12, 2024, Isaku Yamahata wrote:
> > > > > Right now, the fixes for make_spte() are sitting toward the end of the massive
> > > > > kvm_follow_pfn() rework (80+ patches and counting), but despite the size, I am
> > > > > fairly confident that series can land in 6.13 (lots and lots of small patches).
> > > > > 
> > > > > ---
> > > > > Author:     Sean Christopherson <seanjc@...gle.com>
> > > > > AuthorDate: Thu Sep 12 16:23:21 2024 -0700
> > > > > Commit:     Sean Christopherson <seanjc@...gle.com>
> > > > > CommitDate: Thu Sep 12 16:35:06 2024 -0700
> > > > > 
> > > > >     KVM: x86/mmu: Flush TLBs if resolving a TDP MMU fault clears W or D bits
> > > > >     
> > > > >     Do a remote TLB flush if installing a leaf SPTE overwrites an existing
> > > > >     leaf SPTE (with the same target pfn) and clears the Writable bit or the
> > > > >     Dirty bit.  KVM isn't _supposed_ to clear Writable or Dirty bits in such
> > > > >     a scenario, but make_spte() has a flaw where it will fail to set the Dirty
> > > > >     if the existing SPTE is writable.
> > > > >     
> > > > >     E.g. if two vCPUs race to handle faults, the KVM will install a W=1,D=1
> > > > >     SPTE for the first vCPU, and then overwrite it with a W=1,D=0 SPTE for the
> > > > >     second vCPU.  If the first vCPU (or another vCPU) accesses memory using
> > > > >     the W=1,D=1 SPTE, i.e. creates a writable, dirty TLB entry, and that is
> > > > >     the only SPTE that is dirty at the time of the next relevant clearing of
> > > > >     the dirty logs, then clear_dirty_gfn_range() will not modify any SPTEs
> > > > >     because it sees the D=0 SPTE, and thus will complete the clearing of the
> > > > >     dirty logs without performing a TLB flush.
> > > > But it looks that kvm_flush_remote_tlbs_memslot() will always be invoked no
> > > > matter clear_dirty_gfn_range() finds a D bit or not.
> > > 
> > > Oh, right, I forgot about that.  I'll tweak the changelog to call that out before
> > > posting.  Hmm, and I'll drop the Cc: stable@ too, as commit b64d740ea7dd ("kvm:
> > > x86: mmu: Always flush TLBs when enabling dirty logging") was a bug fix, i.e. if
> > > anything should be backported it's that commit.
> > 
> > Actually, a better idea.  I think it makes sense to fully commit to not flushing
> > when overwriting SPTEs, and instead rely on the dirty logging logic to do a remote
> > TLB flush.
> > 
> > E.g. on top of this change in the mega-series is a cleanup to unify the TDP MMU
> > and shadow MMU logic for clearing Writable and Dirty bits, with this comment
> > (which is a massaged version of an existing comment for mmu_spte_update()):
> > 
> > /*
> >  * Whenever an MMU-writable SPTE is overwritten with a read-only SPTE, remote
> >  * TLBs must be flushed.  Otherwise write-protecting the gfn may find a read-
> >  * only SPTE, even though the writable SPTE might be cached in a CPU's TLB.
> >  *
> >  * Remote TLBs also need to be flushed if the Dirty bit is cleared, as false
> >  * negatives are not acceptable, e.g. if KVM is using D-bit based PML on VMX.
> >  *
> >  * Don't flush if the Accessed bit is cleared, as access tracking tolerates
> >  * false negatives, and the one path that does care about TLB flushes,
> >  * kvm_mmu_notifier_clear_flush_young(), uses mmu_spte_update_no_track().
> I have a question about why access tracking tolerates false negatives on the
> path kvm_mmu_notifier_clear_flush_young().
> 
> kvm_mmu_notifier_clear_flush_young() invokes kvm_flush_remote_tlbs()
> only when kvm_age_gfn() returns true. But age_gfn_range()/kvm_age_rmap() will
> return false if the old spte is !is_accessed_spte().
> 
> So, if the Access bit is cleared in make_spte(), is a TLB flush required to
> avoid that it's not done in kvm_mmu_notifier_clear_flush_young()?

Because access tracking in general is tolerant of stale results due to lack of
TLB flushes.  E.g. on many architectures, the primary MMU has omitted TLB flushes
for years (10+ years on x86, commit b13b1d2d8692).  The basic argument is that if
there is enough memory pressure to trigger reclaim, then there will be enough TLB
pressure to ensure that omitting the TLB flush doesn't result in a large number
of "bad" reclaims[1].  And conversely, if there isn't much TLB pressure, then the
kernel shouldn't be reclaiming.

For KVM, I want to completely eliminate the TLB flush[2] for all architectures
where it's architecturally legal.  Because for KVM, the flushes are often even
more expensive than they are for the primary MMU, e.g. due to lack of batching,
the cost of VM-Exit => VM-Enter (for architectures without broadcast flushes).

[1] https://lore.kernel.org/all/CAOUHufYCmYNngmS=rOSAQRB0N9ai+mA0aDrB9RopBvPHEK42Ng@mail.gmail.com
[2] https://lore.kernel.org/all/Zmnbb-Xlyz4VXNHI@google.com

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ