Message-ID: <ZPWHtUh9SZDc4EoN@yzhao56-desk.sh.intel.com>
Date:   Mon, 4 Sep 2023 15:31:01 +0800
From:   Yan Zhao <yan.y.zhao@...el.com>
To:     Sean Christopherson <seanjc@...gle.com>
CC:     <kvm@...r.kernel.org>, <linux-kernel@...r.kernel.org>,
        <pbonzini@...hat.com>, <chao.gao@...el.com>, <kai.huang@...el.com>,
        <robert.hoo.linux@...il.com>, <yuan.yao@...ux.intel.com>
Subject: Re: [PATCH v4 12/12] KVM: x86/mmu: convert kvm_zap_gfn_range() to
 use shared mmu_lock in TDP MMU

On Fri, Aug 25, 2023 at 02:34:30PM -0700, Sean Christopherson wrote:
> On Fri, Jul 14, 2023, Yan Zhao wrote:
> > Convert kvm_zap_gfn_range() from holding mmu_lock for write to holding it for
> > read in the TDP MMU, and allow zapping of non-leaf SPTEs of level <= 1G.
> > TLB flushes are executed/requested within tdp_mmu_zap_spte_atomic(), guarded
> > by the RCU lock.
> > 
> > GFN zap can be super slow if mmu_lock is held for write when there is
> > contention. In the worst case, a huge number of CPU cycles is spent yielding
> > GFN by GFN, i.e. the loop of "check and flush tlb -> drop rcu lock ->
> > drop mmu_lock -> cpu_relax() -> take mmu_lock -> take rcu lock" is entered
> > for every GFN.
> > Contention can come either from concurrent zaps holding mmu_lock for write
> > or from tdp_mmu_map() holding mmu_lock for read.
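
For reference, the yield loop described above looks roughly like this; a
simplified sketch loosely modeled on tdp_mmu_iter_cond_resched(), not verbatim
code from any kernel version, with the helper name made up:

static bool zap_yield_if_contended(struct kvm *kvm, bool flush)
{
        if (need_resched() || rwlock_needbreak(&kvm->mmu_lock)) {
                if (flush)
                        kvm_flush_remote_tlbs(kvm);     /* "check and flush tlb" */
                rcu_read_unlock();                      /* "drop rcu lock" */
                /* drops mmu_lock, reschedules/relaxes, then retakes mmu_lock */
                cond_resched_rwlock_write(&kvm->mmu_lock);
                rcu_read_lock();                        /* "take rcu lock" */
                return true;    /* caller restarts the walk at the current GFN */
        }
        return false;
}
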
> 
> The lock contention should go away with a pre-check[*], correct?  That's a more
Yes, I think so, though I haven't had time to verify it yet.

> complete solution too, in that it also avoids lock contention for the shadow MMU,
> which presumably suffers the same problem (I don't see anything that would prevent
> it from yielding).
> 
> If we do want to zap with mmu_lock held for read, I think we should convert
> kvm_tdp_mmu_zap_leafs() and all its callers to run under read, because unless I'm
> missing something, the rules are the same regardless of _why_ KVM is zapping, e.g.
> the zap needs to be protected by mmu_invalidate_in_progress, which ensures no other
> tasks will race to install SPTEs that are supposed to be zapped.
Yes. I didn't do that for the unmap path only because I didn't want to make a
big code change.
The write lock in the kvm_unmap_gfn_range() path is taken in arch-agnostic code,
which is not easy to change, right?
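
For context, a much-simplified, hypothetical sketch of why that is: in the
common mmu-notifier path the lock mode is chosen before the arch handler runs,
so x86 can't switch just its own kvm_unmap_gfn_range() to a read lock without
touching common code. The wrapper below and its name are illustrative, not the
actual functions in virt/kvm/kvm_main.c:

static bool handle_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range,
                             bool (*handler)(struct kvm *kvm,
                                             struct kvm_gfn_range *range))
{
        bool flush;

        write_lock(&kvm->mmu_lock);     /* lock mode fixed here, for every arch */
        flush = handler(kvm, range);    /* e.g. kvm_unmap_gfn_range() */
        write_unlock(&kvm->mmu_lock);

        return flush;
}
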

> 
> If you post a version of this patch that converts kvm_tdp_mmu_zap_leafs(), please
> post it as a standalone patch.  At a glance it doesn't have any dependencies on the
> MTRR changes, and I don't want this type of change buried at the end of a series
> that is for a fairly niche setup.  This needs a lot of scrutiny to make sure zapping
> under read really is safe.
Given that the pre-check patch should work, do you think it's still worthwhile
to do this conversion?
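
For reference, the shape of the converted path under discussion would be
roughly the following. This is an illustrative sketch, not the actual patch:
tdp_mmu_zap_spte_atomic() is the helper that, per the commit message, zaps
atomically and flushes under RCU, while the wrapper and its name are made up,
and a real version would also need to yield on contention and walk all roots:

static void zap_range_shared(struct kvm *kvm, struct kvm_mmu_page *root,
                             gfn_t start, gfn_t end)
{
        struct tdp_iter iter;

        read_lock(&kvm->mmu_lock);
        rcu_read_lock();

        tdp_root_for_each_pte(iter, root, start, end) {
retry:
                if (!is_shadow_present_pte(iter.old_spte))
                        continue;
                /* Atomic cmpxchg zap; the TLB flush happens inside, under RCU. */
                if (tdp_mmu_zap_spte_atomic(kvm, &iter))
                        goto retry;     /* lost a race; iter.old_spte was refreshed */
        }

        rcu_read_unlock();
        read_unlock(&kvm->mmu_lock);
}
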

> 
> [*] https://lore.kernel.org/all/20230825020733.2849862-1-seanjc@google.com
> 
> > After converting to hold mmu_lock for read, less contention is detected and
> > retaking mmu_lock for read is also faster. There's no need to flush the TLB
> > before dropping mmu_lock on contention, as SPTEs have already been zapped
> > atomically and TLBs are flushed, or a flush is requested, immediately within
> > the RCU lock.
> > To reduce the TLB flush count, non-leaf SPTEs at or below the 1G level are
> > allowed to be zapped if their ranges are fully covered by the GFN zap range.
> > 
> > Signed-off-by: Yan Zhao <yan.y.zhao@...el.com>
> > ---
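
To illustrate the coverage rule in the quoted commit message (non-leaf SPTEs no
higher than the 1G level may be zapped in one go only when the range they map
lies entirely inside the zap range), a hypothetical helper; the name is made
up, the macros are the usual x86 KVM ones:

static bool spte_fully_covered(gfn_t spte_base_gfn, int level,
                               gfn_t start, gfn_t end)
{
        return level <= PG_LEVEL_1G &&
               spte_base_gfn >= start &&
               spte_base_gfn + KVM_PAGES_PER_HPAGE(level) <= end;
}
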
