linux-kernel - Re: [PATCH 2/2] KVM: mmu/x86: optimize zapping by retaining non-leaf SPTEs and avoid rcu stall

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <YaV02MdGYlfGs35T@google.com>
Date:   Tue, 30 Nov 2021 00:48:24 +0000
From:   David Matlack <dmatlack@...gle.com>
To:     Mingwei Zhang <mizhang@...gle.com>
Cc:     Paolo Bonzini <pbonzini@...hat.com>,
        Sean Christopherson <seanjc@...gle.com>,
        Vitaly Kuznetsov <vkuznets@...hat.com>,
        Wanpeng Li <wanpengli@...cent.com>,
        Jim Mattson <jmattson@...gle.com>,
        Joerg Roedel <joro@...tes.org>, kvm@...r.kernel.org,
        linux-kernel@...r.kernel.org, Ben Gardon <bgardon@...gle.com>
Subject: Re: [PATCH 2/2] KVM: mmu/x86: optimize zapping by retaining non-leaf
 SPTEs and avoid rcu stall

On Wed, Nov 24, 2021 at 09:44:21PM +0000, Mingwei Zhang wrote:
> TDP MMU SPTE zapping process currently uses two levels of iterations. The
> first level iteration happens at the for loop within the zap_gfn_range()
> with the purpose of calibrating the accurate range for zapping. The second
> level itreration start at tdp_mmu_set_spte{,_atomic}() that tears down the

iteration

> whole paging structures (leaf and non-leaf SPTEs) within the range. The
> former iteration is yield safe, while the second one is not.

I know what you mean but I'd suggest being more specific than "yield
safe". For example:

  Unlike the outer loop, the recursive zapping done under
  tdp_mmu_set_spte{,_atomic} does not yield. Since zapping is done with
  a pre-order traversal, zapping sufficiently large ranges can lead to
  RCU stall warnings.

I'd also clarify here that the TDP MMU iterator uses a pre-order
traversal which causes us KVM to end up doing the maximum amount of
zapping under tdp_mmu_set_spte{,_atomic} and not the outer for loop.

> 
> In many cases, zapping SPTE process could be optimized since the non-leaf
> SPTEs could most likely be retained for the next allocation. On the other
> hand, for large scale SPTE zapping scenarios, we may end up zapping too
> many SPTEs and use excessive CPU time that causes the RCU stall warning.
> 
> The follow selftest reproduces the warning:
> 
> 	(env: kvm.tdp_mmu=Y)
> 	./dirty_log_perf_test -v 64 -b 8G
> 
> Optimize the zapping process by skipping all SPTEs above a certain level in
> the first iteration. This allows us to control the granularity of the
> actual zapping and invoke tdp_mmu_iter_cond_resched() on time. In addition,
> we would retain some of the non-leaf SPTEs to accelerate next allocation.
> 
> For the selection of the `certain level`, we choose the PG_LEVEL_1G because
> it is currently the largest page size supported and it natually fits the
> scenario of splitting large pages.
> 
> For `zap_all` case (usually) at VM teardown time, we use a two-phase
> mechanism: the 1st phase zaps all SPTEs at PG_LEVEL_1G level and 2nd phase
> zaps everything else. This is achieved by the helper function
> __zap_gfn_range().
> 
> Cc: Sean Christopherson <seanjc@...gle.com>
> Cc: Ben Gardon <bgardon@...gle.com>
> Cc: David Matlack <dmatlack@...gle.com>
> 
> Signed-off-by: Mingwei Zhang <mizhang@...gle.com>
> ---
>  arch/x86/kvm/mmu/tdp_mmu.c | 57 ++++++++++++++++++++++++++------------
>  1 file changed, 40 insertions(+), 17 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 89d16bb104de..3fadc51c004a 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -697,24 +697,16 @@ static inline bool tdp_mmu_iter_cond_resched(struct kvm *kvm,
>   * account for the possibility that other threads are modifying the paging
>   * structures concurrently. If shared is false, this thread should hold the
>   * MMU lock in write mode.
> + *
> + * If zap_all is true, eliminate all the paging structures that contains the
> + * SPTEs.
>   */
> -static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
> -			  gfn_t start, gfn_t end, bool can_yield, bool flush,
> -			  bool shared)
> +static bool __zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
> +			    gfn_t start, gfn_t end, bool can_yield, bool flush,
> +			    bool shared, bool zap_all)
>  {
> -	gfn_t max_gfn_host = 1ULL << (shadow_phys_bits - PAGE_SHIFT);
> -	bool zap_all = (start == 0 && end >= max_gfn_host);
>  	struct tdp_iter iter;
>  
> -	/*
> -	 * Bound the walk at host.MAXPHYADDR, guest accesses beyond that will
> -	 * hit a #PF(RSVD) and never get to an EPT Violation/Misconfig / #NPF,
> -	 * and so KVM will never install a SPTE for such addresses.
> -	 */
> -	end = min(end, max_gfn_host);
> -
> -	kvm_lockdep_assert_mmu_lock_held(kvm, shared);
> -
>  	rcu_read_lock();
>  
>  	tdp_root_for_each_pte(iter, root, start, end) {
> @@ -725,17 +717,24 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
>  			continue;
>  		}
>  
> -		if (!is_shadow_present_pte(iter.old_spte))
> +		/*
> +		 * In zap_all case, ignore the checking of present since we have
> +		 * to zap everything.
> +		 */
> +		if (!zap_all && !is_shadow_present_pte(iter.old_spte))
>  			continue;

I don't believe there's any reason to attempt to zap a non-present spte,
even in the zap_all case. In any case, this change deserves its own
patch and a commit message that describes why the old logic is incorrect
and how this fixes it.

>  
>  		/*
>  		 * If this is a non-last-level SPTE that covers a larger range
>  		 * than should be zapped, continue, and zap the mappings at a
> -		 * lower level, except when zapping all SPTEs.
> +		 * lower level. Actual zapping started at proper granularity
> +		 * that is not so large as to cause a soft lockup when handling
> +		 * the changed pte (which does not yield).
>  		 */
>  		if (!zap_all &&
>  		    (iter.gfn < start ||
> -		     iter.gfn + KVM_PAGES_PER_HPAGE(iter.level) > end) &&
> +		     iter.gfn + KVM_PAGES_PER_HPAGE(iter.level) > end ||
> +		     iter.level > PG_LEVEL_1G) &&
>  		    !is_last_spte(iter.old_spte, iter.level))
>  			continue;

This if statement is getting a bit long. I'd suggest breaking out the
level check and also using KVM_MAX_HUGEPAGE_LEVEL.

e.g.

        /*
         * If not doing zap_all, only zap up to the huge page level to
         * avoid doing too much work in the recursive tdp_mmu_set_spte*
         * call below, since it does not yield.
         *
         * This will potentially leave behind some childless page tables
         * but that's ok because ...
         */
         if (!zap_all && iter.level > KVM_MAX_HUGEPAGE_LEVEL)
                continue;

And on that note, what is the reasoning for why it's ok to leave behind
childless page tables? I assume it's because most of the time we'll use
that page table again in the future, and at worst we leave the page
table allocated until the VM is cleaned up?

>  
> @@ -756,6 +755,30 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
>  	return flush;
>  }
>  
> +static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
> +			  gfn_t start, gfn_t end, bool can_yield, bool flush,
> +			  bool shared)
> +{
> +	gfn_t max_gfn_host = 1ULL << (shadow_phys_bits - PAGE_SHIFT);
> +	bool zap_all = (start == 0 && end >= max_gfn_host);
> +
> +	/*
> +	 * Bound the walk at host.MAXPHYADDR, guest accesses beyond that will
> +	 * hit a #PF(RSVD) and never get to an EPT Violation/Misconfig / #NPF,
> +	 * and so KVM will never install a SPTE for such addresses.
> +	 */
> +	end = min(end, max_gfn_host);
> +
> +	kvm_lockdep_assert_mmu_lock_held(kvm, shared);
> +
> +	flush = __zap_gfn_range(kvm, root, start, end, can_yield, flush, shared,
> +				false);
> +	if (zap_all)
> +		flush = __zap_gfn_range(kvm, root, start, end, can_yield, flush,
> +					shared, true);
> +	return flush;
> +}
> +
>  /*
>   * Tears down the mappings for the range of gfns, [start, end), and frees the
>   * non-root pages mapping GFNs strictly within that range. Returns true if
> -- 
> 2.34.0.rc2.393.gf8c9666880-goog
>