[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <YAnzB3Uwn3AVTXGN@google.com>
Date: Thu, 21 Jan 2021 13:32:55 -0800
From: Sean Christopherson <seanjc@...gle.com>
To: Ben Gardon <bgardon@...gle.com>
Cc: linux-kernel@...r.kernel.org, kvm@...r.kernel.org,
Paolo Bonzini <pbonzini@...hat.com>,
Peter Xu <peterx@...hat.com>, Peter Shier <pshier@...gle.com>,
Peter Feiner <pfeiner@...gle.com>,
Junaid Shahid <junaids@...gle.com>,
Jim Mattson <jmattson@...gle.com>,
Yulei Zhang <yulei.kernel@...il.com>,
Wanpeng Li <kernellwp@...il.com>,
Vitaly Kuznetsov <vkuznets@...hat.com>,
Xiao Guangrong <xiaoguangrong.eric@...il.com>
Subject: Re: [PATCH 19/24] kvm: x86/mmu: Protect tdp_mmu_pages with a lock
On Thu, Jan 21, 2021, Sean Christopherson wrote:
> On Tue, Jan 12, 2021, Ben Gardon wrote:
> > +static void tdp_mmu_unlink_page(struct kvm *kvm, struct kvm_mmu_page *sp,
> > + bool atomic)
> > +{
> > + if (atomic)
>
> Summarizing an off-list discussion with Ben:
>
> This path isn't reachable in this series, which means all the RCU stuff is more
> or less untestable. Only the page fault path modifies the MMU while hold a read
> lock, and it can't zap non-leaf shadow pages (only zaps large SPTEs and installs
> new SPs).
Aha! I was wrong. This will be hit when KVM zaps a 4k SPTE and installs a
large SPTE overtop a SP, e.g. if the host migrates a page for compaction and
creates a new THP.
tdp_mmu_map_handle_target_level()
tdp_mmu_set_spte_atomic()
handle_changed_spte()
__handle_changed_spte()
handle_disconnected_tdp_mmu_page()
tdp_mmu_unlink_page()
> The intent is to convert other zap-happy paths to a read lock, notably
> kvm_mmu_zap_collapsible_sptes() and kvm_recover_nx_lpages(). Ben will include
> patches to convert at least one of those in the next version of this series so
> that there is justification and coverage for the RCU-deferred freeing.
Somewhat offtopic, zap_collapsible_spte_range() looks wrong. It zaps non-leaf
SPs, and has several comments that make it quite clear that that's its intent,
but the logic is messed up. For non-leaf SPs, PFN points at the next table, not
the final PFN that is mapped into the guest. That absolutely should never be a
reserved PFN, and whether or not its a huge page is irrelevant. My analysis is
more or less confirmed by looking at Ben's internal code, which explicitly does
the exact opposite in that it explicitly zaps leaf SPTEs.
tdp_root_for_each_pte(iter, root, start, end) {
/* Ensure forward progress has been made before yielding. */
if (iter.goal_gfn != last_goal_gfn &&
tdp_mmu_iter_flush_cond_resched(kvm, &iter)) {
last_goal_gfn = iter.goal_gfn;
spte_set = false;
/*
* Yielding caused the paging structure walk to be
* reset so skip to the next iteration to continue the
* walk from the root.
*/
continue;
}
if (!is_shadow_present_pte(iter.old_spte) ||
is_last_spte(iter.old_spte, iter.level)) <--- inverted?
continue;
pfn = spte_to_pfn(iter.old_spte); <-- this would be the page table?
if (kvm_is_reserved_pfn(pfn) ||
!PageTransCompoundMap(pfn_to_page(pfn)))
continue;
tdp_mmu_set_spte(kvm, &iter, 0);
spte_set = true;
}
Coming back to this series, I wonder if the RCU approach is truly necessary to
get the desired scalability. If both zap_collapsible_sptes() and NX huge page
recovery zap _only_ leaf SPTEs, then the only path that can actually unlink a
shadow page while holding the lock for read is the page fault path that installs
a huge page over an existing shadow page.
Assuming the above analysis is correct, I think it's worth exploring alternatives
to using RCU to defer freeing the SP memory, e.g. promoting to a write lock in
the specific case of overwriting a SP (though that may not exist for rwlocks),
or maybe something entirely different?
I actually do like deferred free concept, but I find it difficult to reason
about exactly what protections are provided by RCU, and what even _needs_ to be
protected. Maybe we just need to add some __rcu annotations?
Powered by blists - more mailing lists