Message-ID: <20120418230837.GA9842@amt.cnet>
Date: Wed, 18 Apr 2012 20:08:37 -0300
From: Marcelo Tosatti <mtosatti@...hat.com>
To: Xiao Guangrong <xiaoguangrong@...ux.vnet.ibm.com>
Cc: Avi Kivity <avi@...hat.com>, LKML <linux-kernel@...r.kernel.org>,
KVM <kvm@...r.kernel.org>
Subject: Re: [PATCH v2 14/16] KVM: MMU: fast path of handling guest page fault
On Wed, Apr 18, 2012 at 11:53:51AM +0800, Xiao Guangrong wrote:
> On 04/18/2012 09:47 AM, Marcelo Tosatti wrote:
>
> > On Fri, Apr 13, 2012 at 06:16:33PM +0800, Xiao Guangrong wrote:
> >> If the present bit of the page fault error code is set, it indicates
> >> that the shadow page table is populated on all levels; that means all
> >> we need to do is modify the access bits, which can be done out of
> >> mmu-lock.
> >>
> >> Currently, in order to simplify the code, we only fix page faults
> >> caused by write-protect on the fast path.
> >>
> >> Signed-off-by: Xiao Guangrong <xiaoguangrong@...ux.vnet.ibm.com>
> >> ---
> >> arch/x86/kvm/mmu.c | 205 ++++++++++++++++++++++++++++++++++++++++----
> >> arch/x86/kvm/paging_tmpl.h | 3 +
> >> 2 files changed, 192 insertions(+), 16 deletions(-)
> >>
> >> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> >> index efa5d59..fc91667 100644
> >> --- a/arch/x86/kvm/mmu.c
> >> +++ b/arch/x86/kvm/mmu.c
> >> @@ -446,6 +446,13 @@ static bool __check_direct_spte_mmio_pf(u64 spte)
> >> }
> >> #endif
> >>
> >> +static bool spte_wp_by_dirty_log(u64 spte)
> >> +{
> >> + WARN_ON(is_writable_pte(spte));
> >> +
> >> + return (spte & SPTE_ALLOW_WRITE) && !(spte & SPTE_WRITE_PROTECT);
> >> +}
> >> +
> >> static bool spte_has_volatile_bits(u64 spte)
> >> {
> >> if (!shadow_accessed_mask)
> >> @@ -454,9 +461,18 @@ static bool spte_has_volatile_bits(u64 spte)
> >> if (!is_shadow_present_pte(spte))
> >> return false;
> >>
> >> - if ((spte & shadow_accessed_mask) &&
> >> - (!is_writable_pte(spte) || (spte & shadow_dirty_mask)))
> >> - return false;
> >> + if (spte & shadow_accessed_mask) {
> >> + if (is_writable_pte(spte))
> >> + return !(spte & shadow_dirty_mask);
> >> +
> >> + /*
> >> + * If the spte is write-protected by dirty-log, it may
> >> + * be marked writable on fast page fault path, then CPU
> >> + * can modify the Dirty bit.
> >> + */
> >> + if (!spte_wp_by_dirty_log(spte))
> >> + return false;
> >> + }
> >>
> >> return true;
> >> }
> >> @@ -1109,26 +1125,18 @@ static void drop_spte(struct kvm *kvm, u64 *sptep)
> >> rmap_remove(kvm, sptep);
> >> }
> >>
> >> -static bool spte_wp_by_dirty_log(u64 spte)
> >> -{
> >> - WARN_ON(is_writable_pte(spte));
> >> -
> >> - return (spte & SPTE_ALLOW_WRITE) && !(spte & SPTE_WRITE_PROTECT);
> >> -}
> >> -
> >> static void spte_write_protect(struct kvm *kvm, u64 *sptep, bool large,
> >> bool *flush, bool page_table_protect)
> >> {
> >> u64 spte = *sptep;
> >>
> >> if (is_writable_pte(spte)) {
> >> - *flush |= true;
> >> -
> >> if (large) {
> >> pgprintk("rmap_write_protect(large): spte %p %llx\n",
> >> spte, *spte);
> >> BUG_ON(!is_large_pte(spte));
> >>
> >> + *flush |= true;
> >> drop_spte(kvm, sptep);
> >> --kvm->stat.lpages;
> >> return;
> >> @@ -1137,6 +1145,9 @@ static void spte_write_protect(struct kvm *kvm, u64 *sptep, bool large,
> >> goto reset_spte;
> >> }
> >>
> >> + /* We need to flush TLBs in this case: the fast page fault
> >> + * path can mark the spte writable after we read the sptep.
> >> + */
> >> if (page_table_protect && spte_wp_by_dirty_log(spte))
> >> goto reset_spte;
> >>
> >> @@ -1144,6 +1155,8 @@ static void spte_write_protect(struct kvm *kvm, u64 *sptep, bool large,
> >>
> >> reset_spte:
> >> rmap_printk("rmap_write_protect: spte %p %llx\n", spte, *spte);
> >> +
> >> + *flush |= true;
> >> spte = spte & ~PT_WRITABLE_MASK;
> >> if (page_table_protect)
> >> spte |= SPTE_WRITE_PROTECT;
> >> @@ -2767,18 +2780,172 @@ exit:
> >> return ret;
> >> }
> >>
> >> +static bool page_fault_can_be_fast(struct kvm_vcpu *vcpu, gfn_t gfn,
> >> + u32 error_code)
> >> +{
> >> + unsigned long *rmap;
> >> +
> >> + /*
> >> + * #PF can be fast only if the shadow page table is present and the
> >> + * fault is caused by write-protect; that means we just need to change
> >> + * the W bit of the spte, which can be done out of mmu-lock.
> >> + */
> >> + if (!(error_code & PFERR_PRESENT_MASK) ||
> >> + !(error_code & PFERR_WRITE_MASK))
> >> + return false;
> >> +
> >> + rmap = gfn_to_rmap(vcpu->kvm, gfn, PT_PAGE_TABLE_LEVEL);
> >> +
> >> + /* Quickly check the page can be writable. */
> >> + if (test_bit(PTE_LIST_WP_BIT, ACCESS_ONCE(rmap)))
> >> + return false;
> >> +
> >> + return true;
> >> +}
> >> +
> >> +static bool
> >> +fast_pf_fix_indirect_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
> >> + u64 *sptep, u64 spte, gfn_t gfn)
> >> +{
> >> + pfn_t pfn;
> >> + bool ret = false;
> >> +
> >> + /*
> >> + * For the indirect spte, it is hard to get a stable gfn since
> >> + * we just use a cmpxchg to avoid the races, which is not
> >> + * enough to avoid the ABA problem: the host can arbitrarily
> >> + * change the spte and the mapping from gfn to pfn.
> >> + *
> >> + * What we do is call gfn_to_pfn_atomic to bind the gfn and the
> >> + * pfn because after the call:
> >> + * - we hold a refcount on the pfn, which means the pfn cannot
> >> + *   be freed and reused for another gfn.
> >> + * - the pfn is writable, which means it cannot be shared by
> >> + *   different gfns.
> >> + */
> >> + pfn = gfn_to_pfn_atomic(vcpu->kvm, gfn);
> >
> > Please document what can happen in parallel whenever you manipulate
> > sptes without mmu_lock held, convincing the reader that this is safe.
> >
>
>
> OK, I will document it in locking.txt.
>
> I am not good at documenting. How about the description below?
>
> What we use to avoid all the races is spte.SPTE_ALLOW_WRITE and spte.SPTE_WRITE_PROTECT:
> SPTE_ALLOW_WRITE means the gfn is writable on both guest and host, and SPTE_WRITE_PROTECT
> means this gfn is write-protected for shadow page write protection.
>
> On the fast page fault path, we will atomically set the spte.W bit if
> spte.SPTE_ALLOW_WRITE = 1 and spte.SPTE_WRITE_PROTECT = 0; this is safe because any
> change to these bits will be detected by the cmpxchg.
>
> But we need to carefully check the mapping between gfn and pfn, since the cmpxchg can
> only ensure the spte (and thus the pfn) is unchanged, not the gfn it maps. This is an
> ABA problem; for example, the following case can happen:
>
> At the beginning:
>     gpte = gfn1
>     gfn1 is mapped to pfn1 on the host
>     spte is the shadow page table entry corresponding to gpte, and
>     spte = pfn1
>
>    VCPU 0                          VCPU 1
> on page fault path:
>
>    old_spte = *spte;
>                                    pfn1 is swapped out:
>                                        spte = 0;
>                                    pfn1 is re-allocated for gfn2
>                                    gpte is changed by the host and now
>                                    points to gfn2:
>                                        spte = pfn1;
>
>    if (cmpxchg(spte, old_spte, old_spte+W))
>        mark_page_dirty(vcpu->kvm, gfn1)
>            OOPS!!!
>
> We set the dirty bit for gfn1, which means the dirty gfn2 is lost in the dirty bitmap.
>
> For a direct sp, we can easily avoid this since the gfn of a direct sp's spte is fixed.
> For an indirect sp, before we do the cmpxchg, we call gfn_to_pfn_atomic to pin the gfn
> to the pfn, because after gfn_to_pfn_atomic:
> - we hold a refcount on the pfn, which means the pfn cannot
>   be freed and reused for another gfn.
> - the pfn is writable, which means it cannot be shared between
>   different gfns by KSM.
> Then we can ensure the dirty bitmap is correctly set for a gfn.
This is one possible scenario, OK. It would be better if you could list all
the possibilities. We need to check every possible case carefully.
> > Same with current users of walk_shadow_page_lockless_begin.
>
>
> After walk_shadow_page_lockless_begin, it is safe since reader_counter has been
> increased:
>
> static void walk_shadow_page_lockless_begin(struct kvm_vcpu *vcpu)
> {
> rcu_read_lock();
> atomic_inc(&vcpu->kvm->arch.reader_counter);
>
> /* Increase the counter before walking shadow page table */
> smp_mb__after_atomic_inc();
> }
>
> It is not enough?
It is, I don't know what I was talking about.