Message-ID: <20120409193144.GA23053@amt.cnet>
Date:	Mon, 9 Apr 2012 16:31:44 -0300
From:	Marcelo Tosatti <mtosatti@...hat.com>
To:	Xiao Guangrong <xiaoguangrong.eric@...il.com>
Cc:	Avi Kivity <avi@...hat.com>,
	Xiao Guangrong <xiaoguangrong@...ux.vnet.ibm.com>,
	LKML <linux-kernel@...r.kernel.org>, KVM <kvm@...r.kernel.org>
Subject: Re: [PATCH 00/13] KVM: MMU: fast page fault

On Tue, Apr 10, 2012 at 02:13:41AM +0800, Xiao Guangrong wrote:
> On 04/10/2012 01:58 AM, Marcelo Tosatti wrote:
> 
> > On Mon, Apr 09, 2012 at 04:12:46PM +0300, Avi Kivity wrote:
> >> On 03/29/2012 11:20 AM, Xiao Guangrong wrote:
> >>> * Idea
> >>> The present bit of the page fault error code (EFEC.P) indicates whether the
> >>> page table is populated on all levels. If this bit is set, we know
> >>> the page fault is caused by the page-protection bits (e.g. W/R bit) or
> >>> the reserved bits.
> >>>
> >>> In KVM, in most cases, this kind of page fault (EFEC.P = 1) can be
> >>> fixed simply: page faults caused by reserved bits
> >>> (EFEC.P = 1 && EFEC.RSV = 1) have already been filtered out in the fast
> >>> mmio path. All we need to do to fix the remaining page faults
> >>> (EFEC.P = 1 && RSV != 1) is to increase the corresponding access on
> >>> the spte.
> >>>
> >>> This patchset introduces a fast path to fix this kind of page fault: it
> >>> runs outside mmu-lock and need not walk the host page table to get the
> >>> mapping from gfn to pfn.
> >>>
> >>>
> >>
> >> This patchset is really worrying to me.
> >>
> >> It introduces a lot of concurrency into data structures that were not
> >> designed for it.  Even if it is correct, it will be very hard to
> >> convince ourselves that it is correct, and if it isn't, to debug those
> >> subtle bugs.  It will also be much harder to maintain the mmu code than
> >> it is now.
> >>
> >> There are a lot of things to check.  Just as an example, we need to be
> >> sure that if we use rcu_dereference() twice in the same code path, that
> >> any inconsistencies due to a write in between are benign.  Doing that is
> >> a huge task.
> >>
> >> But I appreciate the performance improvement and would like to see a
> >> simpler version make it in.  This needs to reduce the amount of data
> >> touched in the fast path so it is easier to validate, and perhaps reduce
> >> the number of cases that the fast path works on.
> >>
> >> I would like to see the fast path as simple as
> >>
> >>   rcu_read_lock();
> >>
> >>   (lockless shadow walk)
> >>   spte = ACCESS_ONCE(*sptep);
> >>
> >>   if (!(spte & PT_MAY_ALLOW_WRITES))
> >>         goto slow;
> >>
> >>   gfn = kvm_mmu_page_get_gfn(sp, sptep - sp->sptes);
> >>   mark_page_dirty(kvm, gfn);
> >>
> >>   new_spte = spte & ~(PT_MAY_ALLOW_WRITES | PT_WRITABLE_MASK);
> >>   if (cmpxchg(sptep, spte, new_spte) != spte)
> >>        goto slow;
> >>
> >>   rcu_read_unlock();
> >>   return;
> >>
> >> slow:
> >>   rcu_read_unlock();
> >>   slow_path();
> >>
> >> It now becomes the responsibility of the slow path to maintain *sptep &
> >> PT_MAY_ALLOW_WRITES, but that path has a simpler concurrency model.  It
> >> can be as simple as a clear_bit() before we update sp->gfns[] or if we
> >> add host write protection.
> >>
> >> Sorry, it's too complicated for me.  Marcelo, what's your take?
> > 
> > The improvement is small and limited to special cases (migration should
> > be rare and framebuffer memory accounts for a small percentage of total
> > memory).
> > 
> > For one, how can this be safe against mmu notifier methods?
> > 
> > KSM			      |VCPU0		| VCPU1
> > 		 	      | fault		| fault
> > 			      | cow-page	|
> > 			      |	set spte RW	|
> > 			      |			|
> > write protect host pte	      |			|
> > grab mmu_lock		      |			|
> > remove writeable bit in spte  |			|
> > increase mmu_notifier_seq     |			|  spte = read-only spte
> > release mmu_lock	      |			|  cmpxchg succeeds, RO->RW!
> > 
> > MMU notifiers rely on the fault path sequence being
> > 
> > read host pte
> > read mmu_notifier_seq
> > spin_lock(mmu_lock)
> > if (mmu_notifier_seq changed)
> > 	goodbye, host pte value is stale
> > spin_unlock(mmu_lock)
> > 
> > By the example above, you cannot rely on the spte value alone,
> > mmu_notifier_seq must be taken into account.
> 
> 
> No.
> 
> When KSM changes the host page to read-only, the HOST_WRITABLE bit
> of the spte should be removed, which means the spte value changes
> and cmpxchg can detect that change.
> 
> Note: we mark the spte writable only if spte.HOST_WRITABLE is
> set.

Right. 
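
For reference, a minimal user-space sketch of the argument above. It is not
KVM code: the bit positions and the function names (fast_path_commit,
fast_fix_write_fault, host_write_protect) are hypothetical, and C11 atomics
stand in for the kernel's cmpxchg(). It only illustrates why clearing
HOST_WRITABLE on the notifier side lets the lockless path notice the change.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit layout, for illustration only (not KVM's real spte format). */
#define SPTE_WRITABLE       (UINT64_C(1) << 1)
#define SPTE_HOST_WRITABLE  (UINT64_C(1) << 10)

/*
 * Second half of the lockless fast path: given the spte value sampled
 * earlier, try to set the writable bit. Refuses if the sampled value did
 * not have HOST_WRITABLE, and fails if the spte changed since the sample.
 */
static bool fast_path_commit(_Atomic uint64_t *sptep, uint64_t snapshot)
{
    if (!(snapshot & SPTE_HOST_WRITABLE))
        return false;                     /* host page is read-only: slow path */

    /* Succeeds only if *sptep still equals the sampled value. */
    return atomic_compare_exchange_strong(sptep, &snapshot,
                                          snapshot | SPTE_WRITABLE);
}

/* Whole fast path: sample the spte, then try to commit. */
static bool fast_fix_write_fault(_Atomic uint64_t *sptep)
{
    return fast_path_commit(sptep, atomic_load(sptep));
}

/*
 * Stand-in for the notifier side (run under mmu_lock in the real code):
 * when the host pte becomes read-only, clear both the writable bit and
 * HOST_WRITABLE, so the spte value visibly changes.
 */
static void host_write_protect(_Atomic uint64_t *sptep)
{
    atomic_fetch_and(sptep, ~(SPTE_WRITABLE | SPTE_HOST_WRITABLE));
}
```

In this model, a fast path that samples the spte after host_write_protect()
sees HOST_WRITABLE clear and falls back to the slow path; one that sampled it
before has its compare-and-exchange fail because the value changed. Either
way the RO->RW transition from the diagram does not happen.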