linux-kernel - Re: [PATCH] mmu notifiers #v5

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20080203031457.GA16127@sgi.com>
Date:	Sat, 2 Feb 2008 21:14:57 -0600
From:	Jack Steiner <steiner@....com>
To:	Andrea Arcangeli <andrea@...ranet.com>
Cc:	Christoph Lameter <clameter@....com>, Robin Holt <holt@....com>,
	Avi Kivity <avi@...ranet.com>, Izik Eidus <izike@...ranet.com>,
	kvm-devel@...ts.sourceforge.net,
	Peter Zijlstra <a.p.zijlstra@...llo.nl>,
	linux-kernel@...r.kernel.org, linux-mm@...ck.org,
	daniel.blueman@...drics.com
Subject: Re: [PATCH] mmu notifiers #v5

On Sun, Feb 03, 2008 at 03:17:04AM +0100, Andrea Arcangeli wrote:
> On Fri, Feb 01, 2008 at 11:23:57AM -0800, Christoph Lameter wrote:
> > Yes so your invalidate_range is still some sort of dysfunctional 
> > optimization? Gazillions of invalidate_page's will have to be executed 
> > when tearing down large memory areas.
> 
> I don't know if gru can flush the external TLB references with a
> single instruction like the cpu can do by overwriting %cr3. As far as

The mechanism obviously is a little different, but the GRU can flush an
external TLB in a single read/write/flush of a cacheline (should take a few
100 nsec). Typically, an application uses a single GRU. Threaded
applications, however, could use one GRU per thread, so it is possible that
multiple TLBs must be flushed. In some cases, the external flush can be
avoided if the GRU is not currently in use by the thread.

Also, most (but not all) applications that use the GRU do not usually do
anything that requires frequent flushing (fortunately). The GRU is intended
for HPC-like applications. These don't usually do frequent map/unmap
operations or anything else that requires a lot of flushes.

I expect that KVM is a lot different.

I have most of the GRU code working with the latest mmuops patch. I still
have a list of loose ends that I'll get to next week. The most important is
the exact handling of the range invalidates. The code that I currently have
works (mostly) but has a few endcases that will cause problems. Once I
finish, I'll be glad to send you snippets of the code (or all of it) if you
would like to take a look.


> vmx/svm are concerned, you got to do some fixed amount of work on the
> rmap structures and each spte, for each 4k invalidated regardless of
> page/pages/range/ranges. You can only share the cost of the lock and
> of the memslot lookup, so you lookup and take the lock once and you
> drop 512 sptes instead of just 1. Similar to what we do in the main
> linux VM by taking the PT lock for every 512 sptes.
> 
> So you worry about gazillions of invalidate_page without taking into
> account that even if you call invalidate_range_end(0, -1ULL) KVM will
> have to mangle over 4503599627370496 rmap entries anyway. Yes calling
> 1 invalidate_range_end instead of 4503599627370496 invalidate_page,
> will be faster, but not much faster. For KVM it remains an O(N)
> operation, where N is the number of pages. I'm not so sure we need to
> worry about my invalidate_pages being limited to 512 entries.
> 
> Perhaps GRU is different, I'm just describing the KVM situation here.
> 
> As far as KVM is concerned it's more sensible to be able to scale when
> there are 1024 kvm threads on 1024 cpus and each kvm-vcpu is accessing
> a different guest physical page (especially when new chunks of memory
> are allocated for the first time) that won't collide the whole time on
> naturally aligned 2M chunks of virtual addresses.
> 
> > And that would not be enough to hold of new references? With small tweaks 
> > this should work with a common scheme. We could also redefine the role 
> > of _start and _end slightly to just require that the refs are removed when 
> > _end completes. That would allow the KVM page count ref to work as is now 
> > and would avoid the individual invalidate_page() callouts.
> 
> I can already only use _end and ignore _start, only remaining problem
> is this won't be 100% coherent (my patch is 100% coherent thanks to PT
> lock implicitly serializing follow_page/get_user_pages of the KVM/GRU
> secondary MMU faults). My approach give a better peace of mind with
> full scalability and no lock recursion when it's the KVM page fault
> that invokes get_user_pages that invokes the linux page fault routines
> that will try to execute _start taking the lock to block the page
> fault that is already running...
> 
> > > > The GRU has no page table on its own. It populates TLB entries on demand 
> > > > using the linux page table. There is no way it can figure out when to 
> > > > drop page counts again. The invalidate calls are turned directly into tlb 
> > > > flushes.
> > > 
> > > Yes, this is why it can't serialize follow_page with only the PT lock
> > > with your patch. KVM may do it once you add start,end to range_end
> > > only thanks to the additional pin on the page.
> > 
> > Right but that pin requires taking a refcount which we cannot do.
> 
> GRU can use my patch without the pin. XPMEM obviously can't use my
> patch as my invalidate_page[s] are under the PT lock (a feature to fit
> GRU/KVM in the simplest way), this is why an incremental patch adding
> invalidate_range_start/end would be required to support XPMEM too.

--- jack
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/