linux-kernel - Re: [PATCH] mmu notifiers #v8

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20080303191540.GB11156@sgi.com>
Date:	Mon, 3 Mar 2008 13:15:40 -0600
From:	Jack Steiner <steiner@....com>
To:	Nick Piggin <npiggin@...e.de>
Cc:	Andrea Arcangeli <andrea@...ranet.com>, akpm@...ux-foundation.org,
	Robin Holt <holt@....com>, Avi Kivity <avi@...ranet.com>,
	Izik Eidus <izike@...ranet.com>,
	kvm-devel@...ts.sourceforge.net,
	Peter Zijlstra <a.p.zijlstra@...llo.nl>,
	general@...ts.openfabrics.org,
	Steve Wise <swise@...ngridcomputing.com>,
	Roland Dreier <rdreier@...co.com>,
	Kanoj Sarcar <kanojsarcar@...oo.com>,
	linux-kernel@...r.kernel.org, linux-mm@...ck.org,
	daniel.blueman@...drics.com, Christoph Lameter <clameter@....com>
Subject: Re: [PATCH] mmu notifiers #v8

On Mon, Mar 03, 2008 at 07:45:17PM +0100, Nick Piggin wrote:
> On Mon, Mar 03, 2008 at 12:06:05PM -0600, Jack Steiner wrote:
> > On Mon, Mar 03, 2008 at 05:59:10PM +0100, Nick Piggin wrote:
> > > > Maintaining a long-term reference on a page is a problem. The GRU does not
> > > > currently maintain tables to track the pages for which dropins have been done.
> > > > 
> > > > The GRU has a large internal TLB and is designed to reference up to 8PB of
> > > > memory. The size of the tables to track this many referenced pages would be
> > > > a problem (at best).
> > > 
> > > Is it any worse a problem than the pagetables of the processes which have
> > > their virtual memory exported to GRU? AFAIKS, no; it is on the same
> > > magnitude of difficulty. So you could do it without introducing any
> > > fundamental problem (memory usage might be increased by some constant
> > > factor, but I think we can cope with that in order to make the core patch
> > > really nice and simple).
> > 
> > Functionally, the GRU is very close to what I would consider to be the
> > "standard TLB" model. Dropins and flushs map closely to processor dropins
> > and flushes for cpus.  The internal structure of the GRU TLB is identical to
> > the TLB of existing cpus.  Requiring the GRU driver to track dropins with
> > long term page references seems to me a deviation from having the basic
> > mmuops support a "standard TLB" model. AFAIK, no other processor requires
> > this.
> 
> That is because the CPU TLBs have the mmu_gather batching APIs which
> avoid the problem. It would be possible to do something similar for
> GRU which would involve taking a reference for each page-to-be-invalidated
> in invalidate_page, and release them when you invalidate_range. Or else
> do some other scheme which makes mmu notifiers work similarly to the
> mmu gather API. But not just go an invent something completely different
> in the form of this invalidate_begin,clear linux pte,invalidate_end API.

Correct. If the mmu_gather were passed on the mmuops callout and the callout were
done at the same point as the tlb_finish_mmu(), the GRU could
efficiently work w/o the range invalidates. A range invalidate might still
be slightly more efficient but not measureable so. The net difference is
not worth the extra complexity of range callouts.


> 
> 
> > Tracking TLB dropins (and long term page references) could be done but it
> > adds significant complexity and scaling issues. The size of the tables to
> > track many TB (to PB) of memory can get large. If the memory is being
> > referenced by highly threaded applications, then the problem becomes even
> > more complex. Either tables must be replicated per-thread (and require even
> > more memory), or the table structure becomes even more complex to deal with
> > node locality, cacheline bouncing, etc.
> 
> I don't think it would be that significant in terms of complexity or
> scaling.
> 
> For a quick solution, you could stick a radix tree in each of your mmu
> notifiers registered (ie. one per mm), which is indexed on virtual address
> >> PAGE_SHIFT, and returns the struct page *. Size is no different than
> page tables, and locking is pretty scalable.
> 
> After that, I would really like to see whether the numbers justify
> larger changes.

I'm still concerned about performance. Each dropin would first have to access
an additional data structure that would most likely be non-node-local and
non-cache-resident. The net effect would be measurable but not a killer.

I haven't thought about locking requirements for the radix tree. Most accesses
would be read-only & updates infrequent. Any chance of an RCU-based radix
implementation?  Otherwise, don't we add the potential for hot locks/cachelines
for threaded applications ???
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/