Message-ID: <20140912192100.GB5196@gmail.com>
Date:	Fri, 12 Sep 2014 15:21:01 -0400
From:	Jerome Glisse <j.glisse@...il.com>
To:	Andrew Morton <akpm@...ux-foundation.org>
Cc:	Joerg Roedel <joro@...tes.org>,
	Andrea Arcangeli <aarcange@...hat.com>,
	Rik van Riel <riel@...hat.com>, jroedel@...e.de,
	Peter Zijlstra <a.p.zijlstra@...llo.nl>, John.Bridgman@....com,
	Jesse Barnes <jbarnes@...tuousgeek.org>,
	Hugh Dickins <hughd@...gle.com>, linux-kernel@...r.kernel.org,
	ben.sander@....com, linux-mm@...ck.org,
	Jerome Glisse <jglisse@...hat.com>, Jay.Cornwall@....com,
	Mel Gorman <mgorman@...e.de>,
	David Woodhouse <dwmw2@...radead.org>,
	Johannes Weiner <jweiner@...hat.com>,
	iommu@...ts.linux-foundation.org
Subject: Re: [PATCH 0/3 v3] mmu_notifier: Allow to manage CPU external TLBs

On Fri, Sep 12, 2014 at 12:10:36PM -0700, Andrew Morton wrote:
> On Wed, 10 Sep 2014 20:02:12 -0400 Jerome Glisse <j.glisse@...il.com> wrote:
> 
> > On Wed, Sep 10, 2014 at 03:01:25PM -0700, Andrew Morton wrote:
> > > On Tue,  9 Sep 2014 17:43:51 +0200 Joerg Roedel <joro@...tes.org> wrote:
> > > 
> > > > here is a patch-set to extend the mmu_notifiers in the Linux
> > > > kernel to allow managing CPU external TLBs. Those TLBs may
> > > > be implemented in IOMMUs or any other external device, e.g.
> > > > ATS/PRI capable PCI devices.
> > > > 
> > > > The problem with managing these TLBs is the semantics of
> > > > the invalidate_range_start/end call-backs currently
> > > > available. Currently the subsystem using mmu_notifiers has
> > > > to guarantee that no new TLB entries are established between
> > > > invalidate_range_start/end. Furthermore the
> > > > invalidate_range_start() function is called when all pages
> > > > are still mapped and invalidate_range_end() when the pages
> > > > are unmapped and already freed.
> > > > 
> > > > So both call-backs can't be used to safely flush any non-CPU
> > > > TLB because _start() is called too early and _end() too
> > > > late.
> > > 
> > > There's a lot of missing information here.  Why don't the existing
> > > callbacks suit non-CPU TLBs?  What is different about them?  Please
> > > update the changelog to contain all this context.
> > 
> > So unlike KVM or any current user of the mmu_notifier API, PCIe ATS/PASID
> > capable hardware (IOMMUv2 for AMD; VT-d is, I believe, the marketing name
> > for Intel) walks the CPU page table on its own and has its own TLB cache.
> > In fact not only the IOMMU can have a TLB cache; any single PCIe device
> > can implement its own local TLB cache.
> > 
> > So if we flush the IOMMU and device TLB inside the range_start callback
> > there is a chance that the hw will just rewalk the cpu page table and
> > repopulate its TLB before the CPU page table is actually updated.
> > 
> > Now if we shoot down the TLB inside the range_end callback, then we
> > are too late, i.e. the CPU page table is already populated with new entries
> > and all the TLBs in the IOMMU and in devices might be pointing to the old
> > pages.
> > 
> > So the aim of this callback is to happen right after the CPU page table
> > is updated but before the old page is freed or recycled.
> 
> You had me up to here.  What is "the old page"?  pagetable page, I
> assume?  upper, middle, lower, all the above?  Why does the callback
> function need to get at that page?

No, "the old page" is not a page freed from the CPU page table directory, but
a page that used to belong to the process, i.e. a page that was backing an
address range of the process we are interested in. The kernel flushes the CPU
TLB and then frees those pages before calling the mmu_notifier end callback
(for good reasons).

So we want to flush the hardware TLB (which is distinct from the CPU TLB)
before any of the pages that used to back an address range of the process are
freed or recycled.

Sorry for the confusion my wording caused.
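
To make the ordering concrete, here is a rough sketch (illustrative only, not
the actual patch code; unmap_one_page_sketch is a made-up helper) of where the
new callback has to sit in a typical CPU unmap path:

#include <linux/mm.h>
#include <linux/rmap.h>
#include <linux/mmu_notifier.h>

/* Hedged illustration of the ordering discussed above. */
static void unmap_one_page_sketch(struct vm_area_struct *vma,
				  unsigned long addr, pte_t *ptep,
				  struct page *page)
{
	struct mm_struct *mm = vma->vm_mm;

	mmu_notifier_invalidate_range_start(mm, addr, addr + PAGE_SIZE);

	/* 1) Clear the CPU PTE and flush the CPU TLB. */
	ptep_clear_flush(vma, addr, ptep);

	/*
	 * 2) Flush secondary (IOMMU/device) TLBs now: the CPU page table
	 *    no longer maps the page, so the hardware cannot re-walk it
	 *    and repopulate a stale entry.
	 */
	mmu_notifier_invalidate_range(mm, addr, addr + PAGE_SIZE);

	/* 3) Only now is it safe to drop the rmap and free the page. */
	page_remove_rmap(page);
	put_page(page);

	mmu_notifier_invalidate_range_end(mm, addr, addr + PAGE_SIZE);
}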

> 
> > > Also too skimpy.  I *think* this is a variant of the problem in the
> > > preceding paragraph.  We get a fault storm (which is problem 2) and
> > > sometimes the faulting device gives up (which is problem 1).
> > > 
> > > Or something.  Please de-fog all of this.
> > > 
> > 
> > Does the above explanation help you understand the issue? Note that on each
> > device page fault an IRQ is triggered (well, the way the hw works is a bit
> > more complex than that).
> 
> It helps.  Please understand that my main aim here is not to ask
> specific questions.  I seek to demonstrate the level of detail which I
> believe should be in the changelog so that others can effectively
> understand and review this patchset, and to understand this code in the
> future.

Totally agree, and I am rooting for better commit messages in other kernel
components too; since I am reviewing more patches across distinct components I
now clearly see the value of good and exhaustive commit messages.

> 
> So it would be good if you were to update the changelog to answer these
> questions.  It would be better if it were also to answer similar
> questions which I didn't think to ask!
> 
> > > > Furthermore the _start()/end() notifiers only catch the
> > > > moment when page mappings are released, but not page-table
> > > > pages. But this is necessary for managing external TLBs when
> > > > the page-table is shared with the CPU.
> > > 
> > > How come?
> > 
> > As explained above, end() might happen after pages that were previously
> > mapped are freed or recycled.
> 
> Again, some explanation of why the driver wishes to inspect the
> pagetable pages would be useful.
> 
> > > 
> > > > To solve this situation I wrote a patch-set to introduce a
> > > > new notifier call-back: mmu_notifier_invalidate_range(). This
> > > > notifier lifts the strict requirement that no new
> > > > references are taken in the range between _start() and
> > > > _end(). When the subsystem can't guarantee that no new
> > > > references are taken, it has to provide the
> > > > invalidate_range() call-back to clear any new references in
> > > > there.
> > > > 
> > > > It is called between invalidate_range_start() and _end()
> > > > every time the VMM has to wipe out any references to a
> > > > couple of pages. These are usually the places where the CPU
> > > > TLBs are flushed too, and where it's important that this
> > > > happens before invalidate_range_end() is called.
> > > > 
> > > > Any comments and review appreciated!
> > > 
> > > The patchset looks decent, although I find it hard to review because I
> > > just wasn't provided with enough of the thinking that went into it.  I
> > > have enough info to look at the C code, but not enough info to identify
> > > and evaluate alternative implementation approaches, to identify
> > > possible future extensions, etc.
> > > 
> > > The patchset does appear to add significant additional overhead to hot
> > > code paths when mm_has_notifiers(mm).  Please let's update the
> > > changelog to address this rather important concern.  How significant is
> > > the impact on such mm's, how common are such mm's now and in the
> > > future, should we (for example) look at short-circuiting
> > > __mmu_notifier_invalidate_range() if none of the registered notifiers
> > > implement ->invalidate_range(), etc.
> > 
> > So one might feel like just completely removing the range_start()/end()
> > callbacks from the mmu_notifier and sticking to this one callback, but that
> > will not work with other hardware, like the hardware my HMM patchset targets
> > (I sent it again a couple of weeks ago: https://lkml.org/lkml/2014/8/29/423).
> > 
> > Right now there are very few users of the mmu_notifier: SGI, Xen, KVM and,
> > sadly, recently the GPU folks (which I am part of too). So as GPUs start
> > to use it we will see a lot more applications going through the
> > mmu_notifier callbacks. Yet you do want to leverage hardware like GPUs,
> > you do want to use the same address space on the GPU as on the CPU, and
> > thus you do want to share, or at least keep synchronized, the GPU view of
> > the CPU page table.
> > 
> > Right now, for IOMMUv2 this means adding this callback; for devices that
> > do not rely on the ATS/PASID PCIe extensions it means something like HMM.
> > Also note that HMM is a superset of IOMMUv2, as it could be used at the
> > same time and provides more features, mainly allowing migration of some
> > pages to device-local memory for performance purposes.
> 
> OK, but none of that addressed my question regarding the performance
> cost.

I think Joerg addressed the performance concern in his email. Basically, a
CPU TLB flush is already a costly operation, and the hardware TLB flush should
barely register as noise on top of it (this is the current theory, but hard to
say until we have a full working stack, including userspace, with which proper
and extensive benchmarking and profiling can be performed).

> 
> > I think this sums up all the motivation behind this patchset and also behind
> > my other patchset. As usual I am happy to discuss alternative ways to do
> > things, but I think that the path of least disruption from the current code
> > is the one implemented by these patchsets.
> 
> "least disruption" is nice, but "good implementation" is better.  In
> other words I'd prefer the more complex implementation if the end
> result is better.  Depending on the magnitudes of "complex" and
> "better" :) Two years from now (which isn't very long), I don't think
> we'll regret having chosen the better implementation.
> 
> Does this patchset make compromises to achieve low disruption?

Well, right now I think we are lacking proper userspace support with which
this code and the new use case as a whole can be stress tested, allowing us to
gather profiling information.

I think as a first step we should take this least disruptive path, and if it
proves to perform badly then we should work toward a possibly more complex
design. Note that the only complex design I can think of involves an overhaul
of how process memory management is done, probably linking CPU page table
updates with the scheduler to try to hide the cost of those updates by
scheduling other threads in the meantime.

Cheers,
Jérôme
