linux-kernel - Re: [PATCH 2/5] kernel.h: Add non_block

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20190815163238.GA30781@redhat.com>
Date:   Thu, 15 Aug 2019 12:32:38 -0400
From:   Jerome Glisse <jglisse@...hat.com>
To:     Jason Gunthorpe <jgg@...pe.ca>
Cc:     Daniel Vetter <daniel.vetter@...ll.ch>,
        Michal Hocko <mhocko@...nel.org>,
        Andrew Morton <akpm@...ux-foundation.org>,
        LKML <linux-kernel@...r.kernel.org>,
        Linux MM <linux-mm@...ck.org>,
        DRI Development <dri-devel@...ts.freedesktop.org>,
        Intel Graphics Development <intel-gfx@...ts.freedesktop.org>,
        Peter Zijlstra <peterz@...radead.org>,
        Ingo Molnar <mingo@...hat.com>,
        David Rientjes <rientjes@...gle.com>,
        Christian König <christian.koenig@....com>,
        Masahiro Yamada <yamada.masahiro@...ionext.com>,
        Wei Wang <wvw@...gle.com>,
        Andy Shevchenko <andriy.shevchenko@...ux.intel.com>,
        Thomas Gleixner <tglx@...utronix.de>,
        Jann Horn <jannh@...gle.com>, Feng Tang <feng.tang@...el.com>,
        Kees Cook <keescook@...omium.org>,
        Randy Dunlap <rdunlap@...radead.org>,
        Daniel Vetter <daniel.vetter@...el.com>
Subject: Re: [PATCH 2/5] kernel.h: Add non_block_start/end()

On Thu, Aug 15, 2019 at 12:10:28PM -0300, Jason Gunthorpe wrote:
> On Thu, Aug 15, 2019 at 04:43:38PM +0200, Daniel Vetter wrote:
> 
> > You have to wait for the gpu to finnish current processing in
> > invalidate_range_start. Otherwise there's no point to any of this
> > really. So the wait_event/dma_fence_wait are unavoidable really.
> 
> I don't envy your task :|
> 
> But, what you describe sure sounds like a 'registration cache' model,
> not the 'shadow pte' model of coherency.
> 
> The key difference is that a regirstationcache is allowed to become
> incoherent with the VMA's because it holds page pins. It is a
> programming bug in userspace to change VA mappings via mmap/munmap/etc
> while the device is working on that VA, but it does not harm system
> integrity because of the page pin.
> 
> The cache ensures that each initiated operation sees a DMA setup that
> matches the current VA map when the operation is initiated and allows
> expensive device DMA setups to be re-used.
> 
> A 'shadow pte' model (ie hmm) *really* needs device support to
> directly block DMA access - ie trigger 'device page fault'. ie the
> invalidate_start should inform the device to enter a fault mode and
> that is it.  If the device can't do that, then the driver probably
> shouldn't persue this level of coherency. The driver would quickly get
> into the messy locking problems like dma_fence_wait from a notifier.

I think here we do not agree on the hardware requirement. For GPU
we will always need to be able to wait for some GPU fence from inside
the notifier callback, there is just no way around that for many of
the GPUs today (i do not see any indication of that changing).

Driver should avoid lock complexity by using wait queue so that the
driver notifier callback can wait without having to hold some driver
lock. However there will be at least one lock needed to update the
internal driver state for the range being invalidated. That lock is
just the device driver page table lock for the GPU page table
associated with the mm_struct. In all GPU driver so far it is a short
lived lock and nothing blocking is done while holding it (it is just
about updating page table directory really wether it is filling it or
clearing it).


> 
> It is important to identify what model you are going for as defining a
> 'registration cache' coherence expectation allows the driver to skip
> blocking in invalidate_range_start. All it does is invalidate the
> cache so that future operations pick up the new VA mapping.
> 
> Intel's HFI RDMA driver uses this model extensively, and I think it is
> well proven, within some limitations of course.
> 
> At least, 'registration cache' is the only use model I know of where
> it is acceptable to skip invalidate_range_end.

Here GPU are not in the registration cache model, i know it might looks
like it because of GUP but GUP was use just because hmm did not exist
at the time.

Cheers,
Jérôme