linux-kernel - Re: [PATCH 2/5] kernel.h: Add non_block

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20190815185743.GQ21596@ziepe.ca>
Date:   Thu, 15 Aug 2019 15:57:43 -0300
From:   Jason Gunthorpe <jgg@...pe.ca>
To:     Jerome Glisse <jglisse@...hat.com>
Cc:     Daniel Vetter <daniel.vetter@...ll.ch>,
        Michal Hocko <mhocko@...nel.org>,
        Andrew Morton <akpm@...ux-foundation.org>,
        LKML <linux-kernel@...r.kernel.org>,
        Linux MM <linux-mm@...ck.org>,
        DRI Development <dri-devel@...ts.freedesktop.org>,
        Intel Graphics Development <intel-gfx@...ts.freedesktop.org>,
        Peter Zijlstra <peterz@...radead.org>,
        Ingo Molnar <mingo@...hat.com>,
        David Rientjes <rientjes@...gle.com>,
        Christian König <christian.koenig@....com>,
        Masahiro Yamada <yamada.masahiro@...ionext.com>,
        Wei Wang <wvw@...gle.com>,
        Andy Shevchenko <andriy.shevchenko@...ux.intel.com>,
        Thomas Gleixner <tglx@...utronix.de>,
        Jann Horn <jannh@...gle.com>, Feng Tang <feng.tang@...el.com>,
        Kees Cook <keescook@...omium.org>,
        Randy Dunlap <rdunlap@...radead.org>,
        Daniel Vetter <daniel.vetter@...el.com>
Subject: Re: [PATCH 2/5] kernel.h: Add non_block_start/end()

On Thu, Aug 15, 2019 at 02:27:19PM -0400, Jerome Glisse wrote:
> > How exactly? This is holding the page pin, so the only way the VA
> > mapping can be changed is via explicit user action.
> > 
> > ie:
> > 
> >    gpu_write_something(va, size)
> >    mmap(.., va, size, MMAP_FIXED);
> >    gpu_wait_done()
> > 
> > This is racy and indeterminate with both models.
> > 
> > Based on the comment in i915 it appears to be going on the model that
> > changes to the mmap by userspace when the GPU is working on it is a
> > programming bug. This is reasonable, lots of systems use this kind of
> > consistency model.
> 
> Well userspace process doing munmap(), mremap(), fork() and things like
> that are a bug from the i915 kernel and userspace contract point of view.
> 
> But things like migration or reclaim are not cover under that contract
> and for those the expectation is that CPU access to the same virtual address
> should allow to get what was last written to it either by the GPU or the
> CPU.

Okay, this is a more reasonable point - I agree the i915 registration
cache model precludes using migration and thus DEVICE_PRIVATE. This is
a strong motivation to use the hmm approach

But we started out this converstation asking if i915 is correct, and I
still say a registration cache model is a functionally correct way to
use notifiers.

> Because of the reference on the page the i915 driver can forego the mmu
> notifier end callback. The thing here is that taking a page reference
> is pointless if we have better synchronization and tracking of mmu
> notifier. Hence converting to hmm mirror allows to avoid taking a ref
> on the page while still keeping the same functionality as of today.

However, there is a huge trade off here. Drivers like this are going
to have a very complicated locking inside invalidate_range_start as
they must sleep waiting for dma buffer references to go to zero.

> GPU driver have complex usage pattern the tlb shootdown is implicit
> once the GEM object associated with the uptr is invalidated it means
> next time userspace submit command against that GEM object it will
> have to re-validate it which means re-program the GPU page table to
> point to the proper address (and re-call GUP).

I think it is a mistake to try and cram the very different approaches
to notifiers into the same API. SW ref counting of DMA buffers vs HW
async page faulting have totally different requirements and locking
schemes.

This explains why AMDGPU gets away with not using the hmm API
properly, it is probably relying on its DMA refcount, not the hmm
valid, to keep things in order?

I think the API approach in hmm_mirror is reasonable for page faulting
HW, but does not serve refcounting HW well at all.

Jason