lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:   Mon, 28 Sep 2020 15:39:28 -0300
From:   Jason Gunthorpe <jgg@...pe.ca>
To:     Linus Torvalds <torvalds@...ux-foundation.org>
Cc:     Peter Xu <peterx@...hat.com>, Leon Romanovsky <leonro@...dia.com>,
        John Hubbard <jhubbard@...dia.com>,
        Linux-MM <linux-mm@...ck.org>,
        Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Jan Kara <jack@...e.cz>, Michal Hocko <mhocko@...e.com>,
        Kirill Tkhai <ktkhai@...tuozzo.com>,
        Kirill Shutemov <kirill@...temov.name>,
        Hugh Dickins <hughd@...gle.com>,
        Christoph Hellwig <hch@....de>,
        Andrea Arcangeli <aarcange@...hat.com>,
        Oleg Nesterov <oleg@...hat.com>, Jann Horn <jannh@...gle.com>
Subject: Re: [PATCH 1/5] mm: Introduce mm_struct.has_pinned

On Mon, Sep 28, 2020 at 10:54:28AM -0700, Linus Torvalds wrote:
> On Mon, Sep 28, 2020 at 10:23 AM Peter Xu <peterx@...hat.com> wrote:
> >
> > Yes...  Actually I am also thinking about the complete solution to cover
> > read-only fast-gups too, but now I start to doubt this, at least for the fork()
> > path.  E.g. if we'd finally like to use pte_protnone() to replace the current
> > pte_wrprotect(), we'll be able to also block the read gups, but we'll suffer
> > the same degradation on normal fork()s, or even more.  Seems unacceptable.
> 
> So I think the real question about pinned read gups is what semantics
> they should have.
> 
> Because honestly, I think we have two options:
> 
>  - the current "it gets a shared copy from the page tables"
>
>  - the "this is an exclusive pin, and it _will_ follow the source VM
> changes, and never break"

The requirement depends on what the driver is doing. Consider a simple
ring-to-device scheme:

   ring = mmap()
   pin_user_pages(FOLL_LONGTERM)
   ring[0] = [..]
   trigger_kernel()

Sort of like iouring. Here the kernel will pin the zero page and will
never see any user modifications to the buffer. We must do #2.

While something like read(O_DIRECT) which only needs the buffer to be
stable during a system call would be fine with #1 (and data
incoherence in general)

> In other words, a read pin isn't really any different from a read GUP.
> You get a reference to a page that is valid at the time of the page
> lookup, and absolutely nothing more.

Yes, so far all this new pin stuff has really focused on the write
side - the motivating issue was the set_page_dirty() oops
 
> But honestly, that is largely going to _be_ the same as a write pin,
> because it absolutely needs to do a page COW at the time of the
> pinning to get that initial exclusive guarantee in the first place.
> Without that initial exclusivity, you cannot avoid future COW events
> breaking the wrong way.

Right, I see places using FOLL_WRITE when they only need read. It is a
bit ugly, IMHO.

> The downside of a write pin is that it not only makes that page
> exclusive, it also (a) marks it dirty and (b) requires write access.

RDMA adds FOLL_FORCE because people complained they couldn't do
read-only transfers from .rodata - uglyier still :\

> So the copy_page_range() code would do a write count around the copy:
> 
>     write_seqcount_begin(&mm->seq);
>     .. do the copy ..
>     write_seqcount_end(&mm->seq);

All of gup_fast and copy_mm could be wrappered in a seq count so that
gup_fast always goes to the slow path if fork is concurrent. 

That doesn't sound too expensive and avoids all the problems you
pointed with the WP scheme.

As you say fork & pin & threads is already questionable, so an
unconditional slow path on race should be OK.

> If we want to and really need to.

I would like to see some reasonable definition for the
read-side. Having drivers do FOLL_FORCE | FOLL_WRITE for read is just
confusing and weird for a driver facing API.

It may also help focus the remaining discussion for solving
set_page_dirty() if pin_user_pages() had a solid definition.

I prefer the version where read pin and write pin are symmetric. The
PTE in the MM should not change once pinned.

This is useful and if something only needs the original GUP semantics
then GUP is still there.

Jason

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ