linux-kernel - Re: [PATCH 1/4] mm: Trial do_wp

Message-ID: <20200915191346.GD2949@xz-x1>
Date:   Tue, 15 Sep 2020 15:13:46 -0400
From:   Peter Xu <peterx@...hat.com>
To:     Jason Gunthorpe <jgg@...pe.ca>
Cc:     Linus Torvalds <torvalds@...ux-foundation.org>,
        Leon Romanovsky <leonro@...dia.com>,
        Linux-MM <linux-mm@...ck.org>,
        Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
        "Maya B . Gokhale" <gokhale2@...l.gov>,
        Yang Shi <yang.shi@...ux.alibaba.com>,
        Marty Mcfadden <mcfadden8@...l.gov>,
        Kirill Shutemov <kirill@...temov.name>,
        Oleg Nesterov <oleg@...hat.com>, Jann Horn <jannh@...gle.com>,
        Jan Kara <jack@...e.cz>, Kirill Tkhai <ktkhai@...tuozzo.com>,
        Andrea Arcangeli <aarcange@...hat.com>,
        Christoph Hellwig <hch@....de>,
        Andrew Morton <akpm@...ux-foundation.org>
Subject: Re: [PATCH 1/4] mm: Trial do_wp_page() simplification

On Tue, Sep 15, 2020 at 03:29:33PM -0300, Jason Gunthorpe wrote:
> On Tue, Sep 15, 2020 at 01:05:53PM -0300, Jason Gunthorpe wrote:
> > On Tue, Sep 15, 2020 at 10:50:40AM -0400, Peter Xu wrote:
> > > On Mon, Sep 14, 2020 at 08:28:51PM -0300, Jason Gunthorpe wrote:
> > > > Yes, this stuff does pin_user_pages_fast() and MADV_DONTFORK
> > > > together. It sets FOLL_FORCE and FOLL_WRITE to get an exclusive copy
> > > > of the page and MADV_DONTFORK was needed to ensure that a future fork
> > > > doesn't establish a COW that would break the DMA by moving the
> > > > physical page over to the fork. DMA should stay with the process that
> > > > called pin_user_pages_fast() (Is MADV_DONTFORK still needed with
> > > > recent years work to GUP/etc? It is a pretty terrible ancient thing)
> > > 
> > > ... Now I'm more confused on what has happened.
> > 
> > I'm going to try to confirm that the MADV_DONTFORK is actually being
> > done by userspace properly, more later.
> 
> It turns out the test is broken and does not call MADV_DONTFORK when
> doing forks - it is an opt-in it didn't do.
> 
> It looks to me like this patch makes it much more likely that the COW
> break after page pinning will end up moving the pinned physical page
> to the fork while before it was not very common. Does that make sense?

My understanding is that the fix should not matter much for the currently
failing test case, as long as it uses FOLL_FORCE & FOLL_WRITE.  However, what
I'm not sure about is the case where the RDMA/DMA buffers are designed for pure
reads from userspace.

E.g. for vfio I'm looking at vaddr_get_pfn(), where I believe such pure-read
buffers will go through a GUP with FOLL_PIN and !FOLL_WRITE, which is finally
passed to pin_user_pages_remote().  So what I'm worried about is something like
this:

  1. Proc A gets a private anon page X for DMA, mapcount==refcount==1.

  2. Proc A fork()s and gives birth to proc B, page X will now have
     mapcount==refcount==2, write-protected.  proc B quits.  Page X goes back
     to mapcount==refcount==1 (note! without WRITE bits set in the PTE).

  3. pin_user_pages(write=false) for page X.  Since it's with !FORCE & !WRITE,
     no COW needed.  Refcount==2 after that.

  4. Pass these pages to the device.  We either set up the IOMMU page table or
     just use the PFNs, which is not important imho - the most important thing
     is that the device will DMA into page X no matter what.

  5. Some thread of proc A writes to page X, triggering COW since it's
     write-protected with mapcount==1 && refcount==2.  The HVA that was pointing
     to page X will be changed to point to another page Y after the COW.

  6. Device DMA happens, data resides on X.  Proc A can never get the data,
     though, because it's looking at page Y now.

If this is a problem, we may still need the fix patch (though maybe not as
urgently as before).  But I'd like to double-check, just in case I missed some
obvious facts above.
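
To make the sequence above easier to follow, here is a rough userspace toy
model of steps 1-6.  It is only a sketch: struct toy_page, struct toy_pte and
all toy_*() helpers are hypothetical names made up for illustration, not kernel
types or APIs, and the write-fault rule (reuse the page only when refcount==1,
otherwise copy) is an assumption taken from step 5.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins only; these are NOT the kernel's struct page or pte_t. */
struct toy_page { char name; int mapcount; int refcount; };
struct toy_pte  { struct toy_page *page; bool writable; };

/* fork(): the child maps the same page and both PTEs become read-only. */
static void toy_fork(struct toy_pte *parent, struct toy_pte *child)
{
	parent->writable = false;
	child->page = parent->page;
	child->writable = false;
	parent->page->mapcount++;
	parent->page->refcount++;
}

/* Process exit: drop the mapping and the reference it held. */
static void toy_exit(struct toy_pte *pte)
{
	pte->page->mapcount--;
	pte->page->refcount--;
	pte->page = NULL;
}

/* Read-only pin (!FORCE & !WRITE): no COW break, just one more reference. */
static struct toy_page *toy_pin_readonly(struct toy_pte *pte)
{
	pte->page->refcount++;
	return pte->page;
}

/* Write fault: assumed rule is "reuse only if refcount==1, else copy". */
static void toy_write_fault(struct toy_pte *pte, struct toy_page *spare)
{
	if (pte->writable)
		return;
	if (pte->page->refcount == 1) {
		pte->writable = true;		/* reuse the page in place */
		return;
	}
	pte->page->mapcount--;			/* unmap the old page ... */
	pte->page->refcount--;
	spare->mapcount = 1;			/* ... and map the fresh copy */
	spare->refcount = 1;
	pte->page = spare;
	pte->writable = true;
}

/* Walks steps 1-6; returns true if the device and proc A end up on
 * different pages. */
bool toy_scenario(void)
{
	struct toy_page x = { .name = 'X', .mapcount = 1, .refcount = 1 };
	struct toy_page y = { .name = 'Y' };
	struct toy_pte a = { .page = &x, .writable = true };	/* step 1 */
	struct toy_pte b = { 0 };

	toy_fork(&a, &b);					/* step 2 */
	toy_exit(&b);
	assert(x.mapcount == 1 && x.refcount == 1 && !a.writable);

	struct toy_page *dma = toy_pin_readonly(&a);		/* steps 3-4 */
	assert(x.refcount == 2);

	toy_write_fault(&a, &y);				/* step 5 */
	return dma != a.page;					/* step 6 */
}
```

In this model toy_scenario() returns true: the device still holds page X while
proc A's PTE now points at page Y.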

> 
> Given that the tests are wrong it seems like broken userspace,
> however, it also worked reliably for a fairly long time.

IMHO it worked because the page used for RDMA had mapcount==1, so it was
previously reused as-is even after a fork without MADV_DONTFORK, once the child
had quit.  However, logically it should really be protected by MADV_DONTFORK
rather than relying on the page being reused.
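
For reference, opting a buffer in to MADV_DONTFORK from userspace could look
roughly like the sketch below.  alloc_dma_buffer() is a hypothetical helper
name, not taken from the test or from any RDMA library; only mmap(), madvise()
and munmap() are real calls, and the sketch assumes Linux.

```c
#define _DEFAULT_SOURCE		/* for MADV_DONTFORK and MAP_ANONYMOUS */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/*
 * Hypothetical helper: allocate a private anonymous buffer and mark it
 * MADV_DONTFORK, so that a later fork() does not map it into the child
 * and no COW relationship is set up for these pages.
 * Returns NULL on failure.
 */
void *alloc_dma_buffer(size_t len)
{
	assert(len > 0);

	void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED)
		return NULL;

	/* Must be done before any fork() that could share the pages. */
	if (madvise(buf, len, MADV_DONTFORK) != 0) {
		munmap(buf, len);
		return NULL;
	}
	return buf;
}
```

The buffer would then be handed to whatever registers it for DMA, and released
with munmap() once the device is done with it.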

Thanks,

-- 
Peter Xu