lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20200915232238.GO1221970@ziepe.ca>
Date:   Tue, 15 Sep 2020 20:22:38 -0300
From:   Jason Gunthorpe <jgg@...pe.ca>
To:     Peter Xu <peterx@...hat.com>
Cc:     Linus Torvalds <torvalds@...ux-foundation.org>,
        Leon Romanovsky <leonro@...dia.com>,
        Linux-MM <linux-mm@...ck.org>,
        Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
        "Maya B . Gokhale" <gokhale2@...l.gov>,
        Yang Shi <yang.shi@...ux.alibaba.com>,
        Marty Mcfadden <mcfadden8@...l.gov>,
        Kirill Shutemov <kirill@...temov.name>,
        Oleg Nesterov <oleg@...hat.com>, Jann Horn <jannh@...gle.com>,
        Jan Kara <jack@...e.cz>, Kirill Tkhai <ktkhai@...tuozzo.com>,
        Andrea Arcangeli <aarcange@...hat.com>,
        Christoph Hellwig <hch@....de>,
        Andrew Morton <akpm@...ux-foundation.org>
Subject: Re: [PATCH 1/4] mm: Trial do_wp_page() simplification

On Tue, Sep 15, 2020 at 05:33:30PM -0400, Peter Xu wrote:

> > RDMA doesn't ever use !WRITE
> > 
> > I'm guessing #5 is the issue, just with a different ordering. If the
> > #3 pin_user_pages() preceeds the #2 fork, don't we get to the same #5?
> 
> Right, but only if without MADV_DONTFORK?

Yes, results are that MADV_DONTFORK resolves the issue for initial
tests. I should know if a bigger test suite passes in a few days.
 
> > > If this is a problem, we may still need the fix patch (maybe not as urgent as
> > > before at least).  But I'd like to double confirm, just in case I miss some
> > > obvious facts above.
> > 
> > I'm worred that the sudden need to have MAD_DONTFORK is going to be a
> > turn into a huge regression. It already blew up our first level of
> > synthetic test cases. I'm worried what we will see when the
> > application suite is run in a few months :\
> 
> For my own preference I'll consider changing kernel behavior if the impact is
> still under control (the performance report of 30%+ boost is also attractive
> after the simplify-cow patch).  The other way is to maintain the old reuse
> logic forever, that'll be another kind of burden.  Seems no easy way on either
> side...

It seems very strange that a physical page exclusively owned by a
process can become copied if pin_user_pages() is active and the
process did fork() at some point. 

Could the new pin_user_pages() logic help here? eg the
GUP_PIN_COUNTING_BIAS stuff?

Could the COW code consider a refcount of GUP_PIN_COUNTING_BIAS + 1 as
being owned by the current mm and not needing COW? The DMA pin would
be 'invisible' for COW purposes?

Perhaps an ideal thing would be to have fork() not set the write
protect on pages that have GUP_PIN_COUNTING_BIAS (ie are under DMA),
instead immediately copy during fork(). Then we don't get into this
situation and also don't need MADV_DONTFORK anymore (yay!!). Feels like
this could be low cost as fork() must already be touching the
refcount?

It looks like RDMA, media, vfio, vdpa, io_uring (!!) and xdp all use
FOLL_LONGTERM and may be at regression risk.

I can't say at this point the scope of the problem with RDMA.
*Technicaly* apps forking without MADV_DONTFORK are against the
defined programming model, but since the old kernel didn't fail
robustly there could be misses. FWIW the failure is catastrophic, the
app just breaks completely.

io_uring seems like something that would have interest to mix with
fork.. I see mentions of MADV_DONTFORK in io_uring documentation,
however it is not documented as a 'if you ever call fork() you have to
use this API'. That seems worrying.

> > > IMHO it worked because the page to do RDMA has mapcount==1, so it was reused
> > > previously just as-is even after the fork without MADV_DONTFORK and after the
> > > child quits.
> > 
> > That would match the results we see.. So this patch changes things so
> > it is not re-used as-is, but replaced with Y?
> 
> Yes. The patch lets "replaced with Y" (cow) happen earlier at step #3.  Then
> with MADV_DONTFORK, reuse should not happen any more.

?? Step #3 is pin_user_pages(), why would that replace with COW?

Thanks,
Jason

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ