[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20200915232238.GO1221970@ziepe.ca>
Date: Tue, 15 Sep 2020 20:22:38 -0300
From: Jason Gunthorpe <jgg@...pe.ca>
To: Peter Xu <peterx@...hat.com>
Cc: Linus Torvalds <torvalds@...ux-foundation.org>,
Leon Romanovsky <leonro@...dia.com>,
Linux-MM <linux-mm@...ck.org>,
Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
"Maya B . Gokhale" <gokhale2@...l.gov>,
Yang Shi <yang.shi@...ux.alibaba.com>,
Marty Mcfadden <mcfadden8@...l.gov>,
Kirill Shutemov <kirill@...temov.name>,
Oleg Nesterov <oleg@...hat.com>, Jann Horn <jannh@...gle.com>,
Jan Kara <jack@...e.cz>, Kirill Tkhai <ktkhai@...tuozzo.com>,
Andrea Arcangeli <aarcange@...hat.com>,
Christoph Hellwig <hch@....de>,
Andrew Morton <akpm@...ux-foundation.org>
Subject: Re: [PATCH 1/4] mm: Trial do_wp_page() simplification
On Tue, Sep 15, 2020 at 05:33:30PM -0400, Peter Xu wrote:
> > RDMA doesn't ever use !WRITE
> >
> > I'm guessing #5 is the issue, just with a different ordering. If the
> > #3 pin_user_pages() preceeds the #2 fork, don't we get to the same #5?
>
> Right, but only if without MADV_DONTFORK?
Yes, results are that MADV_DONTFORK resolves the issue for initial
tests. I should know if a bigger test suite passes in a few days.
> > > If this is a problem, we may still need the fix patch (maybe not as urgent as
> > > before at least). But I'd like to double confirm, just in case I miss some
> > > obvious facts above.
> >
> > I'm worred that the sudden need to have MAD_DONTFORK is going to be a
> > turn into a huge regression. It already blew up our first level of
> > synthetic test cases. I'm worried what we will see when the
> > application suite is run in a few months :\
>
> For my own preference I'll consider changing kernel behavior if the impact is
> still under control (the performance report of 30%+ boost is also attractive
> after the simplify-cow patch). The other way is to maintain the old reuse
> logic forever, that'll be another kind of burden. Seems no easy way on either
> side...
It seems very strange that a physical page exclusively owned by a
process can become copied if pin_user_pages() is active and the
process did fork() at some point.
Could the new pin_user_pages() logic help here? eg the
GUP_PIN_COUNTING_BIAS stuff?
Could the COW code consider a refcount of GUP_PIN_COUNTING_BIAS + 1 as
being owned by the current mm and not needing COW? The DMA pin would
be 'invisible' for COW purposes?
Perhaps an ideal thing would be to have fork() not set the write
protect on pages that have GUP_PIN_COUNTING_BIAS (ie are under DMA),
instead immediately copy during fork(). Then we don't get into this
situation and also don't need MADV_DONTFORK anymore (yay!!). Feels like
this could be low cost as fork() must already be touching the
refcount?
It looks like RDMA, media, vfio, vdpa, io_uring (!!) and xdp all use
FOLL_LONGTERM and may be at regression risk.
I can't say at this point the scope of the problem with RDMA.
*Technicaly* apps forking without MADV_DONTFORK are against the
defined programming model, but since the old kernel didn't fail
robustly there could be misses. FWIW the failure is catastrophic, the
app just breaks completely.
io_uring seems like something that would have interest to mix with
fork.. I see mentions of MADV_DONTFORK in io_uring documentation,
however it is not documented as a 'if you ever call fork() you have to
use this API'. That seems worrying.
> > > IMHO it worked because the page to do RDMA has mapcount==1, so it was reused
> > > previously just as-is even after the fork without MADV_DONTFORK and after the
> > > child quits.
> >
> > That would match the results we see.. So this patch changes things so
> > it is not re-used as-is, but replaced with Y?
>
> Yes. The patch lets "replaced with Y" (cow) happen earlier at step #3. Then
> with MADV_DONTFORK, reuse should not happen any more.
?? Step #3 is pin_user_pages(), why would that replace with COW?
Thanks,
Jason
Powered by blists - more mailing lists