Message-ID: <20200917193824.GL8409@ziepe.ca>
Date: Thu, 17 Sep 2020 16:38:24 -0300
From: Jason Gunthorpe <jgg@...pe.ca>
To: Linus Torvalds <torvalds@...ux-foundation.org>
Cc: Peter Xu <peterx@...hat.com>, John Hubbard <jhubbard@...dia.com>,
Leon Romanovsky <leonro@...dia.com>,
Linux-MM <linux-mm@...ck.org>,
Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
"Maya B . Gokhale" <gokhale2@...l.gov>,
Yang Shi <yang.shi@...ux.alibaba.com>,
Marty Mcfadden <mcfadden8@...l.gov>,
Kirill Shutemov <kirill@...temov.name>,
Oleg Nesterov <oleg@...hat.com>, Jann Horn <jannh@...gle.com>,
Jan Kara <jack@...e.cz>, Kirill Tkhai <ktkhai@...tuozzo.com>,
Andrea Arcangeli <aarcange@...hat.com>,
Christoph Hellwig <hch@....de>,
Andrew Morton <akpm@...ux-foundation.org>
Subject: Re: [PATCH 1/4] mm: Trial do_wp_page() simplification
On Thu, Sep 17, 2020 at 11:11:06AM -0700, Linus Torvalds wrote:
> (a) if the pinner is going to change the page, it will have to get
> the pin with FOLL_WRITE in addition to FOLL_PIN
>
> (b) that means it will do the COW and mark the page writable and dirty
Yep
> (c) the whole _point_ of the FOLL_PIN is that subsequent operations
> shouldn't make it non-writable any more (ie it can't be unmapped, and
> we should synchronize on fork etc)
It is the ideal, but FOLL_PIN has been troubled for a long time:
https://lwn.net/Articles/753027/
i.e. write protection is known to happen due to writeback. As I
understand it, fork() also write protects. E.g.:
 copy_process()
  copy_mm()
   dup_mm()
    dup_mmap()
     copy_page_range()
      [..]
       copy_one_pte()
Gets to:
	if (is_cow_mapping(vm_flags) && pte_write(pte)) {
		ptep_set_wrprotect(src_mm, addr, src_pte);
		pte = pte_wrprotect(pte);
	}
Which blindly write protects a FOLL_PIN page. Here the page mapped by
src_pte will be under a pin.
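As a toy userspace model of that branch (not kernel code; every toy_*
name is made up for illustration), the pin state never enters the
decision:

```c
#include <stdbool.h>

/* Toy stand-ins for kernel state; all names here are hypothetical. */
struct toy_pte {
	bool present;
	bool write;
};

struct toy_page {
	int pincount;	/* > 0 models an active FOLL_PIN */
};

/*
 * Models the quoted branch: a writable PTE in a COW mapping is write
 * protected in the parent (src) and in the child's copy alike.
 * The page's pin state is never consulted.
 */
struct toy_pte toy_copy_one_pte(struct toy_pte *src, bool cow_mapping,
				const struct toy_page *page)
{
	struct toy_pte child = *src;

	(void)page;	/* deliberately unused: the pin is ignored */
	if (cow_mapping && child.write) {
		src->write = false;	/* stands in for ptep_set_wrprotect() */
		child.write = false;	/* stands in for pte_wrprotect() */
	}
	return child;
}
```

Calling it with a pinned page and a writable parent PTE leaves both
the parent and the child PTE read-only.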
I *think* the failing test is basically:
1) pin_user_pages(mem, FOLL_FORCE | FOLL_WRITE)
2) pid = fork()
3) child: does a bit, then exec
4) parent: waitpid(pid)
5) *mem = 0
Here #5 is the WP-triggered COW. We know a WP-triggered COW is
happening from the bisect, and from the test succeeding with
MADV_DONTFORK. #2 triggers the dup_mmap() and the ptep_set_wrprotect()
(from inspection, at least).
With this "Trial do_wp_page() simplification" patch, when #5 goes to do
COW it gets a copy instead of a reuse; before the patch,
reuse_swap_page() was aborting the copy.
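The five steps can be walked through with a toy state machine. This is
only a sketch under assumptions: it assumes the old path reused the
page whenever the parent was the sole mapper, and that the trial patch
always copies while the pin holds an extra reference. None of the
toy_* names are kernel APIs.

```c
#include <stdbool.h>

/* Toy model; every name is hypothetical, none is a kernel API. */
struct toy_state {
	int mapped_page;	/* page the parent's PTE points at */
	int pinned_page;	/* page the pinner (e.g. a DMA engine) holds */
	int mapcount;		/* number of PTEs mapping mapped_page */
	int next_page;		/* next free page id for a COW copy */
	bool pte_write;
};

/* Step 1: pin the page for write; the PTE ends up writable. */
void toy_pin(struct toy_state *s)
{
	s->pinned_page = s->mapped_page;
	s->pte_write = true;
}

/* Step 2: fork() write protects the parent PTE and adds a child mapping. */
void toy_fork(struct toy_state *s)
{
	s->pte_write = false;
	s->mapcount++;
}

/* Step 3: the child execs, dropping its mapping of the page. */
void toy_child_exec(struct toy_state *s)
{
	s->mapcount--;
}

/*
 * Step 5: write fault on the write-protected PTE.
 * reuse_if_sole_mapper == true models the assumed old behaviour (reuse
 * when only the parent maps the page); false models the trial patch,
 * where the reference held by the pin forces a copy.
 */
void toy_write_fault(struct toy_state *s, bool reuse_if_sole_mapper)
{
	if (!(reuse_if_sole_mapper && s->mapcount == 1))
		s->mapped_page = s->next_page++;	/* COW copy */
	s->pte_write = true;
}

/* Returns true when the parent and the pinner still share one page. */
bool toy_run(bool reuse_if_sole_mapper)
{
	struct toy_state s = { .mapped_page = 1, .mapcount = 1, .next_page = 2 };

	toy_pin(&s);
	toy_fork(&s);
	toy_child_exec(&s);
	toy_write_fault(&s, reuse_if_sole_mapper);
	return s.mapped_page == s.pinned_page;
}
```

In this model toy_run(true) keeps the parent and the pinner on the same
page, while toy_run(false) leaves the pinner holding a page the parent
no longer maps.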
So, to your point, yes ideally FOLL_PIN would never write-protect
pages!
Having looked at it for a while, this now looks reasonable and
doable. page_maybe_dma_pinned() was created for exactly this kind of
case.
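For readers unfamiliar with it, the "maybe" is there because the check
is a heuristic. A minimal userspace sketch, assuming the scheme used
for normal (non-compound) pages where each pin adds a large bias
(GUP_PIN_COUNTING_BIAS, 1024) to the page refcount and the check simply
compares the refcount against that bias. The toy_* names and
TOY_PIN_BIAS are illustrative, not kernel symbols:

```c
#include <stdbool.h>

#define TOY_PIN_BIAS 1024u	/* assumed to mirror GUP_PIN_COUNTING_BIAS */

/* Hypothetical stand-in for the refcount field of struct page. */
struct toy_refpage {
	unsigned int refcount;
};

/* An ordinary reference, as get_page() would take. */
void toy_get_page(struct toy_refpage *p)
{
	p->refcount += 1;
}

/* A FOLL_PIN style reference: bump the refcount by the bias. */
void toy_pin_page(struct toy_refpage *p)
{
	p->refcount += TOY_PIN_BIAS;
}

/* Drop a FOLL_PIN style reference. */
void toy_unpin_page(struct toy_refpage *p)
{
	p->refcount -= TOY_PIN_BIAS;
}

/*
 * Heuristic check: a refcount at or above the bias is treated as
 * "probably pinned". Enough ordinary references give a false positive,
 * which is why the answer is only "maybe".
 */
bool toy_maybe_dma_pinned(const struct toy_refpage *p)
{
	return p->refcount >= TOY_PIN_BIAS;
}
```

A false positive only costs an unnecessary early copy at fork() time, so
erring on the side of "pinned" is the safe direction for this use.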
I've attached a dumb sketch for the pte level (surely wrong! I have
never looked at this part of the mm before!) at the end of this
message.
Peter, do you know this better? Does this inspire you to make a patch? :)
Would really love to see this. We have such a huge expensive mess with
MADV_DONTFORK, this would eliminate all that overhead.
Thanks,
Jason
diff --git a/mm/memory.c b/mm/memory.c
index 469af373ae76e1..6bc19a43da1391 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -689,6 +689,21 @@ struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
 }
 #endif
 
+static int duplicate_page(pte_t *newpte, struct vm_area_struct *vma,
+			  unsigned long address, struct page *page)
+{
+	struct page *new_page;
+
+	new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, address);
+	if (!new_page)
+		return -ENOMEM;
+	copy_user_highpage(new_page, page, address, vma);
+
+	/* FIXME: surely more than this */
+	*newpte = mk_pte(new_page, vma->vm_page_prot);
+	return 0;
+}
+
 /*
  * copy one vm_area from one task to the other. Assumes the page tables
  * already present in the new task to be cleared in the whole range
@@ -703,6 +718,7 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	unsigned long vm_flags = vma->vm_flags;
 	pte_t pte = *src_pte;
 	struct page *page;
+	bool do_src_wp;
 
 	/* pte contains position in swap or file, so copy. */
 	if (unlikely(!pte_present(pte))) {
@@ -775,15 +791,6 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 		goto out_set_pte;
 	}
 
-	/*
-	 * If it's a COW mapping, write protect it both
-	 * in the parent and the child
-	 */
-	if (is_cow_mapping(vm_flags) && pte_write(pte)) {
-		ptep_set_wrprotect(src_mm, addr, src_pte);
-		pte = pte_wrprotect(pte);
-	}
-
 	/*
 	 * If it's a shared mapping, mark it clean in
 	 * the child
@@ -800,11 +807,34 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	if (!(vm_flags & VM_UFFD_WP))
 		pte = pte_clear_uffd_wp(pte);
 
+	do_src_wp = is_cow_mapping(vm_flags) && pte_write(pte);
 	page = vm_normal_page(vma, addr, pte);
 	if (page) {
 		get_page(page);
 		page_dup_rmap(page, false);
 		rss[mm_counter(page)]++;
+
+		/*
+		 * If a page is DMA pinned we never want to write protect it,
+		 * copy it now.
+		 */
+		if (do_src_wp && page_maybe_dma_pinned(page)) {
+			do_src_wp = false;
+			ret = duplicate_page(&pte, vma, addr, page);
+			if (ret)
+				/* FIXME: need to restructure a bit to handle this */
+				return ret;
+		}
+	}
+
+	/*
+	 * If it's a COW mapping, write protect it both
+	 * in the parent and the child
+	 * FIXME: check carefully that this new order is OK
+	 */
+	if (do_src_wp) {
+		ptep_set_wrprotect(src_mm, addr, src_pte);
+		pte = pte_wrprotect(pte);
 	}
 
 out_set_pte: