lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20200917193824.GL8409@ziepe.ca>
Date:   Thu, 17 Sep 2020 16:38:24 -0300
From:   Jason Gunthorpe <jgg@...pe.ca>
To:     Linus Torvalds <torvalds@...ux-foundation.org>
Cc:     Peter Xu <peterx@...hat.com>, John Hubbard <jhubbard@...dia.com>,
        Leon Romanovsky <leonro@...dia.com>,
        Linux-MM <linux-mm@...ck.org>,
        Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
        "Maya B . Gokhale" <gokhale2@...l.gov>,
        Yang Shi <yang.shi@...ux.alibaba.com>,
        Marty Mcfadden <mcfadden8@...l.gov>,
        Kirill Shutemov <kirill@...temov.name>,
        Oleg Nesterov <oleg@...hat.com>, Jann Horn <jannh@...gle.com>,
        Jan Kara <jack@...e.cz>, Kirill Tkhai <ktkhai@...tuozzo.com>,
        Andrea Arcangeli <aarcange@...hat.com>,
        Christoph Hellwig <hch@....de>,
        Andrew Morton <akpm@...ux-foundation.org>
Subject: Re: [PATCH 1/4] mm: Trial do_wp_page() simplification

On Thu, Sep 17, 2020 at 11:11:06AM -0700, Linus Torvalds wrote:
>  (a) if the pinner is going to change the page, it will have to get
> the pin with FOLL_WRITE in addition to FOLL_PIN
> 
>  (b) that means it will do the COW and mark the page writable and dirty

Yep
 
>  (c) the whole _point_ of the FOLL_PIN is that subsequent operations
> shouldn't make it non-writable any more (ie it can't be unmapped, and
> we should synchronize on fork etc)

It is the ideal, but FOLL_PIN has been troubled for a long time:
https://lwn.net/Articles/753027/

ie writeprotect is known to happen due to writeback. I had understood
that fork() will also cause write protect. Eg

copy_process()
 copy_mm()
  dup_mm()
   dup_mmap()
    copy_page_range()
     [..]
      copy_one_pte()

Gets to:

	if (is_cow_mapping(vm_flags) && pte_write(pte)) {
		ptep_set_wrprotect(src_mm, addr, src_pte);
		pte = pte_wrprotect(pte);
	}

Which blindly write protects a FOLL_PIN page. Here src_pte will be
under a pin.

I *think* the failing test is basically:

 1) pin_user_pages(mem, FOLL_FORCE | FOLL_WRITE)
 2) pid = fork()
 3) child: does a bit, then exec
 4) parent: waitpid(pid)
 5) *mem = 0

Here #5 is the WP triggered COW. We know a WP triggered COW is
happening from the bisect and success result with MADV_DONTFORK.

#2 triggers the dup_mmap() and the ptep_set_wrprotect() (From
inspection, at least)

This "Trial do_wp_page() simplification" patch means that when #5 goes
to do COW it gets a copy instead of a re-use because the
reuse_swap_page() was aborting the copy before.

So, to your point, yes ideally FOLL_PIN would never write-protect
pages!

Looking for awhile, this now looks reasonable and
doable. page_maybe_dma_pinned() was created for exactly this kind of
case.

I've attached a dumb sketch for the pte level (surely wrong! I have
never looked at this part of the mm before!) at the end of this
message. 

Peter, do you know this better? Does this inspire you to make a patch? :)

Would really love to see this. We have such a huge expensive mess with
MADV_DONTFORK, this would eliminate all that overhead.

Thanks,
Jason

diff --git a/mm/memory.c b/mm/memory.c
index 469af373ae76e1..6bc19a43da1391 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -689,6 +689,21 @@ struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
 }
 #endif
 
+static int duplicate_page(pte_t *newpte, struct vm_area_struct *vma,
+			  unsigned long address, struct page *page)
+{
+	struct page *new_page;
+
+	new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, address);
+	if (!new_page)
+		return -ENOMEM;
+	copy_user_highpage(new_page, page, address, vma);
+
+	/* FIXME: surely more than this */
+	*newpte = mk_pte(new_page, vma->vm_page_prot);
+	return 0;
+}
+
 /*
  * copy one vm_area from one task to the other. Assumes the page tables
  * already present in the new task to be cleared in the whole range
@@ -703,6 +718,7 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	unsigned long vm_flags = vma->vm_flags;
 	pte_t pte = *src_pte;
 	struct page *page;
+	bool do_src_wp;
 
 	/* pte contains position in swap or file, so copy. */
 	if (unlikely(!pte_present(pte))) {
@@ -775,15 +791,6 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 		goto out_set_pte;
 	}
 
-	/*
-	 * If it's a COW mapping, write protect it both
-	 * in the parent and the child
-	 */
-	if (is_cow_mapping(vm_flags) && pte_write(pte)) {
-		ptep_set_wrprotect(src_mm, addr, src_pte);
-		pte = pte_wrprotect(pte);
-	}
-
 	/*
 	 * If it's a shared mapping, mark it clean in
 	 * the child
@@ -800,11 +807,34 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	if (!(vm_flags & VM_UFFD_WP))
 		pte = pte_clear_uffd_wp(pte);
 
+	do_src_wp = is_cow_mapping(vm_flags) && pte_write(pte);
 	page = vm_normal_page(vma, addr, pte);
 	if (page) {
 		get_page(page);
 		page_dup_rmap(page, false);
 		rss[mm_counter(page)]++;
+
+		/*
+		 * If a page is DMA pinned we never want to write protect it,
+		 * copy it now.
+		 */
+		if (do_src_wp && page_maybe_dma_pinned(page)) {
+			do_src_wp = false;
+			ret = duplicate_page(&pte, vma, addr, page);
+			if (ret)
+			     /* FIXME: need to restructure a bit to handle this */
+			     return ret;
+		}
+	}
+
+	/*
+	 * If it's a COW mapping, write protect it both
+	 * in the parent and the child
+	 * FIXME check carefully this is new order is OK
+	 */
+	if (do_src_wp) {
+		ptep_set_wrprotect(src_mm, addr, src_pte);
+		pte = pte_wrprotect(pte);
 	}
 
 out_set_pte:

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ