Date:	Thu, 5 May 2016 16:39:24 +0200
From:	Andrea Arcangeli <aarcange@...hat.com>
To:	Alex Williamson <alex.williamson@...hat.com>
Cc:	"Kirill A. Shutemov" <kirill@...temov.name>,
	kirill.shutemov@...ux.intel.com, linux-kernel@...r.kernel.org,
	"linux-mm@...ck.org" <linux-mm@...ck.org>
Subject: Re: [BUG] vfio device assignment regression with THP ref counting
 redesign

Hello Alex,

On Wed, May 04, 2016 at 07:19:27PM -0600, Alex Williamson wrote:
> On Mon, 2 May 2016 20:03:07 +0200
> Andrea Arcangeli <aarcange@...hat.com> wrote:
> 
> > On Mon, May 02, 2016 at 07:00:42PM +0300, Kirill A. Shutemov wrote:
> > > Agreed. I just didn't see the two-refcounts solution.  
> > 
> > If you didn't do it already or if you're busy with something else,
> > I can change the patch to the two refcount solution, which should
> > restore the old semantics without breaking rmap.
> 
> I didn't see any follow-up beyond this nor patches on lkml.  Do we have
> something we feel confident for posting to v4.6 with a stable backport
> to v4.5?  Thanks,

I'm currently testing this:

From c327b17f4de0c968bb3b9035fe36d80b2c28b2f8 Mon Sep 17 00:00:00 2001
From: Andrea Arcangeli <aarcange@...hat.com>
Date: Fri, 29 Apr 2016 01:05:06 +0200
Subject: [PATCH 1/3] mm: thp: calculate the mapcount correctly for THP pages
 during WP faults

This provides full accuracy in the mapcount calculation for write
protect faults, so page pinning will not get broken by false positive
copy-on-writes.

total_mapcount() isn't the right calculation for reuse_swap_page(),
so this introduces page_trans_huge_mapcount(), which is effectively
the fully accurate equivalent of page_mapcount() when dealing with
Transparent Hugepages. However, we only use page_trans_huge_mapcount()
during COW faults, where it is strictly needed, because of its higher
runtime cost.

This also provides, at practically zero cost, the total_mapcount
information, which is needed to know whether we can still relocate the
page anon_vma to the local vma. If page_trans_huge_mapcount() returns
1 we can reuse the page no matter whether it's a pte or a
pmd_trans_huge that triggered the fault, but we can only relocate the
page anon_vma to the local vma->anon_vma if we're sure this "vma" is
the only one mapping the whole THP physical range.

Kirill A. Shutemov reported the problem with moving the page anon_vma
to the local vma->anon_vma in a previous version of this patch.

Signed-off-by: Andrea Arcangeli <aarcange@...hat.com>
---
 include/linux/mm.h   |  6 +++++
 include/linux/swap.h |  8 ++++---
 mm/huge_memory.c     | 67 +++++++++++++++++++++++++++++++++++++++++++++-------
 mm/memory.c          | 21 +++++++++-------
 mm/swapfile.c        | 13 +++++-----
 5 files changed, 89 insertions(+), 26 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index a55e5be..4b532d8 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -500,11 +500,17 @@ static inline int page_mapcount(struct page *page)
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 int total_mapcount(struct page *page);
+int page_trans_huge_mapcount(struct page *page, int *total_mapcount);
 #else
 static inline int total_mapcount(struct page *page)
 {
 	return page_mapcount(page);
 }
+static inline int page_trans_huge_mapcount(struct page *page,
+					   int *total_mapcount)
+{
+	return *total_mapcount = page_mapcount(page);
+}
 #endif
 
 static inline struct page *virt_to_head_page(const void *x)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 2b83359..acef20d 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -418,7 +418,7 @@ extern sector_t swapdev_block(int, pgoff_t);
 extern int page_swapcount(struct page *);
 extern int swp_swapcount(swp_entry_t entry);
 extern struct swap_info_struct *page_swap_info(struct page *);
-extern int reuse_swap_page(struct page *);
+extern bool reuse_swap_page(struct page *, int *);
 extern int try_to_free_swap(struct page *);
 struct backing_dev_info;
 
@@ -513,8 +513,10 @@ static inline int swp_swapcount(swp_entry_t entry)
 	return 0;
 }
 
-#define reuse_swap_page(page) \
-	(!PageTransCompound(page) && page_mapcount(page) == 1)
+static inline bool reuse_swap_page(struct page *page, int *total_mapcount)
+{
+	return page_trans_huge_mapcount(page, total_mapcount) == 1;
+}
 
 static inline int try_to_free_swap(struct page *page)
 {
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 86f9f8b..d368620 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1298,15 +1298,9 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	VM_BUG_ON_PAGE(!PageCompound(page) || !PageHead(page), page);
 	/*
 	 * We can only reuse the page if nobody else maps the huge page or it's
-	 * part. We can do it by checking page_mapcount() on each sub-page, but
-	 * it's expensive.
-	 * The cheaper way is to check page_count() to be equal 1: every
-	 * mapcount takes page reference reference, so this way we can
-	 * guarantee, that the PMD is the only mapping.
-	 * This can give false negative if somebody pinned the page, but that's
-	 * fine.
+	 * part.
 	 */
-	if (page_mapcount(page) == 1 && page_count(page) == 1) {
+	if (page_trans_huge_mapcount(page, NULL) == 1) {
 		pmd_t entry;
 		entry = pmd_mkyoung(orig_pmd);
 		entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
@@ -2080,7 +2074,8 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
 		if (pte_write(pteval)) {
 			writable = true;
 		} else {
-			if (PageSwapCache(page) && !reuse_swap_page(page)) {
+			if (PageSwapCache(page) &&
+			    !reuse_swap_page(page, NULL)) {
 				unlock_page(page);
 				result = SCAN_SWAP_CACHE_PAGE;
 				goto out;
@@ -3225,6 +3220,60 @@ int total_mapcount(struct page *page)
 }
 
 /*
+ * This calculates accurately how many mappings a transparent hugepage
+ * has (unlike page_mapcount() which isn't fully accurate). This full
+ * accuracy is primarily needed to know if copy-on-write faults can
+ * take over the page and change the mapping to read-write instead of
+ * copying it. At the same time this returns the total_mapcount too.
+ *
+ * The return value tells whether the page can be reused: it is the
+ * highest mapcount found among the subpages. If the return value is
+ * one, then even if different processes are mapping different
+ * subpages of the transparent hugepage, they can all reuse it,
+ * because each process is reusing a different subpage.
+ *
+ * The total_mapcount instead counts all virtual mappings of the
+ * subpages. If the total_mapcount is equal to "one", it tells the
+ * caller all mappings belong to the same "mm" and in turn the
+ * anon_vma of the transparent hugepage can become the vma->anon_vma
+ * local one as no other process may be mapping any of the subpages.
+ *
+ * It would be more accurate to replace page_mapcount() with
+ * page_trans_huge_mapcount() everywhere; however, we only use
+ * page_trans_huge_mapcount() in the copy-on-write faults, where we
+ * need full accuracy to avoid breaking page pinning, because
+ * page_trans_huge_mapcount() is slower than page_mapcount().
+ */
+int page_trans_huge_mapcount(struct page *page, int *total_mapcount)
+{
+	int i, ret, _total_mapcount, mapcount;
+
+	/* hugetlbfs shouldn't call it */
+	VM_BUG_ON_PAGE(PageHuge(page), page);
+
+	if (likely(!PageTransCompound(page)))
+		return atomic_read(&page->_mapcount) + 1;
+
+	page = compound_head(page);
+
+	_total_mapcount = ret = 0;
+	for (i = 0; i < HPAGE_PMD_NR; i++) {
+		mapcount = atomic_read(&page[i]._mapcount) + 1;
+		ret = max(ret, mapcount);
+		_total_mapcount += mapcount;
+	}
+	if (PageDoubleMap(page)) {
+		ret -= 1;
+		_total_mapcount -= HPAGE_PMD_NR;
+	}
+	mapcount = compound_mapcount(page);
+	ret += mapcount;
+	_total_mapcount += mapcount;
+	*total_mapcount = _total_mapcount;
+	return ret;
+}
+
+/*
  * This function splits huge page into normal pages. @page can point to any
  * subpage of huge page to split. Split doesn't change the position of @page.
  *
diff --git a/mm/memory.c b/mm/memory.c
index 93897f2..1589aa4 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2340,6 +2340,7 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	 * not dirty accountable.
 	 */
 	if (PageAnon(old_page) && !PageKsm(old_page)) {
+		int total_mapcount;
 		if (!trylock_page(old_page)) {
 			get_page(old_page);
 			pte_unmap_unlock(page_table, ptl);
@@ -2354,13 +2355,17 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
 			}
 			put_page(old_page);
 		}
-		if (reuse_swap_page(old_page)) {
-			/*
-			 * The page is all ours.  Move it to our anon_vma so
-			 * the rmap code will not search our parent or siblings.
-			 * Protected against the rmap code by the page lock.
-			 */
-			page_move_anon_rmap(old_page, vma, address);
+		if (reuse_swap_page(old_page, &total_mapcount)) {
+			if (total_mapcount == 1) {
+				/*
+				 * The page is all ours. Move it to
+				 * our anon_vma so the rmap code will
+				 * not search our parent or siblings.
+				 * Protected against the rmap code by
+				 * the page lock.
+				 */
+				page_move_anon_rmap(old_page, vma, address);
+			}
 			unlock_page(old_page);
 			return wp_page_reuse(mm, vma, address, page_table, ptl,
 					     orig_pte, old_page, 0, 0);
@@ -2584,7 +2589,7 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	inc_mm_counter_fast(mm, MM_ANONPAGES);
 	dec_mm_counter_fast(mm, MM_SWAPENTS);
 	pte = mk_pte(page, vma->vm_page_prot);
-	if ((flags & FAULT_FLAG_WRITE) && reuse_swap_page(page)) {
+	if ((flags & FAULT_FLAG_WRITE) && reuse_swap_page(page, NULL)) {
 		pte = maybe_mkwrite(pte_mkdirty(pte), vma);
 		flags &= ~FAULT_FLAG_WRITE;
 		ret |= VM_FAULT_WRITE;
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 83874ec..031713ab 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -922,18 +922,19 @@ out:
  * to it.  And as a side-effect, free up its swap: because the old content
  * on disk will never be read, and seeking back there to write new content
  * later would only waste time away from clustering.
+ *
+ * NOTE: total_mapcount should not be relied upon by the caller if
+ * reuse_swap_page() returns false, but it may always be overwritten
+ * (see the other implementation for CONFIG_SWAP=n).
  */
-int reuse_swap_page(struct page *page)
+bool reuse_swap_page(struct page *page, int *total_mapcount)
 {
 	int count;
 
 	VM_BUG_ON_PAGE(!PageLocked(page), page);
 	if (unlikely(PageKsm(page)))
-		return 0;
-	/* The page is part of THP and cannot be reused */
-	if (PageTransCompound(page))
-		return 0;
-	count = page_mapcount(page);
+		return false;
+	count = page_trans_huge_mapcount(page, total_mapcount);
 	if (count <= 1 && PageSwapCache(page)) {
 		count += page_swapcount(page);
 		if (count == 1 && !PageWriteback(page)) {


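To make the new return-value pair concrete, here is a stand-alone
user-space model of the arithmetic page_trans_huge_mapcount() performs
above. This is not kernel code: NR_SUBPAGES, thp_mapcount() and the
4-subpage hugepage size are illustrative assumptions (the real
HPAGE_PMD_NR is typically 512), and sub[i] models what
atomic_read(&page[i]._mapcount) + 1 would return.

#include <stdio.h>

#define NR_SUBPAGES 4

/* Returns the highest per-subpage mapcount (the "can we reuse"
 * value) and stores the sum of all mappings in *total. */
static int thp_mapcount(const int sub[NR_SUBPAGES], int compound,
			int double_map, int *total)
{
	int i, max = 0, sum = 0;

	for (i = 0; i < NR_SUBPAGES; i++) {
		if (sub[i] > max)
			max = sub[i];
		sum += sub[i];
	}
	if (double_map) {	/* PageDoubleMap correction */
		max -= 1;
		sum -= NR_SUBPAGES;
	}
	max += compound;	/* pmd (compound) mappings */
	sum += compound;
	*total = sum;
	return max;
}

int main(void)
{
	int total;

	/* two processes each pte-mapping a different subpage: both
	 * can reuse (max is 1) but the anon_vma cannot be localized
	 * (total is 2) */
	int sub[NR_SUBPAGES] = { 1, 1, 0, 0 };
	printf("reuse %d", thp_mapcount(sub, 0, 0, &total));
	printf(" total %d\n", total);

	/* a single process mapping the whole THP with one pmd: both
	 * reuse and anon_vma relocation are allowed (1 and 1) */
	int sub2[NR_SUBPAGES] = { 0, 0, 0, 0 };
	printf("reuse %d", thp_mapcount(sub2, 1, 0, &total));
	printf(" total %d\n", total);
	return 0;
}
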

From b3cd271859f4c8243b58b4b55998fcc9ee0a0988 Mon Sep 17 00:00:00 2001
From: Andrea Arcangeli <aarcange@...hat.com>
Date: Sat, 30 Apr 2016 18:35:34 +0200
Subject: [PATCH 3/4] mm: thp: microoptimize compound_mapcount()

compound_mapcount() is only called after PageCompound() has already
been checked by the caller, so there's no point in checking it again.
Gcc may already optimize the check away because the function is
inline, but this removes the runtime check for sure and adds a debug
assert instead.

Signed-off-by: Andrea Arcangeli <aarcange@...hat.com>
---
 include/linux/mm.h | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 4b532d8..119325d 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -471,8 +471,7 @@ static inline atomic_t *compound_mapcount_ptr(struct page *page)
 
 static inline int compound_mapcount(struct page *page)
 {
-	if (!PageCompound(page))
-		return 0;
+	VM_BUG_ON_PAGE(!PageCompound(page), page);
 	page = compound_head(page);
 	return atomic_read(compound_mapcount_ptr(page)) + 1;
 }


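For context on why removing the check is a strict win in production
builds: VM_BUG_ON_PAGE() compiles down to a no-op when CONFIG_DEBUG_VM
is disabled. A paraphrased sketch of the mmdebug.h definitions of that
era (approximate, from memory; the exact dump_page() message format is
an assumption):

#ifdef CONFIG_DEBUG_VM
#define VM_BUG_ON_PAGE(cond, page)					\
	do {								\
		if (unlikely(cond)) {					\
			dump_page(page, "VM_BUG_ON_PAGE");		\
			BUG();						\
		}							\
	} while (0)
#else
/* side-effect-free: only type-checks the condition */
#define VM_BUG_ON_PAGE(cond, page) BUILD_BUG_ON_INVALID(cond)
#endif

So with CONFIG_DEBUG_VM=n, compound_mapcount() reduces to the
compound_head() lookup plus one atomic read, with no branch at all.
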

From a2f1172344b87b1b0e18d07014ee5ab2027fac10 Mon Sep 17 00:00:00 2001
From: Andrea Arcangeli <aarcange@...hat.com>
Date: Thu, 5 May 2016 00:59:27 +0200
Subject: [PATCH 4/4] mm: thp: split_huge_pmd_address() comment improvement

The comment is partly wrong; this improves it by also covering the
case of split_huge_pmd_address() being called by try_to_unmap_one()
when TTU_SPLIT_HUGE_PMD is set.

Signed-off-by: Andrea Arcangeli <aarcange@...hat.com>
---
 mm/huge_memory.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index f8f07e4..e716726 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3032,8 +3032,10 @@ void split_huge_pmd_address(struct vm_area_struct *vma, unsigned long address,
 		return;
 
 	/*
-	 * Caller holds the mmap_sem write mode, so a huge pmd cannot
-	 * materialize from under us.
+	 * Caller holds the mmap_sem write mode or the anon_vma lock,
+	 * so a huge pmd cannot materialize from under us (khugepaged
+	 * holds both the mmap_sem write mode and the anon_vma lock
+	 * write mode).
 	 */
 	__split_huge_pmd(vma, pmd, address, freeze);
 }


I also noticed we aren't calling page_move_anon_rmap in
do_huge_pmd_wp_page when page_trans_huge_mapcount returns 1; that's a
longstanding inefficiency, but it's not a bug. We're not locking the
page in the THP COW path because we don't have to deal with swapcache,
and in turn we can't overwrite page->mapping. I think in practice it
would be safe anyway, because it's an atomic write, and no matter
whether the rmap_walk reader sees the value before or after the write
it'll still be able to find the pmd_trans_huge during the rmap walk.

However, if page->mapping can change under the reader (i.e. rmap_walk),
then the reader should use READ_ONCE to access page->mapping (or
page->mapping should become volatile). Otherwise it'd be a bug as far
as the C standard is concerned, where gcc could get confused in theory
(in practice it would work fine, as we're mostly just dereferencing
that page->mapping pointer and not using it for switch/case or stuff
like that where gcc could use a hash).

Regardless, for robustness it'd be better to take appropriate locking:
first check whether page->mapping is already pointing to the local
vma->anon_vma, and if not, take the page lock on the head THP and call
page_move_anon_rmap. Because this is a longstanding problem I didn't
address it yet; it's only a missing optimization, but it'd be nice to
get it covered too (considering we just worsened that optimization a
bit in the presence of a COW after a pmd split and before the physical
split).
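
An untested sketch of that locking scheme follows; the helper name is
hypothetical, and it ignores the trylock/retry dance that the pte path
in do_wp_page() performs when the page lock is contended:

static void thp_cow_move_anon_rmap(struct page *page,
				   struct vm_area_struct *vma,
				   unsigned long address)
{
	page = compound_head(page);

	/* skip the page lock if the anon_vma is already local */
	if (page_anon_vma(page) == vma->anon_vma)
		return;

	/*
	 * The page lock serializes us against rmap_walk reading
	 * page->mapping while we overwrite it.
	 */
	lock_page(page);
	page_move_anon_rmap(page, vma, address);
	unlock_page(page);
}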
