Message-Id: <20200323234149.9FE95081@viggo.jf.intel.com>
Date:   Mon, 23 Mar 2020 16:41:49 -0700
From:   Dave Hansen <dave.hansen@...ux.intel.com>
To:     linux-kernel@...r.kernel.org
Cc:     Dave Hansen <dave.hansen@...ux.intel.com>, mhocko@...e.com,
        jannh@...gle.com, vbabka@...e.cz, minchan@...nel.org,
        dancol@...gle.com, joel@...lfernandes.org,
        akpm@...ux-foundation.org
Subject: [PATCH 1/2] mm/madvise: help MADV_PAGEOUT to find swap cache pages


From: Dave Hansen <dave.hansen@...ux.intel.com>

tl;dr: MADV_PAGEOUT ignores unmapped swap cache pages.  Enable
MADV_PAGEOUT to find and reclaim swap cache.

The long story:

Looking for another issue, I wrote a simple test which had two
processes: a parent and a fork()'d child.  The parent reads a
memory buffer shared by the fork() and the child calls
madvise(MADV_PAGEOUT) on the same buffer.

The first call to MADV_PAGEOUT does what is expected: it pages
the memory out and causes faults in the parent.  However, after
that, it does not cause any faults in the parent.  MADV_PAGEOUT
only works once!  This was a surprise.
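
For illustration, a test of this shape could look roughly like the
sketch below.  It is not the program used for this report.  The
MAP_PRIVATE|MAP_ANONYMOUS mapping, the sleep intervals, the use of
getrusage() major-fault counts to detect faults, and the helper
names (major_faults, read_buffer, run_pageout_rounds) are all
assumptions.  Whether pages are actually reclaimed also depends on
the kernel version and on swap being configured.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/resource.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef MADV_PAGEOUT
#define MADV_PAGEOUT 21	/* value from the Linux uapi headers (5.4+) */
#endif

/* Major faults (faults that needed I/O) taken so far by this process. */
static long major_faults(void)
{
	struct rusage ru;

	if (getrusage(RUSAGE_SELF, &ru) != 0)
		return -1;
	return ru.ru_majflt;
}

/* Read one byte per page so each page must be mapped; returns the sum. */
static unsigned long read_buffer(const volatile unsigned char *buf,
				 size_t len, size_t page_size)
{
	unsigned long sum = 0;
	size_t off;

	assert(page_size > 0);
	for (off = 0; off < len; off += page_size)
		sum += buf[off];
	return sum;
}

/*
 * Hypothetical driver: the child repeatedly calls MADV_PAGEOUT on a
 * buffer inherited across fork() and never touches it; the parent
 * reads the buffer 'rounds' times and stores the major faults seen
 * in each round in faults[].  Returns 0 on success, -1 on setup error.
 */
static int run_pageout_rounds(size_t npages, int rounds, long *faults)
{
	size_t page_size = (size_t)sysconf(_SC_PAGESIZE);
	size_t len = npages * page_size;
	unsigned char *buf;
	pid_t child;
	int i;

	buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED)
		return -1;
	memset(buf, 1, len);	/* populate the pages before fork() */

	child = fork();
	if (child < 0) {
		munmap(buf, len);
		return -1;
	}
	if (child == 0) {
		/* Errors (e.g. EINVAL on old kernels) are ignored here. */
		for (i = 0; i < rounds * 20; i++) {
			madvise(buf, len, MADV_PAGEOUT);
			usleep(1000);
		}
		_exit(0);
	}

	for (i = 0; i < rounds; i++) {
		long before = major_faults();

		(void)read_buffer(buf, len, page_size);
		faults[i] = major_faults() - before;
		usleep(10000);	/* give the child time to page out again */
	}

	kill(child, SIGKILL);
	waitpid(child, NULL, 0);
	munmap(buf, len);
	return 0;
}
```

Calling run_pageout_rounds() from a main() and printing faults[]
would show whether the parent keeps taking faults in later rounds
or only in the first one.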

The PTEs in the shared buffer start out pte_present()==1 in
both parent and child.  The first MADV_PAGEOUT operation replaces
those with pte_present()==0 swap PTEs.  The parent process
quickly faults and recreates pte_present()==1.  However, the
child process (the one calling MADV_PAGEOUT) never touches the
memory and has retained the non-present swap PTEs.
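
The two PTE states can be observed roughly from user space through
/proc/<pid>/pagemap, where bit 63 of each 64-bit entry means
"present" and bit 62 means "swapped".  The sketch below is only an
illustration: the enum and the helper names
(classify_pagemap_entry, read_pagemap_entry) are hypothetical, and
the pagemap view is a coarse approximation of what pte_present()
and is_swap_pte() see inside the kernel.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

#define PM_PRESENT	(1ULL << 63)	/* pagemap: page present in RAM */
#define PM_SWAPPED	(1ULL << 62)	/* pagemap: page is swapped */

enum pte_state {
	PTE_STATE_NONE,		/* neither present nor swapped */
	PTE_STATE_PRESENT,	/* roughly pte_present()==1 */
	PTE_STATE_SWAP,		/* roughly a non-present swap PTE */
};

/* Decode the present/swapped bits of one pagemap entry. */
static enum pte_state classify_pagemap_entry(uint64_t entry)
{
	if (entry & PM_PRESENT)
		return PTE_STATE_PRESENT;
	if (entry & PM_SWAPPED)
		return PTE_STATE_SWAP;
	return PTE_STATE_NONE;
}

/*
 * Read the pagemap entry covering 'vaddr' in the calling process.
 * Returns 0 on success, -1 if pagemap cannot be opened or read.
 */
static int read_pagemap_entry(const void *vaddr, uint64_t *entry)
{
	long page_size = sysconf(_SC_PAGESIZE);
	off_t offset;
	ssize_t n;
	int fd;

	if (page_size <= 0)
		return -1;
	/* One 8-byte entry per virtual page, indexed by page number. */
	offset = (off_t)((uintptr_t)vaddr / (uintptr_t)page_size) *
		 (off_t)sizeof(*entry);

	fd = open("/proc/self/pagemap", O_RDONLY);
	if (fd < 0)
		return -1;
	n = pread(fd, entry, sizeof(*entry), offset);
	close(fd);
	return n == (ssize_t)sizeof(*entry) ? 0 : -1;
}
```

In the scenario above, the child calling read_pagemap_entry() on
the buffer after the first MADV_PAGEOUT would be expected to see
PTE_STATE_SWAP, while the parent would see PTE_STATE_PRESENT again
once it has faulted the pages back in.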

This situation could also happen in the case where a single
process had some of its data placed in the swap cache but where
the memory has not yet been reclaimed.

The MADV_PAGEOUT code has a pte_present()==0 check and skips
any PTE that is not present.  This makes unmapped swap cache
immune to MADV_PAGEOUT, which is not very friendly behavior.

Enable MADV_PAGEOUT to find and reclaim swap cache.  Because
swap cache is not pinned by holding the PTE lock, a reference
must be held until the page is isolated, where a second
reference is obtained.

Signed-off-by: Dave Hansen <dave.hansen@...ux.intel.com>
Cc: Michal Hocko <mhocko@...e.com>
Cc: Jann Horn <jannh@...gle.com>
Cc: Vlastimil Babka <vbabka@...e.cz>
Cc: Minchan Kim <minchan@...nel.org>
Cc: Daniel Colascione <dancol@...gle.com>
Cc: "Joel Fernandes (Google)" <joel@...lfernandes.org>
Cc: Andrew Morton <akpm@...ux-foundation.org>
---

 b/mm/madvise.c |   68 +++++++++++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 57 insertions(+), 11 deletions(-)

diff -puN mm/madvise.c~madv-pageout-find-swap-cache mm/madvise.c
--- a/mm/madvise.c~madv-pageout-find-swap-cache	2020-03-23 16:30:48.505385896 -0700
+++ b/mm/madvise.c	2020-03-23 16:30:48.509385896 -0700
@@ -250,6 +250,52 @@ static void force_shm_swapin_readahead(s
 #endif		/* CONFIG_SWAP */
 
 /*
+ * Given a PTE, find the corresponding 'struct page'
+ * and acquire a reference.  Also handles non-present
+ * swap PTEs.
+ *
+ * Returns NULL when there is no page to reclaim.
+ */
+static struct page *pte_get_reclaim_page(struct vm_area_struct *vma,
+					 unsigned long addr, pte_t ptent)
+{
+	swp_entry_t entry;
+	struct page *page;
+
+	/* Totally empty PTE: */
+	if (pte_none(ptent))
+		return NULL;
+
+	/* Handle present or PROT_NONE ptes: */
+	if (!is_swap_pte(ptent)) {
+		page = vm_normal_page(vma, addr, ptent);
+		if (page)
+			get_page(page);
+		return page;
+	}
+
+	/*
+	 * 'ptent' is now definitely a (non-present) swap
+	 * PTE in this process.  Go look for additional
+	 * references to the swap cache.
+	 */
+
+	/*
+	 * Is it one of the "swap PTEs" that's not really
+	 * swap?  Do not try to reclaim those.
+	 */
+	entry = pte_to_swp_entry(ptent);
+	if (non_swap_entry(entry))
+		return NULL;
+
+	/*
+	 * The PTE was a true swap entry.  The page may be in
+	 * the swap cache.
+	 */
+	return lookup_swap_cache(entry, vma, addr);
+}
+
+/*
  * Schedule all required I/O operations.  Do not wait for completion.
  */
 static long madvise_willneed(struct vm_area_struct *vma,
@@ -398,13 +444,8 @@ regular_page:
 	for (; addr < end; pte++, addr += PAGE_SIZE) {
 		ptent = *pte;
 
-		if (pte_none(ptent))
-			continue;
-
-		if (!pte_present(ptent))
-			continue;
-
-		page = vm_normal_page(vma, addr, ptent);
+		/* 'page' can be mapped, in the swap cache or both */
+		page = pte_get_reclaim_page(vma, addr, ptent);
 		if (!page)
 			continue;
 
@@ -413,9 +454,10 @@ regular_page:
 		 * are sure it's worth. Split it if we are only owner.
 		 */
 		if (PageTransCompound(page)) {
-			if (page_mapcount(page) != 1)
+			if (page_mapcount(page) != 1) {
+				put_page(page);
 				break;
-			get_page(page);
+			}
 			if (!trylock_page(page)) {
 				put_page(page);
 				break;
@@ -436,12 +478,14 @@ regular_page:
 		}
 
 		/* Do not interfere with other mappings of this page */
-		if (page_mapcount(page) != 1)
+		if (page_mapcount(page) != 1) {
+			put_page(page);
 			continue;
+		}
 
 		VM_BUG_ON_PAGE(PageTransCompound(page), page);
 
-		if (pte_young(ptent)) {
+		if (!is_swap_pte(ptent) && pte_young(ptent)) {
 			ptent = ptep_get_and_clear_full(mm, addr, pte,
 							tlb->fullmm);
 			ptent = pte_mkold(ptent);
@@ -466,6 +510,8 @@ regular_page:
 			}
 		} else
 			deactivate_page(page);
+		/* drop ref acquired in pte_get_reclaim_page() */
+		put_page(page);
 	}
 
 	arch_leave_lazy_mmu_mode();
_
