
Date:	Sat, 1 Dec 2012 13:26:49 +0100
From:	Ingo Molnar <mingo@...nel.org>
To:	Linus Torvalds <torvalds@...ux-foundation.org>
Cc:	Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
	linux-mm <linux-mm@...ck.org>,
	Peter Zijlstra <a.p.zijlstra@...llo.nl>,
	Paul Turner <pjt@...gle.com>,
	Lee Schermerhorn <Lee.Schermerhorn@...com>,
	Christoph Lameter <cl@...ux.com>,
	Rik van Riel <riel@...hat.com>, Mel Gorman <mgorman@...e.de>,
	Andrew Morton <akpm@...ux-foundation.org>,
	Andrea Arcangeli <aarcange@...hat.com>,
	Thomas Gleixner <tglx@...utronix.de>,
	Johannes Weiner <hannes@...xchg.org>,
	Hugh Dickins <hughd@...gle.com>
Subject: [RFC PATCH] mm/migration: Remove anon vma locking from
 try_to_unmap() use


* Ingo Molnar <mingo@...nel.org> wrote:

> 1)
> 
> This patch might solve the remapping 
> (remove_migration_ptes()), but does not solve the anon-vma 
> locking done in the first, unmapping step of pte-migration - 
> which is done via try_to_unmap(): which is a generic VM 
> function used by swapout too, so callers do not necessarily 
> hold the mmap_sem.
> 
> A new TTU flag might solve it although I detest flag-driven 
> locking semantics with a passion:
> 
> Splitting out unlocked versions of try_to_unmap_anon(), 
> try_to_unmap_ksm(), try_to_unmap_file() and constructing an 
> unlocked try_to_unmap() out of them, to be used by the 
> migration code, would be the cleaner option.

So as a quick concept hack I wrote the patch attached below. 
(It's not signed off, see the patch description text for the 
reason.)
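
The two options weighed in the quoted text (a flag that switches the locking off, versus separate locked and unlocked entry points) can be sketched in plain userspace C. This is only a toy model under stated assumptions: struct toy_anon_vma, toy_lock(), the unmap_all*() helpers and the WALK_* flags are hypothetical names, and an atomic_flag spinlock stands in for the root anon vma mutex. None of this is kernel API.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical flags, standing in for the TTU_* flags. */
enum walk_flags { WALK_DEFAULT = 0, WALK_CALLER_PINS = 1 };

struct toy_anon_vma {
	atomic_flag lock;                /* toy spinlock, not a kernel mutex */
	int nr_mappings;                 /* pretend number of mapped ptes */
	unsigned long lock_acquisitions; /* counts how often the lock was taken */
};

static void toy_lock(struct toy_anon_vma *av)
{
	while (atomic_flag_test_and_set_explicit(&av->lock, memory_order_acquire))
		; /* spin until the flag was previously clear */
	av->lock_acquisitions++;
}

static void toy_unlock(struct toy_anon_vma *av)
{
	atomic_flag_clear_explicit(&av->lock, memory_order_release);
}

/* The actual work. The caller must guarantee that 'av' cannot go away. */
static int unmap_all_unlocked(struct toy_anon_vma *av)
{
	int n = av->nr_mappings;

	assert(n >= 0);
	av->nr_mappings = 0;
	return n;
}

/* Option A: flag-driven locking; the flag decides whether the lock is taken. */
static int unmap_all_flagged(struct toy_anon_vma *av, enum walk_flags flags)
{
	bool need_lock = !(flags & WALK_CALLER_PINS);
	int n;

	if (need_lock)
		toy_lock(av);
	n = unmap_all_unlocked(av);
	if (need_lock)
		toy_unlock(av);
	return n;
}

/* Option B: a locked wrapper built on top of the unlocked helper. */
static int unmap_all(struct toy_anon_vma *av)
{
	int n;

	toy_lock(av);
	n = unmap_all_unlocked(av);
	toy_unlock(av);
	return n;
}
```

With option B a caller that already guarantees the lifetime of the object calls unmap_all_unlocked() directly, so the locking rule is visible at the call site rather than hidden in a flag value.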

With this applied I get the same good 4x JVM performance:

     spec1.txt:           throughput =     157471.10 SPECjbb2005 bops 
     spec2.txt:           throughput =     157817.09 SPECjbb2005 bops 
     spec3.txt:           throughput =     157581.79 SPECjbb2005 bops 
     spec4.txt:           throughput =     157890.26 SPECjbb2005 bops 
                                           --------------------------
           SUM:           throughput =     630760.24 SPECjbb2005 bops

... because the JVM workload did not trigger the migration 
scalability threshold to begin with.

Mainline 4xJVM SPECjbb performance:

     spec1.txt:           throughput =     128575.47 SPECjbb2005 bops
     spec2.txt:           throughput =     125767.24 SPECjbb2005 bops
     spec3.txt:           throughput =     130042.30 SPECjbb2005 bops
     spec4.txt:           throughput =     128155.32 SPECjbb2005 bops
                                           --------------------------
           SUM:           throughput =     512540.33 SPECjbb2005 bops

     # (32 CPUs, 4 instances, 8 warehouses each, 240 seconds runtime, !THP)
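
The SUM lines and the relative gain over mainline can be re-derived from the per-instance figures. A minimal C sketch follows; the helper names sum_bops() and gain_percent() are illustrative only and are not part of any SPECjbb tooling.

```c
#include <assert.h>
#include <stddef.h>

/* Sum the per-instance throughput figures (SPECjbb2005 bops). */
static double sum_bops(const double *bops, size_t n)
{
	double sum = 0.0;

	for (size_t i = 0; i < n; i++)
		sum += bops[i];
	return sum;
}

/* Relative gain of 'patched' over 'baseline', in percent. */
static double gain_percent(double patched, double baseline)
{
	assert(baseline > 0.0);
	return (patched / baseline - 1.0) * 100.0;
}

/* Per-instance figures copied from the two tables above. */
static const double patched_bops[4] = {
	157471.10, 157817.09, 157581.79, 157890.26
};
static const double mainline_bops[4] = {
	128575.47, 125767.24, 130042.30, 128155.32
};
```

Calling gain_percent(sum_bops(patched_bops, 4), sum_bops(mainline_bops, 4)) gives the improvement of the patched kernel over mainline for this 4x JVM run, roughly 23%.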

But !THP/4K numa02 performance went through the roof!

Mainline !THP numa02 performance:

         40.918 secs runtime/thread
         26.051 secs fastest (min) thread time
         59.229 secs elapsed (max) thread time [ spread: -28.0% ]
         26.844 GB data processed, per thread
        858.993 GB data processed, total
          2.206 nsecs/byte/thread
          0.453 GB/sec/thread
         14.503 GB/sec total

numa/core v18 + migration-locking-enhancements, !THP:

         18.543 secs runtime/thread
         17.721 secs fastest (min) thread time
         19.262 secs elapsed (max) thread time [ spread: -4.0% ]
         26.844 GB data processed, per thread
        858.993 GB data processed, total
          0.718 nsecs/byte/thread
          1.394 GB/sec/thread
         44.595 GB/sec total

As you can see, the performance of each of the 32 threads is 
within a tight bound:

         17.721 secs fastest (min) thread time
         19.262 secs elapsed (max) thread time [ spread: -4.0% ]

... with very little spread between them.
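
The derived numa02 figures above are consistent with being computed from the elapsed (max) thread time, with 1 GB taken as 10^9 bytes. That is an inference from the numbers, not something the numa02 output states. A short C sketch under that assumption (struct numa02_run and the helper names are hypothetical):

```c
#include <assert.h>

struct numa02_run {
	double elapsed_max_secs; /* "elapsed (max) thread time" */
	double gb_per_thread;    /* "GB data processed, per thread" */
	double gb_total;         /* "GB data processed, total" */
};

/* Aggregate bandwidth: total data over the slowest thread's runtime. */
static double gb_per_sec_total(const struct numa02_run *r)
{
	assert(r->elapsed_max_secs > 0.0);
	return r->gb_total / r->elapsed_max_secs;
}

/* Per-thread bandwidth over the same elapsed time. */
static double gb_per_sec_thread(const struct numa02_run *r)
{
	assert(r->elapsed_max_secs > 0.0);
	return r->gb_per_thread / r->elapsed_max_secs;
}

/* With 1 GB = 1e9 bytes: secs * 1e9 nsecs / (GB * 1e9 bytes) = secs / GB. */
static double nsecs_per_byte_thread(const struct numa02_run *r)
{
	assert(r->gb_per_thread > 0.0);
	return r->elapsed_max_secs / r->gb_per_thread;
}

/* Speedup in elapsed time of 'after' relative to 'before'. */
static double elapsed_speedup(const struct numa02_run *before,
			      const struct numa02_run *after)
{
	assert(after->elapsed_max_secs > 0.0);
	return before->elapsed_max_secs / after->elapsed_max_secs;
}

/* Figures copied from the two result blocks above. */
static const struct numa02_run mainline_run = { 59.229, 26.844, 858.993 };
static const struct numa02_run patched_run  = { 19.262, 26.844, 858.993 };
```

elapsed_speedup(&mainline_run, &patched_run) comes out at a bit over 3x, which matches the ratio of the two "GB/sec total" lines.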

So this is roughly as good as it can get without hard binding - 
and according to my limited testing the numa02 workload is 
20-30% faster than the AutoNUMA or balancenuma kernels on the 
same hardware/kernel combo. The above numa02 result now also 
gets reasonably close to the numa/core +THP numa02 numbers (to 
within 10%).

As expected there's a lot of TLB flushing going on, but, and 
this was unexpected to me, even maximally pushing the migration 
code does not trigger anything pathological on this 4-node 
system - so while the TLB optimization will be a welcome 
enhancement, it's not a must-have at this stage.

I'll do a cleaner version of this patch and I'll test on a 
larger system with a large NUMA factor too to make sure we don't 
need the TLB optimization on !THP.

So I think (assuming that I have not overlooked something 
critical in these patches!), with these two fixes all the 
difficult known regressions in numa/core are fixed.

I'll do more testing with broader workloads and on more systems 
to ascertain this.

Thanks,

	Ingo

---------------->
Subject: mm/migration: Remove anon vma locking from try_to_unmap() use
From: Ingo Molnar <mingo@...nel.org>
Date: Sat Dec 1 11:22:09 CET 2012

As outlined in:

    mm/migration: Don't lock anon vmas in rmap_walk_anon()

the process-global anon vma mutex locking of the page migration
code can be very expensive.

This removes the second (and last) use of that mutex from the
migration code: try_to_unmap().

Since try_to_unmap() is also used by swapout and filesystem
code, which do not hold the mmap_sem, we only want to do this
optimization from the migration path.

This patch is ugly and should be replaced by a
try_to_unmap_locked() variant that offers us the unlocked
codepath, but it's good enough for testing purposes.

Cc: Linus Torvalds <torvalds@...ux-foundation.org>
Cc: Andrew Morton <akpm@...ux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@...llo.nl>
Cc: Andrea Arcangeli <aarcange@...hat.com>
Cc: Rik van Riel <riel@...hat.com>
Cc: Mel Gorman <mgorman@...e.de>
Cc: Thomas Gleixner <tglx@...utronix.de>
Cc: Hugh Dickins <hughd@...gle.com>
Not-Signed-off-by: Ingo Molnar <mingo@...nel.org>
---
 include/linux/rmap.h |    2 +-
 mm/huge_memory.c     |    2 +-
 mm/memory-failure.c  |    2 +-
 mm/rmap.c            |   13 ++++++++++---
 4 files changed, 13 insertions(+), 6 deletions(-)

Index: linux/include/linux/rmap.h
===================================================================
--- linux.orig/include/linux/rmap.h
+++ linux/include/linux/rmap.h
@@ -220,7 +220,7 @@ int try_to_munlock(struct page *);
 /*
  * Called by memory-failure.c to kill processes.
  */
-struct anon_vma *page_lock_anon_vma(struct page *page);
+struct anon_vma *page_lock_anon_vma(struct page *page, enum ttu_flags flags);
 void page_unlock_anon_vma(struct anon_vma *anon_vma);
 int page_mapped_in_vma(struct page *page, struct vm_area_struct *vma);
 
Index: linux/mm/huge_memory.c
===================================================================
--- linux.orig/mm/huge_memory.c
+++ linux/mm/huge_memory.c
@@ -1645,7 +1645,7 @@ int split_huge_page(struct page *page)
 	int ret = 1;
 
 	BUG_ON(!PageAnon(page));
-	anon_vma = page_lock_anon_vma(page);
+	anon_vma = page_lock_anon_vma(page, 0);
 	if (!anon_vma)
 		goto out;
 	ret = 0;
Index: linux/mm/memory-failure.c
===================================================================
--- linux.orig/mm/memory-failure.c
+++ linux/mm/memory-failure.c
@@ -402,7 +402,7 @@ static void collect_procs_anon(struct pa
 	struct anon_vma *av;
 	pgoff_t pgoff;
 
-	av = page_lock_anon_vma(page);
+	av = page_lock_anon_vma(page, 0);
 	if (av == NULL)	/* Not actually mapped anymore */
 		return;
 
Index: linux/mm/rmap.c
===================================================================
--- linux.orig/mm/rmap.c
+++ linux/mm/rmap.c
@@ -442,7 +442,7 @@ out:
  * atomic op -- the trylock. If we fail the trylock, we fall back to getting a
  * reference like with page_get_anon_vma() and then block on the mutex.
  */
-struct anon_vma *page_lock_anon_vma(struct page *page)
+struct anon_vma *page_lock_anon_vma(struct page *page, enum ttu_flags flags)
 {
 	struct anon_vma *anon_vma = NULL;
 	struct anon_vma *root_anon_vma;
@@ -456,6 +456,13 @@ struct anon_vma *page_lock_anon_vma(stru
 		goto out;
 
 	anon_vma = (struct anon_vma *) (anon_mapping - PAGE_MAPPING_ANON);
+	/*
+	 * The migration code paths are already holding the mmap_sem,
+	 * so the anon vma cannot go away from under us - return it:
+	 */
+	if (flags & TTU_MIGRATION)
+		goto out;
+
 	root_anon_vma = ACCESS_ONCE(anon_vma->root);
 	if (mutex_trylock(&root_anon_vma->mutex)) {
 		/*
@@ -732,7 +739,7 @@ static int page_referenced_anon(struct p
 	struct anon_vma_chain *avc;
 	int referenced = 0;
 
-	anon_vma = page_lock_anon_vma(page);
+	anon_vma = page_lock_anon_vma(page, 0);
 	if (!anon_vma)
 		return referenced;
 
@@ -1474,7 +1481,7 @@ static int try_to_unmap_anon(struct page
 	struct anon_vma_chain *avc;
 	int ret = SWAP_AGAIN;
 
-	anon_vma = page_lock_anon_vma(page);
+	anon_vma = page_lock_anon_vma(page, flags);
 	if (!anon_vma)
 		return ret;
 