linux-kernel - [RFC PATCH] mm/migration: Don't lock anon vmas in rmap_walk

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20121201094927.GA12366@gmail.com>
Date:	Sat, 1 Dec 2012 10:49:27 +0100
From:	Ingo Molnar <mingo@...nel.org>
To:	Linus Torvalds <torvalds@...ux-foundation.org>
Cc:	Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
	linux-mm <linux-mm@...ck.org>,
	Peter Zijlstra <a.p.zijlstra@...llo.nl>,
	Paul Turner <pjt@...gle.com>,
	Lee Schermerhorn <Lee.Schermerhorn@...com>,
	Christoph Lameter <cl@...ux.com>,
	Rik van Riel <riel@...hat.com>, Mel Gorman <mgorman@...e.de>,
	Andrew Morton <akpm@...ux-foundation.org>,
	Andrea Arcangeli <aarcange@...hat.com>,
	Thomas Gleixner <tglx@...utronix.de>,
	Johannes Weiner <hannes@...xchg.org>,
	Hugh Dickins <hughd@...gle.com>
Subject: [RFC PATCH] mm/migration: Don't lock anon vmas in rmap_walk_anon()


* Linus Torvalds <torvalds@...ux-foundation.org> wrote:

> On Fri, Nov 30, 2012 at 11:58 AM, Ingo Molnar <mingo@...nel.org> wrote:
> >
> > When pushed hard enough via threaded workloads (for example 
> > via the numa02 test) then the upstream page migration code 
> > in mm/migration.c becomes unscalable, resulting in lot of 
> > scheduling on the anon vma mutex and a subsequent drop in 
> > performance.
> 
> Ugh.
>
> I wonder if migration really needs that thing to be a mutex? I 
> may be wrong, but the anon_vma lock only protects the actual 
> rmap chains, and migration only ever changes the pte 
> *contents*, not the actual chains of pte's themselves, right?
> 
> So if this is a migration-specific scalability issue, then it 
> might be possible to solve by making the mutex be a rwsem 
> instead, and have migration only take it for reading.
> 
> Of course, I'm quite possibly wrong, and the code depends on 
> full mutual exclusion.
> 
> Just a thought, in case it makes somebody go "Hmm.."

I *think* you are right that for this type of migration that we 
are using here we indeed don't need to take an exclusive vma 
lock - in fact I think we don't need to take it at all:

The main goal in the migration code is to unmap the pte from all 
thread's MMU visibility, before we copy its contents into 
another page [located on another node] and map that page into 
the page tables instead of the old page.

No other thread must have a write reference to the old page when 
the copying [migrate_page_copy()] is performed, or we corrupt 
user-space memory subtly via copying a slightly older version of 
user-space memory.

rmap_walk() OTOH appears to have been written as a general 
purpose function, to be usable without holding the mmap_sem() as 
well, so it is written to protect against the disappearance of 
anon vmas.

But ... in all upstream and NUMA-migration codepaths I could 
find - and AFAICS in all other page-migration codepaths as well, 
including sys_move_pages() - anon vmas cannot disappear from 
under us, because we are already holding the mmap_sem.

[ Initially I assumed that swapout or filesystem code could 
  somehow call this without holding the mmap sem - but could not 
  find any such code path. ]

So I think we could get away rather simply, with something like 
the (entirely and utterly untested!) patch below.

But ... judging from the code my feeling is this can only be the 
first (and easiest) step:

1)

This patch might solve the remapping (remove_migration_ptes()), 
but does not solve the anon-vma locking done in the first, 
unmapping step of pte-migration - which is done via 
try_to_unmap(): which is a generic VM function used by swapout 
too, so callers do not necessarily hold the mmap_sem.

A new TTU flag might solve it although I detest flag-driven 
locking semantics with a passion:

Splitting out unlocked versions of try_to_unmap_anon(), 
try_to_unmap_ksm(), try_to_unmap_file() and constructing an 
unlocked try_to_unmap() out of them, to be used by the migration 
code, would be the cleaner option.

2)

Taking a process-global mutex 1024 times per 2MB was indeed very 
expensive - and lets assume that we manage to sort that out - 
but then we are AFAICS exposed to the next layer: the 
finegrained migrate_pages() model where the migration code 
flushes the TLB 512 times per 2MB to unmap and remap it again 
and again at 4K granularity ...

Assuming the simpler patch goes fine I'll try to come up with 
something intelligent for the TLB flushing sub-problem too: we 
could in theory batch the migration TLB flushes as well, by 
first doing an array of 2MB granular unmaps, then copying up to 
512x 4K pages, then doing the final 2MB granular [but still 
4K-represented in the page tables] remap.

2MB granular TLB flushing is OK for these workloads, I can see 
that in +THP tests.

I will keep you updated about how far I manage to get down this 
road.

Thanks,

	Ingo

---------------------------->
Subject: mm/migration: Don't lock anon vmas in rmap_walk_anon()
From: Ingo Molnar <mingo@...nel.org>
Date: Thu Nov 22 14:16:26 CET 2012

rmap_walk_anon() appears to be too careful about locking the anon
vma for its own good - since all callers are holding the mmap_sem
no vma can go away from under us:

 - sys_move_pages() is doing down_read(&mm->mmap_sem) in the
   sys_move_pages() -> do_pages_move() -> do_move_page_to_node_array()
   code path, which then calls migrate_pages(pagelist), which then
   does unmap_and_move() for every page in the list, which does
   remove_migration_ptes() which calls rmap.c::try_to_unmap().

 - the NUMA migration code's migrate_misplaced_page(), which calls
   migrate_pages() ... try_to_unmap(), is holding the mm->mmap_sem
   read-locked by virtue of the low level page fault handler taking
   it before calling handle_mm_fault().

Removing this lock removes a global mutex from the hot path of
migration-happy threaded workloads which can cause pathological
performance like this:

    96.43%        process 0  [kernel.kallsyms]  [k] perf_trace_sched_switch
                  |
                  --- perf_trace_sched_switch
                      __schedule
                      schedule
                      schedule_preempt_disabled
                      __mutex_lock_common.isra.6
                      __mutex_lock_slowpath
                      mutex_lock
                     |
                     |--50.61%-- rmap_walk
                     |          move_to_new_page
                     |          migrate_pages
                     |          migrate_misplaced_page
                     |          __do_numa_page.isra.69
                     |          handle_pte_fault
                     |          handle_mm_fault
                     |          __do_page_fault
                     |          do_page_fault
                     |          page_fault
                     |          __memset_sse2
                     |          |
                     |           --100.00%-- worker_thread
                     |                     |
                     |                      --100.00%-- start_thread
                     |
                      --49.39%-- page_lock_anon_vma
                                try_to_unmap_anon
                                try_to_unmap
                                migrate_pages
                                migrate_misplaced_page
                                __do_numa_page.isra.69
                                handle_pte_fault
                                handle_mm_fault
                                __do_page_fault
                                do_page_fault
                                page_fault
                                __memset_sse2
                                |
                                 --100.00%-- worker_thread
                                           start_thread

Cc: Linus Torvalds <torvalds@...ux-foundation.org>
Cc: Andrew Morton <akpm@...ux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@...llo.nl>
Cc: Andrea Arcangeli <aarcange@...hat.com>
Cc: Rik van Riel <riel@...hat.com>
Cc: Mel Gorman <mgorman@...e.de>
Cc: Thomas Gleixner <tglx@...utronix.de>
Cc: Hugh Dickins <hughd@...gle.com>
Not-Yet-Signed-off-by: Ingo Molnar <mingo@...nel.org>
---
 mm/rmap.c |   13 +++++--------
 1 file changed, 5 insertions(+), 8 deletions(-)

Index: linux/mm/rmap.c
===================================================================
--- linux.orig/mm/rmap.c
+++ linux/mm/rmap.c
@@ -1686,6 +1686,9 @@ void __put_anon_vma(struct anon_vma *ano
 /*
  * rmap_walk() and its helpers rmap_walk_anon() and rmap_walk_file():
  * Called by migrate.c to remove migration ptes, but might be used more later.
+ *
+ * Note: callers are expected to protect against anon vmas disappearing
+ *       under us - by holding the mmap_sem read or write locked.
  */
 static int rmap_walk_anon(struct page *page, int (*rmap_one)(struct page *,
 		struct vm_area_struct *, unsigned long, void *), void *arg)
@@ -1695,16 +1698,10 @@ static int rmap_walk_anon(struct page *p
 	struct anon_vma_chain *avc;
 	int ret = SWAP_AGAIN;
 
-	/*
-	 * Note: remove_migration_ptes() cannot use page_lock_anon_vma()
-	 * because that depends on page_mapped(); but not all its usages
-	 * are holding mmap_sem. Users without mmap_sem are required to
-	 * take a reference count to prevent the anon_vma disappearing
-	 */
 	anon_vma = page_anon_vma(page);
 	if (!anon_vma)
 		return ret;
-	anon_vma_lock(anon_vma);
+
 	anon_vma_interval_tree_foreach(avc, &anon_vma->rb_root, pgoff, pgoff) {
 		struct vm_area_struct *vma = avc->vma;
 		unsigned long address = vma_address(page, vma);
@@ -1712,7 +1709,7 @@ static int rmap_walk_anon(struct page *p
 		if (ret != SWAP_AGAIN)
 			break;
 	}
-	anon_vma_unlock(anon_vma);
+
 	return ret;
 }
 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/