Message-ID: <50BA2E0A.2070102@redhat.com>
Date: Sat, 01 Dec 2012 11:19:22 -0500
From: Rik van Riel <riel@...hat.com>
To: Ingo Molnar <mingo@...nel.org>
CC: Linus Torvalds <torvalds@...ux-foundation.org>,
Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
linux-mm <linux-mm@...ck.org>,
Peter Zijlstra <a.p.zijlstra@...llo.nl>,
Paul Turner <pjt@...gle.com>,
Lee Schermerhorn <Lee.Schermerhorn@...com>,
Christoph Lameter <cl@...ux.com>, Mel Gorman <mgorman@...e.de>,
Andrew Morton <akpm@...ux-foundation.org>,
Andrea Arcangeli <aarcange@...hat.com>,
Thomas Gleixner <tglx@...utronix.de>,
Johannes Weiner <hannes@...xchg.org>,
Hugh Dickins <hughd@...gle.com>
Subject: Re: [RFC PATCH] mm/migration: Don't lock anon vmas in rmap_walk_anon()
On 12/01/2012 04:49 AM, Ingo Molnar wrote:
>
> * Linus Torvalds <torvalds@...ux-foundation.org> wrote:
>
>> On Fri, Nov 30, 2012 at 11:58 AM, Ingo Molnar <mingo@...nel.org> wrote:
>>>
>>> When pushed hard enough via threaded workloads (for example
>>> via the numa02 test) then the upstream page migration code
>>> in mm/migration.c becomes unscalable, resulting in a lot of
>>> scheduling on the anon vma mutex and a subsequent drop in
>>> performance.
>>
>> Ugh.
>>
>> I wonder if migration really needs that thing to be a mutex? I
>> may be wrong, but the anon_vma lock only protects the actual
>> rmap chains, and migration only ever changes the pte
>> *contents*, not the actual chains of pte's themselves, right?
>>
>> So if this is a migration-specific scalability issue, then it
>> might be possible to solve by making the mutex be a rwsem
>> instead, and have migration only take it for reading.
>>
>> Of course, I'm quite possibly wrong, and the code depends on
>> full mutual exclusion.
>>
>> Just a thought, in case it makes somebody go "Hmm.."
>
> I *think* you are right that for the type of migration we are
> using here we indeed don't need to take an exclusive vma lock -
> in fact I think we don't need to take it at all:
>
> The main goal in the migration code is to unmap the pte from all
> threads' MMU visibility, before we copy its contents into
> another page [located on another node] and map that page into
> the page tables instead of the old page.
>
> No other thread may hold a write reference to the old page while
> the copy [migrate_page_copy()] is performed, or we subtly corrupt
> user-space memory by copying a slightly older version of it.
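>
> In (simplified) code, the sequence is roughly the following - a
> sketch only, with refcounting and error handling omitted:
>
>	/* 1) replace every pte mapping 'page' with a migration entry: */
>	try_to_unmap(page, TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS);
>
>	/* 2) no thread can modify 'page' now - copy it over: */
>	migrate_page_copy(newpage, page);
>
>	/* 3) point the ptes at 'newpage', removing the migration entries: */
>	remove_migration_ptes(page, newpage);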
>
> rmap_walk() OTOH appears to have been written as a
> general-purpose function, usable without holding the mmap_sem as
> well, so it protects itself against the disappearance of anon
> vmas.
>
> But ... in all upstream and NUMA-migration codepaths I could
> find - and AFAICS in all other page-migration codepaths as well,
> including sys_move_pages() - anon vmas cannot disappear from
> under us, because we are already holding the mmap_sem.
>
> [ Initially I assumed that swapout or filesystem code could
> somehow call this without holding the mmap_sem - but could not
> find any such code path. ]
>
> So I think we could get away rather simply, with something like
> the (entirely and utterly untested!) patch below.
>
> But ... judging from the code my feeling is this can only be the
> first (and easiest) step:
>
> 1)
>
> This patch might solve the remapping (remove_migration_ptes()),
> but does not solve the anon-vma locking done in the first,
> unmapping step of pte-migration - which is done via
> try_to_unmap() - a generic VM function used by swapout too, so
> its callers do not necessarily hold the mmap_sem.
>
> A new TTU flag might solve it, although I detest flag-driven
> locking semantics with a passion:
>
> Splitting out unlocked versions of try_to_unmap_anon(),
> try_to_unmap_ksm(), try_to_unmap_file() and constructing an
> unlocked try_to_unmap() out of them, to be used by the migration
> code, would be the cleaner option.
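>
> Roughly along these lines - illustrative prototypes only, none of
> these helpers exist today and the names are made up:
>
>	/* no-lock variants: skip the anon-vma/i_mmap locking and
>	 * rely on the caller holding the mmap_sem instead: */
>	int try_to_unmap_anon_nolock(struct page *page, enum ttu_flags flags);
>	int try_to_unmap_ksm_nolock(struct page *page, enum ttu_flags flags);
>	int try_to_unmap_file_nolock(struct page *page, enum ttu_flags flags);
>
>	/* same dispatch as the regular try_to_unmap(): */
>	int try_to_unmap_nolock(struct page *page, enum ttu_flags flags)
>	{
>		if (PageKsm(page))
>			return try_to_unmap_ksm_nolock(page, flags);
>		if (PageAnon(page))
>			return try_to_unmap_anon_nolock(page, flags);
>		return try_to_unmap_file_nolock(page, flags);
>	}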
>
> 2)
>
> Taking a process-global mutex 1024 times per 2MB was indeed very
> expensive - and let's assume that we manage to sort that out -
> but then we are AFAICS exposed to the next layer: the
> finegrained migrate_pages() model where the migration code
> flushes the TLB 512 times per 2MB to unmap and remap it again
> and again at 4K granularity ...
>
> Assuming the simpler patch goes fine I'll try to come up with
> something intelligent for the TLB flushing sub-problem too: we
> could in theory batch the migration TLB flushes as well, by
> first doing an array of 2MB granular unmaps, then copying up to
> 512x 4K pages, then doing the final 2MB granular [but still
> 4K-represented in the page tables] remap.
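>
> In pseudo-code, the batched variant could look something like
> this (the helpers are hypothetical - nothing like them exists
> yet):
>
>	/* 1) unmap the whole 2MB region with a single TLB flush: */
>	unmap_migration_region(mm, addr, addr + HPAGE_SIZE);
>
>	/* 2) copy up to 512 4K pages over to the target node: */
>	for (i = 0; i < HPAGE_SIZE/PAGE_SIZE; i++)
>		copy_highpage(new_pages[i], old_pages[i]);
>
>	/* 3) establish the new ptes [still 4K-represented in the
>	 *    page tables] with a second, final TLB flush: */
>	remap_migration_region(mm, addr, addr + HPAGE_SIZE, new_pages);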
>
> 2MB granular TLB flushing is OK for these workloads - I can see
> that in +THP tests.
>
> I will keep you updated about how far I manage to get down this
> road.
>
> Thanks,
>
> Ingo
>
> ---------------------------->
> Subject: mm/migration: Don't lock anon vmas in rmap_walk_anon()
> From: Ingo Molnar <mingo@...nel.org>
> Date: Thu Nov 22 14:16:26 CET 2012
>
> rmap_walk_anon() appears to be too careful about locking the anon
> vma for its own good - since all callers are holding the mmap_sem,
> no vma can go away from under us:
>
> - sys_move_pages() is doing down_read(&mm->mmap_sem) in the
> sys_move_pages() -> do_pages_move() -> do_move_page_to_node_array()
> code path, which then calls migrate_pages(pagelist), which then
> does unmap_and_move() for every page in the list, which does
> remove_migration_ptes(), which calls rmap.c::rmap_walk().
>
> - the NUMA migration code's migrate_misplaced_page(), which calls
> migrate_pages() ... try_to_unmap(), is holding the mm->mmap_sem
> read-locked by virtue of the low-level page fault handler taking
> it before calling handle_mm_fault().
>
> Removing this lock removes a global mutex from the hot path of
> migration-happy threaded workloads, where it can cause
> pathological performance like this:
>
> 96.43% process 0 [kernel.kallsyms] [k] perf_trace_sched_switch
> |
> --- perf_trace_sched_switch
> __schedule
> schedule
> schedule_preempt_disabled
> __mutex_lock_common.isra.6
> __mutex_lock_slowpath
> mutex_lock
> |
> |--50.61%-- rmap_walk
> | move_to_new_page
> | migrate_pages
> | migrate_misplaced_page
> | __do_numa_page.isra.69
> | handle_pte_fault
> | handle_mm_fault
> | __do_page_fault
> | do_page_fault
> | page_fault
> | __memset_sse2
> | |
> | --100.00%-- worker_thread
> | |
> | --100.00%-- start_thread
> |
> --49.39%-- page_lock_anon_vma
> try_to_unmap_anon
> try_to_unmap
> migrate_pages
> migrate_misplaced_page
> __do_numa_page.isra.69
> handle_pte_fault
> handle_mm_fault
> __do_page_fault
> do_page_fault
> page_fault
> __memset_sse2
> |
> --100.00%-- worker_thread
> start_thread
>
> Cc: Linus Torvalds <torvalds@...ux-foundation.org>
> Cc: Andrew Morton <akpm@...ux-foundation.org>
> Cc: Peter Zijlstra <a.p.zijlstra@...llo.nl>
> Cc: Andrea Arcangeli <aarcange@...hat.com>
> Cc: Rik van Riel <riel@...hat.com>
> Cc: Mel Gorman <mgorman@...e.de>
> Cc: Thomas Gleixner <tglx@...utronix.de>
> Cc: Hugh Dickins <hughd@...gle.com>
> Not-Yet-Signed-off-by: Ingo Molnar <mingo@...nel.org>
> ---
> mm/rmap.c | 13 +++++--------
> 1 file changed, 5 insertions(+), 8 deletions(-)
>
> Index: linux/mm/rmap.c
> ===================================================================
> --- linux.orig/mm/rmap.c
> +++ linux/mm/rmap.c
> @@ -1686,6 +1686,9 @@ void __put_anon_vma(struct anon_vma *ano
> /*
> * rmap_walk() and its helpers rmap_walk_anon() and rmap_walk_file():
> * Called by migrate.c to remove migration ptes, but might be used more later.
> + *
> + * Note: callers are expected to protect against anon vmas disappearing
> + * under us - by holding the mmap_sem read or write locked.
> */
> static int rmap_walk_anon(struct page *page, int (*rmap_one)(struct page *,
> struct vm_area_struct *, unsigned long, void *), void *arg)
I am not convinced this is enough.
The same anonymous page could be mapped into multiple processes
that inherited memory from the same (grand)parent process.
Holding the mmap_sem for one process does not protect against
manipulations of the anon_vma chain by sibling, child, or parent
processes.
We may need to turn the anon_vma lock into a rwsem, like Linus
suggested.
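
Something along these lines, perhaps - an untested sketch, assuming
the anon_vma mutex is turned into an rw_semaphore (the field and
helper names below are only illustrative):

	struct anon_vma {
		struct anon_vma *root;	/* root of the anon_vma tree */
		struct rw_semaphore rwsem;	/* was: struct mutex mutex */
		/* ... */
	};

	/* rmap-chain writers (fork, mmap, exit) take it exclusive: */
	static inline void anon_vma_lock_write(struct anon_vma *anon_vma)
	{
		down_write(&anon_vma->root->rwsem);
	}

	/* migration only rewrites pte contents, never the chain
	 * itself, so a shared lock keeps the chain stable while
	 * letting the migration paths run concurrently: */
	static inline void anon_vma_lock_read(struct anon_vma *anon_vma)
	{
		down_read(&anon_vma->root->rwsem);
	}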
--
All rights reversed