[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20130205191102.GM21389@suse.de>
Date: Tue, 5 Feb 2013 19:11:02 +0000
From: Mel Gorman <mgorman@...e.de>
To: Hugh Dickins <hughd@...gle.com>
Cc: Andrew Morton <akpm@...ux-foundation.org>,
Petr Holasek <pholasek@...hat.com>,
Andrea Arcangeli <aarcange@...hat.com>,
Izik Eidus <izik.eidus@...ellosystems.com>,
linux-kernel@...r.kernel.org, linux-mm@...ck.org
Subject: Re: [PATCH 7/11] ksm: make KSM page migration possible
On Fri, Jan 25, 2013 at 06:03:31PM -0800, Hugh Dickins wrote:
> KSM page migration is already supported in the case of memory hotremove,
> which takes the ksm_thread_mutex across all its migrations to keep life
> simple.
>
> But the new KSM NUMA merge_across_nodes knob introduces a problem, when
> it's set to non-default 0: if a KSM page is migrated to a different NUMA
> node, how do we migrate its stable node to the right tree? And what if
> that collides with an existing stable node?
>
> So far there's no provision for that, and this patch does not attempt
> to deal with it either. But how will I test a solution, when I don't
> know how to hotremove memory?
Just reach in and yank it straight out with a chisel.
> The best answer is to enable KSM page
> migration in all cases now, and test more common cases. With THP and
> compaction added since KSM came in, page migration is now mainstream,
> and it's a shame that a KSM page can frustrate freeing a page block.
>
THP will at least check if migration within a node works. It won't
necessarily check we can migrate across nodes properly but it's a lot
better than nothing.
> Without worrying about merge_across_nodes 0 for now, this patch gets
> KSM page migration working reliably for default merge_across_nodes 1
> (but leave the patch enabling it until near the end of the series).
>
> It's much simpler than I'd originally imagined, and does not require
> an additional tier of locking: page migration relies on the page lock,
> KSM page reclaim relies on the page lock, the page lock is enough for
> KSM page migration too.
>
> Almost all the care has to be in get_ksm_page(): that's the function
> which worries about when a stable node is stale and should be freed,
> now it also has to worry about the KSM page being migrated.
>
> The only new overhead is an additional put/get/lock/unlock_page when
> stable_tree_search() arrives at a matching node: to make sure migration
> respects the raised page count, and so does not migrate the page while
> we're busy with it here. That's probably avoidable, either by changing
> internal interfaces from using kpage to stable_node, or by moving the
> ksm_migrate_page() callsite into a page_freeze_refs() section (even if
> not swapcache); but this works well, I've no urge to pull it apart now.
>
> (Descents of the stable tree may pass through nodes whose KSM pages are
> under migration: being unlocked, the raised page count does not prevent
> that, nor need it: it's safe to memcmp against either old or new page.)
>
> You might worry about mremap, and whether page migration's rmap_walk
> to remove migration entries will find all the KSM locations where it
> inserted earlier: that should already be handled, by the satisfyingly
> heavy hammer of move_vma()'s call to ksm_madvise(,,,MADV_UNMERGEABLE,).
>
> Signed-off-by: Hugh Dickins <hughd@...gle.com>
> ---
> mm/ksm.c | 94 ++++++++++++++++++++++++++++++++++++++-----------
> mm/migrate.c | 5 ++
> 2 files changed, 77 insertions(+), 22 deletions(-)
>
> --- mmotm.orig/mm/ksm.c 2013-01-25 14:37:00.768206145 -0800
> +++ mmotm/mm/ksm.c 2013-01-25 14:37:03.832206218 -0800
> @@ -499,6 +499,7 @@ static void remove_node_from_stable_tree
> * In which case we can trust the content of the page, and it
> * returns the gotten page; but if the page has now been zapped,
> * remove the stale node from the stable tree and return NULL.
> + * But beware, the stable node's page might be being migrated.
> *
> * You would expect the stable_node to hold a reference to the ksm page.
> * But if it increments the page's count, swapping out has to wait for
> @@ -509,44 +510,77 @@ static void remove_node_from_stable_tree
> * pointing back to this stable node. This relies on freeing a PageAnon
> * page to reset its page->mapping to NULL, and relies on no other use of
> * a page to put something that might look like our key in page->mapping.
> - *
> - * include/linux/pagemap.h page_cache_get_speculative() is a good reference,
> - * but this is different - made simpler by ksm_thread_mutex being held, but
> - * interesting for assuming that no other use of the struct page could ever
> - * put our expected_mapping into page->mapping (or a field of the union which
> - * coincides with page->mapping).
> - *
> - * Note: it is possible that get_ksm_page() will return NULL one moment,
> - * then page the next, if the page is in between page_freeze_refs() and
> - * page_unfreeze_refs(): this shouldn't be a problem anywhere, the page
> * is on its way to being freed; but it is an anomaly to bear in mind.
> */
> static struct page *get_ksm_page(struct stable_node *stable_node, bool locked)
> {
> struct page *page;
> void *expected_mapping;
> + unsigned long kpfn;
>
> - page = pfn_to_page(stable_node->kpfn);
> expected_mapping = (void *)stable_node +
> (PAGE_MAPPING_ANON | PAGE_MAPPING_KSM);
> - if (page->mapping != expected_mapping)
> - goto stale;
> - if (!get_page_unless_zero(page))
> +again:
> + kpfn = ACCESS_ONCE(stable_node->kpfn);
> + page = pfn_to_page(kpfn);
> +
Ok.
There should be no concern that hot-remove made the kpfn invalid because
those stable tree entries should have been discarded.
> + /*
> + * page is computed from kpfn, so on most architectures reading
> + * page->mapping is naturally ordered after reading node->kpfn,
> + * but on Alpha we need to be more careful.
> + */
> + smp_read_barrier_depends();
The value of page is data dependant on pfn_to_page(). Is it really possible
for that to be re-ordered even on Alpha?
> + if (ACCESS_ONCE(page->mapping) != expected_mapping)
> goto stale;
> - if (page->mapping != expected_mapping) {
> +
> + /*
> + * We cannot do anything with the page while its refcount is 0.
> + * Usually 0 means free, or tail of a higher-order page: in which
> + * case this node is no longer referenced, and should be freed;
> + * however, it might mean that the page is under page_freeze_refs().
> + * The __remove_mapping() case is easy, again the node is now stale;
> + * but if page is swapcache in migrate_page_move_mapping(), it might
> + * still be our page, in which case it's essential to keep the node.
> + */
> + while (!get_page_unless_zero(page)) {
> + /*
> + * Another check for page->mapping != expected_mapping would
> + * work here too. We have chosen the !PageSwapCache test to
> + * optimize the common case, when the page is or is about to
> + * be freed: PageSwapCache is cleared (under spin_lock_irq)
> + * in the freeze_refs section of __remove_mapping(); but Anon
> + * page->mapping reset to NULL later, in free_pages_prepare().
> + */
> + if (!PageSwapCache(page))
> + goto stale;
> + cpu_relax();
> + }
The recheck of stable_node->kpfn check after a barrier distinguishes between
a free and a completed migration, that's fine. I'm hesitate to ask because
it must be obvious but where is the guarantee that a KSM page is in the
swap cache?
> +
> + if (ACCESS_ONCE(page->mapping) != expected_mapping) {
> put_page(page);
> goto stale;
> }
> +
> if (locked) {
> lock_page(page);
> - if (page->mapping != expected_mapping) {
> + if (ACCESS_ONCE(page->mapping) != expected_mapping) {
> unlock_page(page);
> put_page(page);
> goto stale;
> }
> }
> return page;
> +
> stale:
> + /*
> + * We come here from above when page->mapping or !PageSwapCache
> + * suggests that the node is stale; but it might be under migration.
> + * We need smp_rmb(), matching the smp_wmb() in ksm_migrate_page(),
> + * before checking whether node->kpfn has been changed.
> + */
> + smp_rmb();
> + if (ACCESS_ONCE(stable_node->kpfn) != kpfn)
> + goto again;
> remove_node_from_stable_tree(stable_node);
> return NULL;
> }
> @@ -1103,15 +1137,25 @@ static struct page *stable_tree_search(s
> return NULL;
>
> ret = memcmp_pages(page, tree_page);
> + put_page(tree_page);
>
> - if (ret < 0) {
> - put_page(tree_page);
> + if (ret < 0)
> node = node->rb_left;
> - } else if (ret > 0) {
> - put_page(tree_page);
> + else if (ret > 0)
> node = node->rb_right;
> - } else
> + else {
> + /*
> + * Lock and unlock the stable_node's page (which
> + * might already have been migrated) so that page
> + * migration is sure to notice its raised count.
> + * It would be more elegant to return stable_node
> + * than kpage, but that involves more changes.
> + */
> + tree_page = get_ksm_page(stable_node, true);
> + if (tree_page)
> + unlock_page(tree_page);
> return tree_page;
> + }
> }
>
> return NULL;
> @@ -1903,6 +1947,14 @@ void ksm_migrate_page(struct page *newpa
> if (stable_node) {
> VM_BUG_ON(stable_node->kpfn != page_to_pfn(oldpage));
> stable_node->kpfn = page_to_pfn(newpage);
> + /*
> + * newpage->mapping was set in advance; now we need smp_wmb()
> + * to make sure that the new stable_node->kpfn is visible
> + * to get_ksm_page() before it can see that oldpage->mapping
> + * has gone stale (or that PageSwapCache has been cleared).
> + */
> + smp_wmb();
> + set_page_stable_node(oldpage, NULL);
> }
> }
> #endif /* CONFIG_MIGRATION */
> --- mmotm.orig/mm/migrate.c 2013-01-25 14:27:58.140193249 -0800
> +++ mmotm/mm/migrate.c 2013-01-25 14:37:03.832206218 -0800
> @@ -464,7 +464,10 @@ void migrate_page_copy(struct page *newp
>
> mlock_migrate_page(newpage, page);
> ksm_migrate_page(newpage, page);
> -
> + /*
> + * Please do not reorder this without considering how mm/ksm.c's
> + * get_ksm_page() depends upon ksm_migrate_page() and PageSwapCache().
> + */
> ClearPageSwapCache(page);
> ClearPagePrivate(page);
> set_page_private(page, 0);
>
--
Mel Gorman
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Powered by blists - more mailing lists