Message-ID: <20161221223056.17c37dd6@roar.ozlabs.ibm.com>
Date: Wed, 21 Dec 2016 22:30:56 +1000
From: Nicholas Piggin <npiggin@...il.com>
To: Dave Hansen <dave.hansen@...ux.intel.com>
Cc: linux-kernel@...r.kernel.org, linux-mm@...ck.org,
agruenba@...hat.com, rpeterso@...hat.com,
mgorman@...hsingularity.net, peterz@...radead.org, luto@...nel.org,
swhiteho@...hat.com, torvalds@...ux-foundation.org
Subject: Re: [RFC][PATCH] make global bitlock waitqueues per-node
On Mon, 19 Dec 2016 14:58:26 -0800
Dave Hansen <dave.hansen@...ux.intel.com> wrote:
> I saw a 4.8->4.9 regression (details below) that I attributed to:
>
> 9dcb8b685f mm: remove per-zone hashtable of bitlock waitqueues
>
> That commit took the bitlock waitqueues from being dynamically-allocated
> per-zone to being statically allocated and global. As suggested by
> Linus, this makes them per-node, but keeps them statically-allocated.
>
> It leaves us with more waitqueues than the global approach, inherently
> scales their number up as we gain nodes, and avoids generating code for
> page_zone(), which was evidently quite ugly.  The patch is pretty darn
> tiny, too.
>
> This turns what was a ~40% 4.8->4.9 regression into a 17% gain over
> what 4.8 did.  That gain is a _bit_ surprising, but not entirely
> unexpected, since we now get much simpler code: no page_zone() call, and
> a fixed-size array that we don't have to follow a pointer to reach (and
> that lets us do power-of-2 math).
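
For anyone skimming, the "simpler code" / power-of-2 point is roughly the
difference between the two lookups below.  This is a sketch only: the 4.8
per-zone side is written from memory rather than copied, so field names may
be slightly off, and the per-node side just mirrors the patch further down.

/* 4.8-ish: decode page_zone(), then chase a runtime-sized, runtime-allocated table. */
wait_queue_head_t *old_page_waitqueue(struct page *page)
{
	const struct zone *zone = page_zone(page);

	return &zone->wait_table[hash_ptr(page, zone->wait_table_bits)];
}

/* This patch: fixed-size table embedded in pg_data_t, compile-time bits. */
wait_queue_head_t *new_page_waitqueue(struct page *page)
{
	const int shift = BITS_PER_LONG == 32 ? 5 : 6;
	unsigned long val = (unsigned long)page << shift;	/* bit == 0 */

	return &NODE_DATA(page_to_nid(page))->wait_table[hash_long(val, WAIT_TABLE_BITS)];
}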
>
> This boots in a small VM and on a multi-node NUMA system, but has not
> been tested widely. It definitely needs to be validated to make sure
> it properly initializes the new structure in the various node hotplug
> cases before anybody goes applying it.
>
> Original report below (sent privately by accident).
>
> include/linux/mmzone.h | 5 +++++
> kernel/sched/core.c | 16 ----------------
> mm/filemap.c | 22 +++++++++++++++++++++-
> mm/page_alloc.c | 5 +++++
> 4 files changed, 31 insertions(+), 17 deletions(-)
>
> ---
>
> I'm seeing a 4.8->4.9 regression:
>
> https://www.sr71.net/~dave/intel/2016-12-19-page_fault_3.processes.png
>
> This is on a 160-thread 8-socket system.  The workload is a bunch of
> processes faulting pages into separate MAP_SHARED mappings:
>
> https://github.com/antonblanchard/will-it-scale/blob/master/tests/page_fault3.c
>
> The smoking gun is that __wake_up_bit() (called via unlock_page())
> goes from ~1% to 4% in the profiles.
>
> The workload is pretty maniacally touching random parts of the
> waitqueue table, and missing the cache heavily.  Now that the table is
> shared, I suspect we're transferring cachelines across node boundaries
> in a way that we were not with the per-zone waitqueues.
>
> This is *only* showing up with MAP_SHARED pages, not with anonymous
> pages. I think we do a lock_page()/unlock_page() pair in
> do_shared_fault(), which we avoid in the anonymous case. Reverting:
>
> 9dcb8b685f mm: remove per-zone hashtable of bitlock waitqueues
>
> restores things to 4.8 behavior. The fact that automated testing didn't
> catch this probably means that it's pretty rare to find hardware that
> actually shows the problem, so I don't think it's worth reverting
> anything in mainline.
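
To make the MAP_SHARED vs. anonymous difference concrete: the shared file
fault path ends up bracketing each fault with a page lock cycle, roughly
like the simplified sketch below.  This is not the real do_shared_fault()
(the signature and details are made up), just the shape of it.

/* Grossly simplified sketch of a write fault on a MAP_SHARED file page. */
static int shared_fault_sketch(struct vm_fault *vmf, struct page *page)
{
	lock_page(page);	/* serialize against truncate/writeback */
	/* ... ->page_mkwrite(), install the pte, dirty accounting ... */
	unlock_page(page);	/* hashes into page_waitqueue() and may wake */
	return 0;
}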
I've been doing a bit of testing, and I don't know why you're seeing
this.
I don't think I've been able to trigger any actual page lock contention,
so nothing gets put on the waitqueue to really bounce cache lines around
that I can see. Yes, there are some loads to check the waitqueue, but
those should be cached read-shared among CPUs, so I would have thought
they wouldn't cause any interconnect traffic.
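
(For reference, the wake side is roughly the following -- quoting
kernel/sched/wait.c from memory rather than verbatim.  The point is that
the common case is a single waitqueue_active() load of the hashed bucket,
which stays read-shared in everyone's caches until a waiter actually
queues itself.)

void __wake_up_bit(wait_queue_head_t *wq, void *word, int bit)
{
	struct wait_bit_key key = __WAIT_BIT_KEY_INITIALIZER(word, bit);

	/* Only take the waitqueue lock if someone is actually waiting. */
	if (waitqueue_active(wq))
		__wake_up(wq, TASK_NORMAL, 1, &key);
}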
After testing my PageWaiters patch, I'm maybe seeing a 2% improvement on
this workload (although powerpc doesn't do so well on this one due to
virtual memory management overheads -- maybe x86 will show a bit more).
But the point is that after this, there should be no waitqueue activity
at all. I haven't chased up a system with a lot of IOPS connected, which
would be needed to realistically test the contended cases (which I thought
I needed in order to show an improvement from per-node tables, but your
test suggests not...).
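
To make "no waitqueue activity" concrete: the idea is that the unlock fast
path only goes near the hashed waitqueue when a sleeper has flagged itself
first, roughly like the sketch below.  This is not the actual patch, it
ignores the memory-ordering and race details the real thing has to get
right, and PG_waiters is a placeholder name here.

static inline void unlock_page_sketch(struct page *page)
{
	clear_bit_unlock(PG_locked, &page->flags);
	smp_mb__after_atomic();
	/* PG_waiters is set by would-be lockers before they go to sleep. */
	if (test_bit(PG_waiters, &page->flags))
		__wake_up_bit(page_waitqueue(page), &page->flags, PG_locked);
}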
Thanks,
Nick
>
> But, the commit says:
>
> > As part of that earlier discussion, we had a much better solution for
> > the NUMA scalability issue - by just making the page lock have a
> > separate contention bit, the waitqueue doesn't even have to be looked at
> > for the normal case.
>
> So, maybe we should do that moving forward, since we at least found one
> case that's pretty adversely affected.
>
> Cc: Andreas Gruenbacher <agruenba@...hat.com>
> Cc: Bob Peterson <rpeterso@...hat.com>
> Cc: Mel Gorman <mgorman@...hsingularity.net>
> Cc: Peter Zijlstra <peterz@...radead.org>
> Cc: Andy Lutomirski <luto@...nel.org>
> Cc: Steven Whitehouse <swhiteho@...hat.com>
> Cc: Linus Torvalds <torvalds@...ux-foundation.org>
>
> ---
>
> b/include/linux/mmzone.h | 5 +++++
> b/kernel/sched/core.c | 16 ----------------
> b/mm/filemap.c | 22 +++++++++++++++++++++-
> b/mm/page_alloc.c | 5 +++++
> 4 files changed, 31 insertions(+), 17 deletions(-)
>
> diff -puN include/linux/mmzone.h~static-per-zone-waitqueue include/linux/mmzone.h
> --- a/include/linux/mmzone.h~static-per-zone-waitqueue 2016-12-19 11:35:12.210823059 -0800
> +++ b/include/linux/mmzone.h 2016-12-19 13:11:53.335170271 -0800
> @@ -27,6 +27,9 @@
> #endif
> #define MAX_ORDER_NR_PAGES (1 << (MAX_ORDER - 1))
>
> +#define WAIT_TABLE_BITS 8
> +#define WAIT_TABLE_SIZE (1 << WAIT_TABLE_BITS)
> +
> /*
> * PAGE_ALLOC_COSTLY_ORDER is the order at which allocations are deemed
> * costly to service. That is between allocation orders which should
> @@ -662,6 +665,8 @@ typedef struct pglist_data {
> unsigned long min_slab_pages;
> #endif /* CONFIG_NUMA */
>
> + wait_queue_head_t wait_table[WAIT_TABLE_SIZE];
> +
> /* Write-intensive fields used by page reclaim */
> ZONE_PADDING(_pad1_)
> spinlock_t lru_lock;
> diff -puN kernel/sched/core.c~static-per-zone-waitqueue kernel/sched/core.c
> --- a/kernel/sched/core.c~static-per-zone-waitqueue 2016-12-19 11:35:12.212823149 -0800
> +++ b/kernel/sched/core.c 2016-12-19 11:35:12.225823738 -0800
> @@ -7509,27 +7509,11 @@ static struct kmem_cache *task_group_cac
> DECLARE_PER_CPU(cpumask_var_t, load_balance_mask);
> DECLARE_PER_CPU(cpumask_var_t, select_idle_mask);
>
> -#define WAIT_TABLE_BITS 8
> -#define WAIT_TABLE_SIZE (1 << WAIT_TABLE_BITS)
> -static wait_queue_head_t bit_wait_table[WAIT_TABLE_SIZE] __cacheline_aligned;
> -
> -wait_queue_head_t *bit_waitqueue(void *word, int bit)
> -{
> - const int shift = BITS_PER_LONG == 32 ? 5 : 6;
> - unsigned long val = (unsigned long)word << shift | bit;
> -
> - return bit_wait_table + hash_long(val, WAIT_TABLE_BITS);
> -}
> -EXPORT_SYMBOL(bit_waitqueue);
> -
> void __init sched_init(void)
> {
> int i, j;
> unsigned long alloc_size = 0, ptr;
>
> - for (i = 0; i < WAIT_TABLE_SIZE; i++)
> - init_waitqueue_head(bit_wait_table + i);
> -
> #ifdef CONFIG_FAIR_GROUP_SCHED
> alloc_size += 2 * nr_cpu_ids * sizeof(void **);
> #endif
> diff -puN mm/filemap.c~static-per-zone-waitqueue mm/filemap.c
> --- a/mm/filemap.c~static-per-zone-waitqueue 2016-12-19 11:35:12.215823285 -0800
> +++ b/mm/filemap.c 2016-12-19 14:51:23.881814379 -0800
> @@ -779,6 +779,26 @@ EXPORT_SYMBOL(__page_cache_alloc);
> #endif
>
> /*
> + * We need 'nid' because page_waitqueue() needs to get the waitqueue
> + * for memory where virt_to_page() does not work, like highmem.
> + */
> +static wait_queue_head_t *__bit_waitqueue(void *word, int bit, int nid)
> +{
> + const int shift = BITS_PER_LONG == 32 ? 5 : 6;
> + unsigned long val = (unsigned long)word << shift | bit;
> +
> + return &NODE_DATA(nid)->wait_table[hash_long(val, WAIT_TABLE_BITS)];
> +}
> +
> +wait_queue_head_t *bit_waitqueue(void *word, int bit)
> +{
> + const int __maybe_unused nid = page_to_nid(virt_to_page(word));
> +
> + return __bit_waitqueue(word, bit, nid);
> +}
> +EXPORT_SYMBOL(bit_waitqueue);
> +
> +/*
> * In order to wait for pages to become available there must be
> * waitqueues associated with pages. By using a hash table of
> * waitqueues where the bucket discipline is to maintain all
> @@ -790,7 +810,7 @@ EXPORT_SYMBOL(__page_cache_alloc);
> */
> wait_queue_head_t *page_waitqueue(struct page *page)
> {
> - return bit_waitqueue(page, 0);
> + return __bit_waitqueue(page, 0, page_to_nid(page));
> }
> EXPORT_SYMBOL(page_waitqueue);
>
> diff -puN mm/page_alloc.c~static-per-zone-waitqueue mm/page_alloc.c
> --- a/mm/page_alloc.c~static-per-zone-waitqueue 2016-12-19 11:35:12.219823466 -0800
> +++ b/mm/page_alloc.c 2016-12-19 13:10:56.587613213 -0800
> @@ -5872,6 +5872,7 @@ void __paginginit free_area_init_node(in
> pg_data_t *pgdat = NODE_DATA(nid);
> unsigned long start_pfn = 0;
> unsigned long end_pfn = 0;
> + int i;
>
> /* pg_data_t should be reset to zero when it's allocated */
> WARN_ON(pgdat->nr_zones || pgdat->kswapd_classzone_idx);
> @@ -5892,6 +5893,10 @@ void __paginginit free_area_init_node(in
> zones_size, zholes_size);
>
> alloc_node_mem_map(pgdat);
> +
> + /* per-node page waitqueue initialization: */
> + for (i = 0; i < WAIT_TABLE_SIZE; i++)
> + init_waitqueue_head(&pgdat->wait_table[i]);
> #ifdef CONFIG_FLAT_NODE_MEM_MAP
> printk(KERN_DEBUG "free_area_init_node: node %d, pgdat %08lx, node_mem_map %08lx\n",
> nid, (unsigned long)pgdat,
> _
>