Message-ID: <20210317083830.GC3881262@gmail.com>
Date: Wed, 17 Mar 2021 09:38:30 +0100
From: Ingo Molnar <mingo@...nel.org>
To: Nicholas Piggin <npiggin@...il.com>
Cc: linux-kernel@...r.kernel.org,
Andrew Morton <akpm@...ux-foundation.org>,
Linus Torvalds <torvalds@...ux-foundation.org>,
linux-mm@...ck.org, Anton Blanchard <anton@...abs.org>
Subject: Re: [PATCH v2] Increase page and bit waitqueue hash size

* Nicholas Piggin <npiggin@...il.com> wrote:

> The page waitqueue hash is a bit small (256 entries) on very big systems. A
> 16 socket 1536 thread POWER9 system was found to encounter hash collisions
> and excessive time in waitqueue locking at times. This was intermittent and
> hard to reproduce with the setup we had (very little real IO
> capacity). The theory is that sometimes (depending on allocation luck)
> important pages would happen to collide a lot in the hash, slowing down page
> locking, causing the problem to snowball.
>
> A small test case was made where threads would write and fsync different
> pages, generating just a small amount of contention across many pages.
>
> Increasing page waitqueue hash size to 262144 entries increased throughput
> by 182% while also reducing standard deviation 3x. perf before the increase:
>
> 36.23% [k] _raw_spin_lock_irqsave - -
> |
> |--34.60%--wake_up_page_bit
> | 0
> | iomap_write_end.isra.38
> | iomap_write_actor
> | iomap_apply
> | iomap_file_buffered_write
> | xfs_file_buffered_aio_write
> | new_sync_write
>
> 17.93% [k] native_queued_spin_lock_slowpath - -
> |
> |--16.74%--_raw_spin_lock_irqsave
> | |
> | --16.44%--wake_up_page_bit
> | iomap_write_end.isra.38
> | iomap_write_actor
> | iomap_apply
> | iomap_file_buffered_write
> | xfs_file_buffered_aio_write
>
> This patch uses alloc_large_system_hash to allocate a bigger system hash
> that scales somewhat with memory size. The bit/var wait-queue is also
> changed to keep code matching, albeit with a smaller scale factor.
>
> A very small CONFIG_BASE_SMALL option is also added because these are two
> of the biggest static objects in the image on very small systems.
>
> This hash could be made per-node, which may help reduce remote accesses
> on well localised workloads, but that adds some complexity with indexing
> and hotplug, so until we get a less artificial workload to test with,
> keep it simple.
>
> Signed-off-by: Nicholas Piggin <npiggin@...il.com>
> ---
>  kernel/sched/wait_bit.c | 30 +++++++++++++++++++++++-------
>  mm/filemap.c            | 24 +++++++++++++++++++++---
>  2 files changed, 44 insertions(+), 10 deletions(-)
>
> diff --git a/kernel/sched/wait_bit.c b/kernel/sched/wait_bit.c
> index 02ce292b9bc0..dba73dec17c4 100644
> --- a/kernel/sched/wait_bit.c
> +++ b/kernel/sched/wait_bit.c
> @@ -2,19 +2,24 @@
>  /*
>   * The implementation of the wait_bit*() and related waiting APIs:
>   */
> +#include <linux/memblock.h>
>  #include "sched.h"
>
> -#define WAIT_TABLE_BITS 8
> -#define WAIT_TABLE_SIZE (1 << WAIT_TABLE_BITS)

Ugh, 256 entries is almost embarrassingly small indeed.

I've put your patch into sched/core, unless Andrew is objecting.

> -	for (i = 0; i < WAIT_TABLE_SIZE; i++)
> +	if (!CONFIG_BASE_SMALL) {
> +		bit_wait_table = alloc_large_system_hash("bit waitqueue hash",
> +							sizeof(wait_queue_head_t),
> +							0,
> +							22,
> +							0,
> +							&bit_wait_table_bits,
> +							NULL,
> +							0,
> +							0);
> +	}
> +	for (i = 0; i < BIT_WAIT_TABLE_SIZE; i++)
>  		init_waitqueue_head(bit_wait_table + i);

Meta suggestion: maybe the CONFIG_BASE_SMALL ugliness could be folded
into alloc_large_system_hash() itself?
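
One possible shape (purely an illustrative sketch, assuming
alloc_large_system_hash() itself learned to clamp the size on
CONFIG_BASE_SMALL builds, which it does not do today) would let the
call site drop the special case:

	/* wait_bit_init(), with no CONFIG_BASE_SMALL branch in the caller: */
	bit_wait_table = alloc_large_system_hash("bit waitqueue hash",
						 sizeof(wait_queue_head_t),
						 0,	/* numentries: auto-size from memory */
						 22,	/* scale: ~1 bucket per 4MB (4K pages) */
						 0,	/* flags */
						 &bit_wait_table_bits,
						 NULL,	/* no hash mask needed */
						 0, 0);	/* no low/high limits */

	for (i = 0; i < (1UL << bit_wait_table_bits); i++)
		init_waitqueue_head(bit_wait_table + i);
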
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
>  static wait_queue_head_t *page_waitqueue(struct page *page)
>  {
> -	return &page_wait_table[hash_ptr(page, PAGE_WAIT_TABLE_BITS)];
> +	return &page_wait_table[hash_ptr(page, page_wait_table_bits)];
>  }

I'm wondering whether you've tried to make this NUMA aware through
page->node?

Seems like another useful step when having a global hash ...
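
One possible shape (rough sketch only; node_page_wait_table[] and the
per-node sizing are made-up names for illustration, and the
hotplug/indexing complexity mentioned in the changelog is ignored):
index a per-node table via page_to_nid() and keep the existing hash
within each node:

	/* one waitqueue hash per node, allocated when the node comes online */
	static wait_queue_head_t *node_page_wait_table[MAX_NUMNODES] __read_mostly;
	static unsigned int page_wait_table_bits __read_mostly;

	static wait_queue_head_t *page_waitqueue(struct page *page)
	{
		wait_queue_head_t *table = node_page_wait_table[page_to_nid(page)];

		return &table[hash_ptr(page, page_wait_table_bits)];
	}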

Thanks,

	Ingo