Message-ID: <55AE1285.4010600@suse.cz>
Date: Tue, 21 Jul 2015 11:36:05 +0200
From: Vlastimil Babka <vbabka@...e.cz>
To: David Rientjes <rientjes@...gle.com>,
Andrew Morton <akpm@...ux-foundation.org>
CC: Andrea Arcangeli <aarcange@...hat.com>,
Rik van Riel <riel@...hat.com>,
"Kirill A. Shutemov" <kirill.shutemov@...ux.intel.com>,
Mel Gorman <mgorman@...e.de>, linux-kernel@...r.kernel.org,
linux-mm@...ck.org
Subject: Re: [rfc] mm, thp: allow khugepaged to periodically compact memory
synchronously
On 07/15/2015 04:19 AM, David Rientjes wrote:
> We have seen a large benefit in the amount of hugepages that can be
> allocated at fault
That's understandable...
> and by khugepaged when memory is periodically
> compacted in the background.
... but for khugepaged it's surprising. Doesn't khugepaged (unlike page
faults) attempt the same sync compaction as your manual triggers?
> We trigger synchronous memory compaction over all memory every 15 minutes
> to keep fragmentation low and to offset the lightweight compaction that
> is done at page fault to keep latency low.
I'm surprised that 15 minutes is frequent enough to make a difference.
I'd expect it very much depends on the memory size and workload though.
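(For reference, I assume the manual trigger is essentially a loop writing to
/proc/sys/vm/compact_memory, which compacts all zones on all nodes
synchronously. A minimal userspace sketch, with the 15-minute period being
just the value you mention:)

#include <stdio.h>
#include <unistd.h>

int main(void)
{
	for (;;) {
		/* any write triggers a full synchronous compaction pass */
		FILE *f = fopen("/proc/sys/vm/compact_memory", "w");

		if (f) {
			fputs("1\n", f);
			fclose(f);
		}
		sleep(15 * 60);
	}
	return 0;
}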
> compact_sleep_millisecs controls how often khugepaged will compact all
> memory. On each scan_sleep_millisecs wakeup after this value has expired, a
> node is synchronously compacted, until all nodes have been compacted. Then,
> khugepaged will restart the process compact_sleep_millisecs later.
>
> This defaults to 0, which means no memory compaction is done.
Being another tunable that defaults to 0, it means that most people
won't use it at all, or their distro will provide some other value. We
should really strive to make it self-tuning based on e.g. memory
fragmentation statistics. But I know that my kcompactd proposal also
wasn't quite there yet...
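Just to sketch what I mean (purely hypothetical and untested, assuming
fragmentation_index() and sysctl_extfrag_threshold can be made visible in
huge_memory.c): instead of a timer, khugepaged could check whether a node
actually looks fragmented at the THP order before compacting it, reusing the
same statistic that compaction_suitable() consults:

/*
 * Hypothetical helper, not part of this patch: return true if any
 * populated zone on the node has a fragmentation index at the THP
 * order above the extfrag threshold, i.e. allocation failures there
 * would be due to fragmentation rather than lack of free memory.
 */
static bool khugepaged_node_fragmented(int nid)
{
	pg_data_t *pgdat = NODE_DATA(nid);
	int zoneid;

	for (zoneid = 0; zoneid < MAX_NR_ZONES; zoneid++) {
		struct zone *zone = &pgdat->node_zones[zoneid];

		if (!populated_zone(zone))
			continue;
		if (fragmentation_index(zone, HPAGE_PMD_ORDER) >
						sysctl_extfrag_threshold)
			return true;
	}
	return false;
}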
>
> Signed-off-by: David Rientjes <rientjes@...gle.com>
> ---
> RFC: this is for initial comment on whether it's appropriate to do this
> in khugepaged. We already do the background compaction for the
> benefit of thp, but others may feel like this belongs in a new per-node
> kcompactd thread as proposed by Vlastimil.
>
> Regardless, it appears there is a substantial need for periodic memory
> compaction in the background to reduce the latency of thp page faults
> and still have a reasonable chance of having the allocation succeed.
>
> We could also speed up this process in the case of alloc_sleep_millisecs
> timeout since allocation recently failed for khugepaged.
>
> Documentation/vm/transhuge.txt | 10 +++++++
> mm/huge_memory.c | 65 ++++++++++++++++++++++++++++++++++++++++++
> 2 files changed, 75 insertions(+)
>
> diff --git a/Documentation/vm/transhuge.txt b/Documentation/vm/transhuge.txt
> --- a/Documentation/vm/transhuge.txt
> +++ b/Documentation/vm/transhuge.txt
> @@ -170,6 +170,16 @@ A lower value leads to gain less thp performance. Value of
> max_ptes_none can waste cpu time very little, you can
> ignore it.
>
> +/sys/kernel/mm/transparent_hugepage/khugepaged/compact_sleep_millisecs
> +
> +controls how often khugepaged will utilize memory compaction to defragment
> +memory. This makes it easier to allocate hugepages both at page fault and
> +by khugepaged since this compaction can be synchronous.
> +
> +This only occurs if scan_sleep_millisecs is configured. One node per
> +scan_sleep_millisecs wakeup is compacted when compact_sleep_millisecs
> +expires until all memory has been compacted.
> +
> == Boot parameter ==
>
> You can change the sysfs boot time defaults of Transparent Hugepage
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -23,6 +23,7 @@
> #include <linux/pagemap.h>
> #include <linux/migrate.h>
> #include <linux/hashtable.h>
> +#include <linux/compaction.h>
>
> #include <asm/tlb.h>
> #include <asm/pgalloc.h>
> @@ -65,6 +66,16 @@ static DECLARE_WAIT_QUEUE_HEAD(khugepaged_wait);
> */
> static unsigned int khugepaged_max_ptes_none __read_mostly = HPAGE_PMD_NR-1;
>
> +/*
> + * Khugepaged may perform memory compaction over all memory at regular intervals.
> + * It round-robins through all nodes, compacting one at a time each
> + * scan_sleep_millisecs wakeup when triggered.
> + * May be set with compact_sleep_millisecs, which is disabled by default.
> + */
> +static unsigned long khugepaged_compact_sleep_millisecs __read_mostly;
> +static unsigned long khugepaged_compact_jiffies;
> +static int next_compact_node = MAX_NUMNODES;
> +
> static int khugepaged(void *none);
> static int khugepaged_slab_init(void);
> static void khugepaged_slab_exit(void);
> @@ -463,6 +474,34 @@ static struct kobj_attribute alloc_sleep_millisecs_attr =
> __ATTR(alloc_sleep_millisecs, 0644, alloc_sleep_millisecs_show,
> alloc_sleep_millisecs_store);
>
> +static ssize_t compact_sleep_millisecs_show(struct kobject *kobj,
> + struct kobj_attribute *attr,
> + char *buf)
> +{
> + return sprintf(buf, "%lu\n", khugepaged_compact_sleep_millisecs);
> +}
> +
> +static ssize_t compact_sleep_millisecs_store(struct kobject *kobj,
> + struct kobj_attribute *attr,
> + const char *buf, size_t count)
> +{
> + unsigned long msecs;
> + int err;
> +
> + err = kstrtoul(buf, 10, &msecs);
> + if (err || msecs > ULONG_MAX)
> + return -EINVAL;
> +
> + khugepaged_compact_sleep_millisecs = msecs;
> + khugepaged_compact_jiffies = jiffies + msecs_to_jiffies(msecs);
> + wake_up_interruptible(&khugepaged_wait);
> +
> + return count;
> +}
> +static struct kobj_attribute compact_sleep_millisecs_attr =
> + __ATTR(compact_sleep_millisecs, 0644, compact_sleep_millisecs_show,
> + compact_sleep_millisecs_store);
> +
> static ssize_t pages_to_scan_show(struct kobject *kobj,
> struct kobj_attribute *attr,
> char *buf)
> @@ -564,6 +603,7 @@ static struct attribute *khugepaged_attr[] = {
> &full_scans_attr.attr,
> &scan_sleep_millisecs_attr.attr,
> &alloc_sleep_millisecs_attr.attr,
> + &compact_sleep_millisecs_attr.attr,
> NULL,
> };
>
> @@ -652,6 +692,10 @@ static int __init hugepage_init(void)
> return 0;
> }
>
> + if (khugepaged_compact_sleep_millisecs)
> + khugepaged_compact_jiffies = jiffies +
> + msecs_to_jiffies(khugepaged_compact_sleep_millisecs);
> +
> err = start_stop_khugepaged();
> if (err)
> goto err_khugepaged;
> @@ -2834,6 +2878,26 @@ static void khugepaged_wait_work(void)
> wait_event_freezable(khugepaged_wait, khugepaged_wait_event());
> }
>
> +static void khugepaged_compact_memory(void)
> +{
> + if (!khugepaged_compact_jiffies ||
> + time_before(jiffies, khugepaged_compact_jiffies))
> + return;
> +
> + get_online_mems();
> + if (next_compact_node == MAX_NUMNODES)
> + next_compact_node = first_node(node_states[N_MEMORY]);
> +
> + compact_pgdat(NODE_DATA(next_compact_node), -1);
> +
> + next_compact_node = next_node(next_compact_node, node_states[N_MEMORY]);
> + put_online_mems();
> +
> + if (next_compact_node == MAX_NUMNODES)
> + khugepaged_compact_jiffies = jiffies +
> + msecs_to_jiffies(khugepaged_compact_sleep_millisecs);
> +}
> +
> static int khugepaged(void *none)
> {
> struct mm_slot *mm_slot;
> @@ -2842,6 +2906,7 @@ static int khugepaged(void *none)
> set_user_nice(current, MAX_NICE);
>
> while (!kthread_should_stop()) {
> + khugepaged_compact_memory();
> khugepaged_do_scan();
> khugepaged_wait_work();
> }
>
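Regarding the note above about speeding this up when khugepaged hits the
alloc_sleep_millisecs path: I guess that could be as simple as pulling the
next compaction round forward when the hugepage allocation fails, something
along these lines (hypothetical, on top of this patch, presumably called from
khugepaged_alloc_sleep() or wherever the failure is detected):

/*
 * Hypothetical, on top of this patch: when khugepaged's hugepage
 * allocation fails and it is about to sleep for alloc_sleep_millisecs,
 * schedule the next compaction round for the next wakeup instead of
 * waiting for compact_sleep_millisecs to expire.
 */
static void khugepaged_compact_soon(void)
{
	if (khugepaged_compact_sleep_millisecs)
		khugepaged_compact_jiffies = jiffies;
}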