Date:	Tue, 1 Mar 2016 13:42:06 +0100
From:	Vlastimil Babka <vbabka@...e.cz>
To:	Mel Gorman <mgorman@...hsingularity.net>,
	Andrew Morton <akpm@...ux-foundation.org>
Cc:	Rik van Riel <riel@...hat.com>,
	Johannes Weiner <hannes@...xchg.org>,
	Andrea Arcangeli <aarcange@...hat.com>,
	Linux-MM <linux-mm@...ck.org>,
	LKML <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH 1/1] mm: thp: Set THP defrag by default to madvise and add
 a stall-free defrag option

On 02/26/2016 05:15 PM, Mel Gorman wrote:
> Changelog since v1
> o Default defrag to madvise instead of never
> o Introduce "defer" defrag option to wake kswapd/kcompact if THP is unavailable
> o Restore "always" to historical behaviour
> o Update documentation
>
> THP defrag is enabled by default to direct reclaim/compact but not wake
> kswapd in the event of a THP allocation failure. The problem is that THP
> allocation requests potentially enter reclaim/compaction. This potentially
> incurs a severe stall that is not guaranteed to be offset by reduced TLB
> misses. While there has been considerable effort to reduce the impact
> of reclaim/compaction, it is still a high cost and workloads that should
> fit in memory fail to do so. Specifically, a simple anon/file streaming
> workload will enter direct reclaim on NUMA at least even though the working
> set size is 80% of RAM. It's been years and it's time to throw in the towel.
>
> First, this patch defines THP defrag as follows;
>
> madvise: A failed allocation will direct reclaim/compact if the application requests it
> never:   Neither reclaim/compact nor wake kswapd
> defer:   A failed allocation will wake kswapd/kcompactd
> always:  A failed allocation will direct reclaim/compact (historical behaviour)
> khugepaged defrag will enter direct reclaim but not wake kswapd.
>
> Next it sets the default defrag option to be "madvise" to only enter direct
> reclaim/compaction for applications that specifically requested it.
>
> Lastly, it removes a check from the page allocator slowpath that is related
> to __GFP_THISNODE to allow "defer" to work. The callers that really care are
> slub/slab and they are updated accordingly. The slab one may be surprising
> because it also corrects a comment as kswapd was never woken up by that path.

It would also be nice if we could remove the is_thp_gfp_mask() checks one day. 
They try to make direct reclaim/compaction for THP less intrusive, but maybe 
they could be removed now that the stalls are limited otherwise. But that's out 
of scope here.

[...]

> diff --git a/Documentation/vm/transhuge.txt b/Documentation/vm/transhuge.txt
> index 8a282687ee06..a19b173cbc57 100644
> --- a/Documentation/vm/transhuge.txt
> +++ b/Documentation/vm/transhuge.txt
> @@ -113,9 +113,26 @@ guaranteed, but it may be more likely in case the allocation is for a
>   MADV_HUGEPAGE region.
>
>   echo always >/sys/kernel/mm/transparent_hugepage/defrag
> +echo defer >/sys/kernel/mm/transparent_hugepage/defrag
>   echo madvise >/sys/kernel/mm/transparent_hugepage/defrag
>   echo never >/sys/kernel/mm/transparent_hugepage/defrag
>
> +"always" means that an application requesting THP will stall on allocation
> +failure and directly reclaim pages and compact memory in an effort to
> +allocate a THP immediately. This may be desirable for virtual machines
> +that benefit heavily from THP use and are willing to delay the VM start
> +to utilise them.
> +
> +"defer" means that an application will wake kswapd in the background
> +to reclaim pages and wake kcompactd to compact memory so that THP is
> +available in the near future. It's the responsibility of khugepaged
> +to then install the THP pages later.
> +
> +"madvise" will enter direct reclaim like "always" but only for regions
> +that have used madvise(). This is the default behaviour.

"madvise(MADV_HUGEPAGE)" perhaps?

[...]

> @@ -277,17 +273,23 @@ static ssize_t double_flag_store(struct kobject *kobj,
>   static ssize_t enabled_show(struct kobject *kobj,
>   			    struct kobj_attribute *attr, char *buf)
>   {
> -	return double_flag_show(kobj, attr, buf,
> -				TRANSPARENT_HUGEPAGE_FLAG,
> -				TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG);
> +	if (test_bit(TRANSPARENT_HUGEPAGE_FLAG, &transparent_hugepage_flags)) {
> +		VM_BUG_ON(test_bit(TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG, &transparent_hugepage_flags));
> +		return sprintf(buf, "[always] madvise never\n");
> +	} else if (test_bit(TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG, &transparent_hugepage_flags))
> +		return sprintf(buf, "always [madvise] never\n");
> +	else
> +		return sprintf(buf, "always madvise [never]\n");

Somewhat ugly wrt consistent usage of { }, due to the VM_BUG_ON(), which I would 
just drop. Also I wonder if some racy read vs write of the file can trigger the 
BUG_ON? Or are the kobject accesses synchronized at a higher level?
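
Dropping the VM_BUG_ON() would also restore consistent bracing. A userspace 
sketch of what I mean (test_bit() and the flag bits are modeled here with a 
plain bitmask and made-up names, not the real kernel API):

```c
#include <stdio.h>

/* Stand-ins for the kernel's flag bits and flag bitmap (names are made up). */
enum { THP_FLAG, THP_REQ_MADV_FLAG };
static unsigned long thp_flags_model;

static int test_bit_model(int bit, const unsigned long *flags)
{
	return (int)((*flags >> bit) & 1UL);
}

/* enabled_show() without the VM_BUG_ON(): one bracing style throughout. */
static int enabled_show_model(char *buf)
{
	if (test_bit_model(THP_FLAG, &thp_flags_model))
		return sprintf(buf, "[always] madvise never\n");
	else if (test_bit_model(THP_REQ_MADV_FLAG, &thp_flags_model))
		return sprintf(buf, "always [madvise] never\n");
	else
		return sprintf(buf, "always madvise [never]\n");
}
```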

>   }
> +
>   static ssize_t enabled_store(struct kobject *kobj,
>   			     struct kobj_attribute *attr,
>   			     const char *buf, size_t count)
>   {
>   	ssize_t ret;
>
> -	ret = double_flag_store(kobj, attr, buf, count,
> +	ret = triple_flag_store(kobj, attr, buf, count,
> +				TRANSPARENT_HUGEPAGE_FLAG,
>   				TRANSPARENT_HUGEPAGE_FLAG,
>   				TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG);

So this makes "echo defer > enabled" behave just like "echo always"? For a 
userspace interface that becomes a fixed ABI, I would prefer to be more careful 
about unintended aliases like this. Maybe pass something like "-1" that 
triple_flag_store() would check in the "defer" case and return -EINVAL?
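
Concretely, something like this (a userspace model of triple_flag_store() 
where a negative "deferred" bit means the file does not accept "defer"; all 
names here are stand-ins, not the real kernel code):

```c
#include <string.h>

#define MODEL_EINVAL 22	/* stand-in for the kernel's EINVAL */

/*
 * Model of triple_flag_store(): a caller that does not support "defer"
 * passes deferred_bit == -1, so the write fails with -EINVAL instead of
 * silently aliasing "defer" to "always".
 */
static int triple_flag_store_model(const char *buf, int enabled_bit,
				   int deferred_bit, int madv_bit,
				   unsigned long *flags)
{
	if (!strcmp(buf, "always")) {
		*flags = 1UL << enabled_bit;
	} else if (!strcmp(buf, "defer")) {
		if (deferred_bit < 0)
			return -MODEL_EINVAL;
		*flags = 1UL << deferred_bit;
	} else if (!strcmp(buf, "madvise")) {
		*flags = 1UL << madv_bit;
	} else if (!strcmp(buf, "never")) {
		*flags = 0;
	} else {
		return -MODEL_EINVAL;
	}
	return 0;
}
```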

> @@ -345,16 +347,23 @@ static ssize_t single_flag_store(struct kobject *kobj,
>   static ssize_t defrag_show(struct kobject *kobj,
>   			   struct kobj_attribute *attr, char *buf)
>   {
> -	return double_flag_show(kobj, attr, buf,
> -				TRANSPARENT_HUGEPAGE_DEFRAG_FLAG,
> -				TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG);
> +	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG, &transparent_hugepage_flags))
> +		return sprintf(buf, "[always] defer madvise never\n");
> +	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG, &transparent_hugepage_flags))
> +		return sprintf(buf, "always [defer] madvise never\n");
> +	else if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG, &transparent_hugepage_flags))
> +		return sprintf(buf, "always defer [madvise] never\n");
> +	else
> +		return sprintf(buf, "always defer madvise [never]\n");
> +
>   }
>   static ssize_t defrag_store(struct kobject *kobj,
>   			    struct kobj_attribute *attr,
>   			    const char *buf, size_t count)
>   {
> -	return double_flag_store(kobj, attr, buf, count,
> -				 TRANSPARENT_HUGEPAGE_DEFRAG_FLAG,
> +	return triple_flag_store(kobj, attr, buf, count,
> +				 TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG,
> +				 TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG,
>   				 TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG);
>   }
>   static struct kobj_attribute defrag_attr =
> @@ -784,9 +793,30 @@ static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
>   	return 0;
>   }
>
> -static inline gfp_t alloc_hugepage_gfpmask(int defrag, gfp_t extra_gfp)
> +/*
> + * If THP is set to always then directly reclaim/compact as necessary
> + * If set to defer then do no reclaim and defer to khugepaged
> + * If set to madvise and the VMA is flagged then directly reclaim/compact
> + */
> +static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma)
> +{
> +	gfp_t reclaim_flags = 0;
> +
> +	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG, &transparent_hugepage_flags) &&
> +	    (vma->vm_flags & VM_HUGEPAGE))
> +		reclaim_flags = __GFP_DIRECT_RECLAIM;
> +	else if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG, &transparent_hugepage_flags))
> +		reclaim_flags = __GFP_KSWAPD_RECLAIM;
> +	else if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG, &transparent_hugepage_flags))
> +		reclaim_flags = __GFP_DIRECT_RECLAIM;

Hmm, here's a trick question. What if I wanted direct reclaim for madvise() 
vma's and kswapd/kcompactd for others? Right now there's no such option, right? 
And expressing that with different values for a single tunable becomes ugly...

(For completeness, somebody could want kswapd/kcompactd defrag only for 
madvise() but that's arguably not much useful)
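
To illustrate, the priority chain quoted above can be modeled in userspace 
(flag bits and gfp values are made-up stand-ins): with only the REQ_MADV bit 
set, a non-madvised VMA gets no reclaim flags at all, so "direct reclaim for 
madvised VMAs, kswapd for the rest" cannot be selected by the sysfs file.

```c
/* Stand-ins for the defrag flag bits and gfp reclaim flags. */
enum { DEFRAG_DIRECT_BIT, DEFRAG_KSWAPD_BIT, DEFRAG_REQ_MADV_BIT };
#define MODEL_GFP_DIRECT_RECLAIM	0x1u
#define MODEL_GFP_KSWAPD_RECLAIM	0x2u

/* Model of the priority chain in alloc_hugepage_direct_gfpmask(). */
static unsigned int defrag_gfp_model(unsigned long flags, int vma_madvised)
{
	if (((flags >> DEFRAG_REQ_MADV_BIT) & 1UL) && vma_madvised)
		return MODEL_GFP_DIRECT_RECLAIM;
	else if ((flags >> DEFRAG_KSWAPD_BIT) & 1UL)
		return MODEL_GFP_KSWAPD_RECLAIM;
	else if ((flags >> DEFRAG_DIRECT_BIT) & 1UL)
		return MODEL_GFP_DIRECT_RECLAIM;
	return 0;
}
```

Note the chain itself would do what I describe if both the REQ_MADV and KSWAPD 
bits were set at once; it's the single-choice tunable that can't set that 
combination.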
