Message-ID: <Pine.LNX.4.64.0905011509460.28876@blonde.anvils>
Date:	Fri, 1 May 2009 15:28:47 +0100 (BST)
From:	Hugh Dickins <hugh@...itas.com>
To:	Mel Gorman <mel@....ul.ie>
cc:	Andrew Morton <akpm@...ux-foundation.org>,
	Andi Kleen <andi@...stfloor.org>,
	David Miller <davem@...emloft.net>, netdev@...r.kernel.org,
	linux-kernel@...r.kernel.org, linux-mm@...ck.org
Subject: Re: [PATCH mmotm] mm: alloc_large_system_hash check order

On Fri, 1 May 2009, Mel Gorman wrote:
> On Fri, May 01, 2009 at 12:30:03PM +0100, Hugh Dickins wrote:
> > 
> > Andrew noticed another oddity: that if it goes the hashdist __vmalloc()
> > way, it won't be limited by MAX_ORDER.  Makes one wonder whether it
> > ought to fall back to __vmalloc() if the alloc_pages_exact() fails.
> 
> I don't believe so. __vmalloc() is only used when hashdist= is used
> or on IA-64 (according to the documentation).

The doc is out of date: hashdist's default "on" was extended to include
x86_64 ages ago, and to all 64-bit in 2.6.30-rc.

> It is used in the case that the caller is
> willing to deal with the vmalloc() overhead (e.g. using base page PTEs) in
> exchange for the pages being interleaved on different nodes so that access
> to the hash table has average performance[*]
> 
> If we automatically fell back to vmalloc(), I bet 2c we'd eventually get
> a mysterious performance regression report for a workload that depended on
> the hash tables performance but that there was enough memory for the hash
> table to be allocated with vmalloc() instead of alloc_pages_exact().
> 
> [*] I speculate that on non-IA64 NUMA machines that we see different
>     performance for large filesystem benchmarks depending on whether we are
>     running on the boot-CPU node or not depending on whether hashdist=
>     is used or not.

Now that will be "32bit NUMA machines".  I was going to say that's
a tiny sample, but I'm probably out of touch.  I thought NUMA-Q was
on its way out, but see it still there in the tree.  And presumably
nowadays there's a great swing to NUMA on Arm or netbooks or something.

> 
> > I think that's a change we could make _if_ the large_system_hash
> > users ever ask for it, but _not_ one we should make surreptitiously.
> > 
> 
> If they want it, they'll have to ask with hashdist=.

That's quite a good argument for taking it out from under CONFIG_NUMA.
The name "hashdist" would then be absurd, but we could delight our
grandchildren with the story of how it came to be so named.

> Somehow I doubt it's specified very often :/ .

Our intuitions match!  Which is probably why it got extended.

> 
> Here is Take 2
> 
> ==== CUT HERE ====
> 
> Use alloc_pages_exact() in alloc_large_system_hash() to avoid duplicated logic V2
> 
> alloc_large_system_hash() has logic for freeing pages at the end
> of an excessively large power-of-two buffer that is a duplicate of what
> is in alloc_pages_exact(). This patch converts alloc_large_system_hash()
> to use alloc_pages_exact().
> 
> Signed-off-by: Mel Gorman <mel@....ul.ie>

Acked-by: Hugh Dickins <hugh@...itas.com>

> --- 
>  mm/page_alloc.c |   21 ++++-----------------
>  1 file changed, 4 insertions(+), 17 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 1b3da0f..8360d59 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4756,26 +4756,13 @@ void *__init alloc_large_system_hash(const char *tablename,
>  		else if (hashdist)
>  			table = __vmalloc(size, GFP_ATOMIC, PAGE_KERNEL);
>  		else {
> -			unsigned long order = get_order(size);
> -
> -			if (order < MAX_ORDER)
> -				table = (void *)__get_free_pages(GFP_ATOMIC,
> -								order);
>  			/*
>  			 * If bucketsize is not a power-of-two, we may free
> -			 * some pages at the end of hash table.
> +			 * some pages at the end of hash table which
> +			 * alloc_pages_exact() automatically does
>  			 */
> -			if (table) {
> -				unsigned long alloc_end = (unsigned long)table +
> -						(PAGE_SIZE << order);
> -				unsigned long used = (unsigned long)table +
> -						PAGE_ALIGN(size);
> -				split_page(virt_to_page(table), order);
> -				while (used < alloc_end) {
> -					free_page(used);
> -					used += PAGE_SIZE;
> -				}
> -			}
> +			if (get_order(size) < MAX_ORDER)
> +				table = alloc_pages_exact(size, GFP_ATOMIC);
>  		}
>  	} while (!table && size > PAGE_SIZE && --log2qty);
>  
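
To make the arithmetic behind the removed block concrete, here is a minimal
userspace sketch of how many tail pages get handed back when a size that is
not a power-of-two number of pages is served from a power-of-two block. It is
an illustration only, not kernel code: the sketch_* names and the 4096-byte
page size are assumptions made for the example, and sketch_get_order() is
just a stand-in for the kernel's get_order().

```c
/* Assumed page size for illustration; the real PAGE_SIZE is per-architecture. */
#define SKETCH_PAGE_SIZE 4096UL

/*
 * Hypothetical stand-in for get_order(): the smallest order such that
 * (SKETCH_PAGE_SIZE << order) >= size.
 * Assumes size is well below ULONG_MAX / 2 so the shift cannot overflow.
 */
static unsigned int sketch_get_order(unsigned long size)
{
	unsigned int order = 0;
	unsigned long span = SKETCH_PAGE_SIZE;

	while (span < size) {
		span <<= 1;
		order++;
	}
	return order;
}

/* Pages actually needed, i.e. PAGE_ALIGN(size) / PAGE_SIZE. */
static unsigned long sketch_pages_used(unsigned long size)
{
	return (size + SKETCH_PAGE_SIZE - 1) / SKETCH_PAGE_SIZE;
}

/*
 * Pages between PAGE_ALIGN(size) and the end of the power-of-two block.
 * This is the count the removed while loop walked with free_page().
 */
static unsigned long sketch_pages_trimmed(unsigned long size)
{
	return (1UL << sketch_get_order(size)) - sketch_pages_used(size);
}
```

Under these assumptions, a table of five pages rounds up to order 3 (eight
pages), so three tail pages go back to the allocator, while a table that is
already a power-of-two number of pages gives back none.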