Message-ID: <53462094.6000407@jp.fujitsu.com>
Date:	Thu, 10 Apr 2014 13:39:48 +0900
From:	Yasuaki Ishimatsu <isimatu.yasuaki@...fujitsu.com>
To:	Luiz Capitulino <lcapitulino@...hat.com>
CC:	<linux-mm@...ck.org>, <linux-kernel@...r.kernel.org>,
	<mtosatti@...hat.com>, <aarcange@...hat.com>, <mgorman@...e.de>,
	<akpm@...ux-foundation.org>, <andi@...stfloor.org>,
	<davidlohr@...com>, <rientjes@...gle.com>, <yinghai@...nel.org>,
	<riel@...hat.com>, <n-horiguchi@...jp.nec.com>
Subject: Re: [PATCH 5/5] hugetlb: add support for gigantic page allocation
 at runtime

(2014/04/10 2:56), Luiz Capitulino wrote:
> On Wed, 9 Apr 2014 09:42:01 +0900
> Yasuaki Ishimatsu <isimatu.yasuaki@...fujitsu.com> wrote:
>
>> (2014/04/09 4:02), Luiz Capitulino wrote:
>>> HugeTLB is limited to allocating hugepages whose order is less than
>>> MAX_ORDER. This is so because HugeTLB allocates hugepages via the
>>> buddy allocator. Gigantic pages (that is, pages whose order is greater
>>> than or equal to MAX_ORDER) have to be allocated at boottime.
>>>
>>> However, boottime allocation has at least two serious problems. First,
>>> it doesn't support NUMA; second, gigantic pages allocated at boottime
>>> can't be freed.
>>>
>>> This commit solves both issues by adding support for allocating gigantic
>>> pages at runtime. It works just like regular-sized hugepages, meaning
>>> that the interface in sysfs is the same, it supports NUMA,
>>> and gigantic pages can be freed.
>>>
>>> For example, on x86_64 gigantic pages are 1GB big. To allocate two 1G
>>> gigantic pages on node 1, one can do:
>>>
>>>    # echo 2 > \
>>>      /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages
>>>
>>> And to free them all:
>>>
>>>    # echo 0 > \
>>>      /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages
>>>
>>> The one problem with gigantic page allocation at runtime is that it
>>> can't be serviced by the buddy allocator. To overcome that problem, this
>>> commit scans all zones of a node looking for a large enough contiguous
>>> region and, when one is found, allocates it through CMA, that is, by
>>> calling alloc_contig_range(). For example, on x86_64 we scan all zones
>>> for a 1GB contiguous region and allocate it with alloc_contig_range().
>>>
>>> One expected issue with that approach is that such gigantic contiguous
>>> regions tend to vanish as uptime goes by. The best way to avoid this for
>>> now is to make gigantic page allocations very early during system boot, say
>>> from an init script. Other possible optimizations include using compaction,
>>> which is supported by CMA but is not explicitly used by this commit.
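
A minimal init-script sketch of that advice, reusing the sysfs path from the
examples above (the node number and page count are illustrative):

   #!/bin/sh
   # Reserve two 1GB gigantic pages on node 0 as early as possible,
   # before physical memory has had a chance to fragment.
   echo 2 > /sys/devices/system/node/node0/hugepages/hugepages-1048576kB/nr_hugepages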
>>>
>>> It's also important to note the following:
>>>
>>>    1. Gigantic pages allocated at boottime by the hugepages= command-line
>>>       option can be freed at runtime just fine (see the command-line
>>>       sketch after this list)
>>>
>>>    2. This commit adds support for gigantic pages only to x86_64. The
>>>       reason is that I don't have access to, nor experience with, other
>>>       archs. The code is arch independent though, so it should be simple
>>>       to add support for other archs
>>>
>>>    3. I didn't add support for hugepage overcommit, that is, allocating
>>>       a gigantic page on demand when
>>>       /proc/sys/vm/nr_overcommit_hugepages > 0. The reason is that I don't
>>>       think it's reasonable to do the hard and long work required to
>>>       allocate a gigantic page at fault time. But it should be simple
>>>       to add this if wanted
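
For item 1, the boot-time counterpart looks like this (hugepagesz= and
hugepages= are the documented kernel command-line parameters; the page
count is illustrative):

   hugepagesz=1G hugepages=2

Pages reserved this way can later be returned with the "echo 0" example above.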
>>>
>>> Signed-off-by: Luiz Capitulino <lcapitulino@...hat.com>
>>> ---
>>>    mm/hugetlb.c | 158 ++++++++++++++++++++++++++++++++++++++++++++++++++++++-----
>>>    1 file changed, 147 insertions(+), 11 deletions(-)
>>>
>>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>>> index 9dded98..2258045 100644
>>> --- a/mm/hugetlb.c
>>> +++ b/mm/hugetlb.c
>>> @@ -679,11 +679,141 @@ static int hstate_next_node_to_free(struct hstate *h, nodemask_t *nodes_allowed)
>>>    		((node = hstate_next_node_to_free(hs, mask)) || 1);	\
>>>    		nr_nodes--)
>>>
>>> +#if defined(CONFIG_CMA) && defined(CONFIG_X86_64)
>>> +static void destroy_compound_gigantic_page(struct page *page,
>>> +					unsigned long order)
>>> +{
>>> +	int i;
>>> +	int nr_pages = 1 << order;
>>> +	struct page *p = page + 1;
>>> +
>>> +	for (i = 1; i < nr_pages; i++, p = mem_map_next(p, page, i)) {
>>> +		__ClearPageTail(p);
>>> +		set_page_refcounted(p);
>>> +		p->first_page = NULL;
>>> +	}
>>> +
>>> +	set_compound_order(page, 0);
>>> +	__ClearPageHead(page);
>>> +}
>>> +
>>> +static void free_gigantic_page(struct page *page, unsigned order)
>>> +{
>>> +	free_contig_range(page_to_pfn(page), 1 << order);
>>> +}
>>> +
>>> +static int __alloc_gigantic_page(unsigned long start_pfn, unsigned long count)
>>> +{
>>> +	unsigned long end_pfn = start_pfn + count;
>>> +	return alloc_contig_range(start_pfn, end_pfn, MIGRATE_MOVABLE);
>>> +}
>>> +
>>> +static bool pfn_range_valid_gigantic(unsigned long start_pfn,
>>> +				unsigned long nr_pages)
>>> +{
>>> +	unsigned long i, end_pfn = start_pfn + nr_pages;
>>> +	struct page *page;
>>> +
>>> +	for (i = start_pfn; i < end_pfn; i++) {
>>> +		if (!pfn_valid(i))
>>> +			return false;
>>> +
>>> +		page = pfn_to_page(i);
>>> +
>>> +		if (PageReserved(page))
>>> +			return false;
>>> +
>>> +		if (page_count(page) > 0)
>>> +			return false;
>>> +
>>> +		if (PageHuge(page))
>>> +			return false;
>>> +	}
>>> +
>>> +	return true;
>>> +}
>>> +
>>> +static struct page *alloc_gigantic_page(int nid, unsigned order)
>>> +{
>>> +	unsigned long nr_pages = 1 << order;
>>> +	unsigned long ret, pfn, flags;
>>> +	struct zone *z;
>>> +
>>> +	z = NODE_DATA(nid)->node_zones;
>>> +	for (; z - NODE_DATA(nid)->node_zones < MAX_NR_ZONES; z++) {
>>> +		spin_lock_irqsave(&z->lock, flags);
>>> +
>>> +		pfn = ALIGN(z->zone_start_pfn, nr_pages);
>>> +		for (; pfn < zone_end_pfn(z); pfn += nr_pages) {
>>
>>> +			if (pfn_range_valid_gigantic(pfn, nr_pages)) {
>>
>> How about this? It reduces the indentation level:
>> 			if (!pfn_range_valid_gigantic(...))
>> 				continue;
>>
>> And I think the following check is necessary:
>> 			if (pfn + nr_pages >= zone_end_pfn(z))
>> 				break;
>
> You're right that we have to check if the whole range is within the zone,
> but the check above is off-by-one: zone_end_pfn() is exclusive, so a range
> ending exactly at zone_end_pfn(z) is still entirely inside the zone, yet
> that check would skip it. What about the following:
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 01c0068..b01cdeb 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -734,6 +734,13 @@ static bool pfn_range_valid_gigantic(unsigned long start_pfn,
>   	return true;
>   }
>
> +static bool zone_spans_last_pfn(const struct zone *zone,
> +			unsigned long start_pfn, unsigned long nr_pages)
> +{
> +	unsigned long last_pfn = start_pfn + nr_pages - 1;
> +	return zone_spans_pfn(zone, last_pfn);
> +}
> +
>   static struct page *alloc_gigantic_page(int nid, unsigned order)
>   {
>   	unsigned long nr_pages = 1 << order;
> @@ -745,7 +752,7 @@ static struct page *alloc_gigantic_page(int nid, unsigned order)
>   		spin_lock_irqsave(&z->lock, flags);
>
>   		pfn = ALIGN(z->zone_start_pfn, nr_pages);
> -		for (; pfn < zone_end_pfn(z); pfn += nr_pages) {
> +		while (zone_spans_last_pfn(z, pfn, nr_pages)) {
>   			if (pfn_range_valid_gigantic(pfn, nr_pages)) {
>   				/*
>   				 * We release the zone lock here because
> @@ -760,6 +767,7 @@ static struct page *alloc_gigantic_page(int nid, unsigned order)
>   					return pfn_to_page(pfn);
>   				spin_lock_irqsave(&z->lock, flags);
>   			}
> +			pfn += nr_pages;
>   		}
>
>   		spin_unlock_irqrestore(&z->lock, flags);
>
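
To make the off-by-one above concrete: with a zone ending at pfn 4096 and
nr_pages = 512, the candidate range [3584, 4096) fits entirely in the zone,
but pfn + nr_pages == 4096 would trip the ">=" check. A minimal userspace
sketch of the two conditions (the zone bounds and helper names here are
illustrative stand-ins, not kernel API):

#include <stdbool.h>
#include <stdio.h>

/* Illustrative stand-ins for one zone's bounds; in the kernel,
 * zone_end_pfn() is exclusive (the first pfn past the zone). */
static const unsigned long zone_start = 0;
static const unsigned long zone_end   = 4096;

/* The originally suggested stop condition (off by one). */
static bool stops_early(unsigned long pfn, unsigned long nr_pages)
{
	return pfn + nr_pages >= zone_end;
}

/* The fixed condition: the range fits iff the zone spans its last pfn. */
static bool zone_spans_last_pfn(unsigned long pfn, unsigned long nr_pages)
{
	unsigned long last_pfn = pfn + nr_pages - 1;
	return last_pfn >= zone_start && last_pfn < zone_end;
}

int main(void)
{
	unsigned long nr_pages = 512, pfn = 3584;

	/* The range [3584, 4096) lies entirely inside the zone ... */
	printf("fits in zone:  %d\n", zone_spans_last_pfn(pfn, nr_pages)); /* prints 1 */
	/* ... but the ">=" check would already have stopped the scan. */
	printf("stopped early: %d\n", stops_early(pfn, nr_pages));         /* prints 1 */
	return 0;
}

With the fix, the scan only stops once the last pfn of a candidate range
falls outside the zone.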

Looks good to me.

Thanks,
Yasuaki Ishimatsu

