linux-kernel - Re: [RFC PATCH] mm,memory_hotplug: Unlock 1GB-hugetlb on x86

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <c4fc87f2-9ff8-3bc1-e990-da97c56ba18f@oracle.com>
Date:   Thu, 21 Feb 2019 14:12:19 -0800
From:   Mike Kravetz <mike.kravetz@...cle.com>
To:     Oscar Salvador <osalvador@...e.de>, linux-mm@...ck.org
Cc:     linux-kernel@...r.kernel.org, mhocko@...e.com, david@...hat.com
Subject: Re: [RFC PATCH] mm,memory_hotplug: Unlock 1GB-hugetlb on x86_64

On 2/21/19 1:42 AM, Oscar Salvador wrote:
> On x86_64, 1GB-hugetlb pages could never be offlined due to the fact
> that hugepage_migration_supported() returned false for PUD_SHIFT.
> So whenever we wanted to offline a memblock containing a gigantic
> hugetlb page, we never got beyond has_unmovable_pages() check.
> This changed with [1], where now we also return true for PUD_SHIFT.
> 
> After that patch, the check in has_unmovable_pages() and scan_movable_pages()
> returned true, but we still had a final barrier in do_migrate_range():
> 
> if (compound_order(head) > PFN_SECTION_SHIFT) {
> 	ret = -EBUSY;
> 	break;
> }
> 
> This is not really nice, and we do not really need it.
> It is perfectly possible to migrate a gigantic page as long as another node has
> a spare gigantic page for us.
> In alloc_huge_page_nodemask(), we calculate the __real__ number of free pages,
> and if any, we try to dequeue one from another node.
> 
> This all works fine when we do have another node with a spare gigantic page,
> but if that is not the case, alloc_huge_page_nodemask() ends up calling
> alloc_migrate_huge_page() which bails out if the wanted page is gigantic.
> That is mainly because finding a 1GB (or even 16GB on powerpc) contiguous
> memory is quite unlikely when the system has been running for a while.

I suspect the reason for the check is that it was there before the ability
to migrate gigantic pages was added, and nobody thought to remove it.  As
you say, the likelihood of finding a gigantic page after running for some
time is not too good.  I wonder if we should remove that check?  Just trying
to create a gigantic page could result in a bunch of migrations which could
impact the system.  But, this is the result of a memory offline operation
which one would expect to have some negative impact.

> In that situation, we will keep looping forever because scan_movable_pages()
> will give us the same page and we will fail again because there is no node
> where we can dequeue a gigantic page from.
> This is not nice, and I wish we could differentiate a fatal error from a
> transient error in do_migrate_range()->migrate_pages(), but I do not really
> see a way now.

Michal may have some thoughts here.  Note that the repeat loop does not
even consider the return value from do_migrate_range().  Since this the the
result of an offline, I am thinking it was designed to retry forever.  But,
perhaps there are some errors/ret codes where we should give up?

> Anyway, I would tend say that this is the administrator's job, to make sure
> that the system can keep up with the memory to be offlined, so that would mean
> that if we want to use gigantic pages, make sure that the other nodes have at
> least enough gigantic pages to keep up in case we need to offline memory.
> 
> Just for the sake of completeness, this is one of the tests done:
> 
>  # echo 1 > /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages
>  # echo 1 > /sys/devices/system/node/node2/hugepages/hugepages-1048576kB/nr_hugepages
> 
>  # cat /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages
>    1
>  # cat /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/free_hugepages
>    1
> 
>  # cat /sys/devices/system/node/node2/hugepages/hugepages-1048576kB/nr_hugepages
>    1
>  # cat /sys/devices/system/node/node2/hugepages/hugepages-1048576kB/free_hugepages
>    1
> 
>  (hugetlb1gb is a program that maps 1GB region using MAP_HUGE_1GB)
> 
>  # numactl -m 1 ./hugetlb1gb
>  # cat /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/free_hugepages
>    0
>  # cat /sys/devices/system/node/node2/hugepages/hugepages-1048576kB/free_hugepages
>    1
> 
>  # offline node1 memory
>  # cat /sys/devices/system/node/node2/hugepages/hugepages-1048576kB/free_hugepages
>    0
> 
> [1] https://lore.kernel.org/patchwork/patch/998796/
> 
> Signed-off-by: Oscar Salvador <osalvador@...e.de>
> ---
>  mm/memory_hotplug.c | 7 +------
>  1 file changed, 1 insertion(+), 6 deletions(-)
> 
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index d5f7afda67db..04f6695b648c 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1337,8 +1337,7 @@ static unsigned long scan_movable_pages(unsigned long start, unsigned long end)
>  		if (!PageHuge(page))
>  			continue;
>  		head = compound_head(page);
> -		if (hugepage_migration_supported(page_hstate(head)) &&
> -		    page_huge_active(head))
> +		if (page_huge_active(head))

I'm confused as to why the removal of the hugepage_migration_supported()
check is required.  Seems that commit aa9d95fa40a2 ("mm/hugetlb: enable
arch specific huge page size support for migration") should make the check
work as desired for all architectures.
-- 
Mike Kravetz