linux-kernel - Re: [PATCH RFC 2/2] mm, hugetlb: do not rely on overcommit limit during migration

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20171129071737.vjg2sckpkzelifr2@dhcp22.suse.cz>
Date:   Wed, 29 Nov 2017 08:17:37 +0100
From:   Michal Hocko <mhocko@...nel.org>
To:     Mike Kravetz <mike.kravetz@...cle.com>
Cc:     linux-mm@...ck.org, Naoya Horiguchi <n-horiguchi@...jp.nec.com>,
        LKML <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH RFC 2/2] mm, hugetlb: do not rely on overcommit limit
 during migration

On Tue 28-11-17 17:39:50, Mike Kravetz wrote:
> On 11/28/2017 06:12 AM, Michal Hocko wrote:
> > From: Michal Hocko <mhocko@...e.com>
> > 
> > hugepage migration relies on __alloc_buddy_huge_page to get a new page.
> > This has 2 main disadvantages.
> > 1) it doesn't allow to migrate any huge page if the pool is used
> > completely which is not an exceptional case as the pool is static and
> > unused memory is just wasted.
> > 2) it leads to a weird semantic when migration between two numa nodes
> > might increase the pool size of the destination NUMA node while the page
> > is in use. The issue is caused by per NUMA node surplus pages tracking
> > (see free_huge_page).
> > 
> > Address both issues by changing the way how we allocate and account
> > pages allocated for migration. Those should temporal by definition.
> > So we mark them that way (we will abuse page flags in the 3rd page)
> > and update free_huge_page to free such pages to the page allocator.
> > Page migration path then just transfers the temporal status from the
> > new page to the old one which will be freed on the last reference.
> 
> In general, I think this will work.  Some questions below.
> 
> > The global surplus count will never change during this path but we still
> > have to be careful when freeing a page from a node with surplus pages
> > on the node.
> 
> Not sure about the "freeing page from a node with surplus pages" comment.
> If allocating PageHugeTemporary pages does not adjust surplus counts, then
> there should be no concern at the time of freeing.
> 
> Could this comment be a hold over from a previous implementation attempt?
> 

Not really. You have to realize that the original page could be surplus
on its node. More on that below.

[...]
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index 8189c92fac82..037bf0f89463 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -1283,7 +1283,13 @@ void free_huge_page(struct page *page)
> >  	if (restore_reserve)
> >  		h->resv_huge_pages++;
> >  
> > -	if (h->surplus_huge_pages_node[nid]) {
> > +	if (PageHugeTemporary(page)) {
> > +		list_del(&page->lru);
> > +		ClearPageHugeTemporary(page);
> > +		update_and_free_page(h, page);
> > +		if (h->surplus_huge_pages_node[nid])
> > +			h->surplus_huge_pages_node[nid]--;
> 
> I think this is not correct.  Should the lines dealing with per-node
> surplus counts even be here?  If the lines above are correct, then it
> implies that the sum of per node surplus counts could exceed (or get out
> of sync with) the global surplus count.

You are right, I guess. This per-node accounting makes the whole thing
real pain. I am worried that we will free next page from the same node
and reduce the overal pool size. I will think about it some more.

> > +	} else if (h->surplus_huge_pages_node[nid]) {
> >  		/* remove the page from active list */
> >  		list_del(&page->lru);
> >  		update_and_free_page(h, page);
> > @@ -1531,7 +1537,11 @@ int dissolve_free_huge_pages(unsigned long start_pfn, unsigned long end_pfn)
> >  	return rc;
> >  }
> >  
> > -static struct page *__alloc_buddy_huge_page(struct hstate *h, gfp_t gfp_mask,
> > +/*
> > + * Allocates a fresh surplus page from the page allocator. Temporary
> > + * requests (e.g. page migration) can pass enforce_overcommit == false
> 
> 'enforce_overcommit == false' perhaps part of an earlier implementation
> attempt?

yeah.

[...]

> > diff --git a/mm/migrate.c b/mm/migrate.c
> > index 4d0be47a322a..b3345f8174a9 100644
> > --- a/mm/migrate.c
> > +++ b/mm/migrate.c
> > @@ -1326,6 +1326,19 @@ static int unmap_and_move_huge_page(new_page_t get_new_page,
> >  		hugetlb_cgroup_migrate(hpage, new_hpage);
> >  		put_new_page = NULL;
> >  		set_page_owner_migrate_reason(new_hpage, reason);
> > +
> > +		/*
> > +		 * transfer temporary state of the new huge page. This is
> > +		 * reverse to other transitions because the newpage is going to
> > +		 * be final while the old one will be freed so it takes over
> > +		 * the temporary status.
> > +		 * No need for any locking here because destructor cannot race
> > +		 * with us.
> > +		 */
> > +		if (PageHugeTemporary(new_hpage)) {
> > +			SetPageHugeTemporary(hpage);
> > +			ClearPageHugeTemporary(new_hpage);
> > +		}
> >  	}
> >  
> >  	unlock_page(hpage);
> > 
> 
> I'm still trying to wrap my head around all the different scenarios.
> In general, this new code only 'kicks in' if the there is not a free
> pre-allocated huge page for migration.  Right?

yes

> So, if there are free huge pages they are 'consumed' during migration
> and the number of available pre-allocated huge pages is reduced?  Or,
> is that not exactly how it works?  Or does it depend in the purpose
> of the migration?

Well, if we have pre-allocated pages then we just consume them and they
will not get Temporary status so the additional code doesn't kick in.

> The only reason I ask is because this new method of allocating a surplus
> page (if successful) results in no decrease of available huge pages.
> Perhaps all migrations should attempt to allocate surplus pages and not
> impact the pre-allocated number of available huge pages.

That could reduce the chances of the migration success because
allocating a fresh huge page can fail.

-- 
Michal Hocko
SUSE Labs