linux-kernel - Re: [PATCH v2 5/9] mm/hugetlb: convert isolate_or_dissolve_huge

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <20230613232940.GA3898@monkey>
Date:   Tue, 13 Jun 2023 16:29:40 -0700
From:   Mike Kravetz <mike.kravetz@...cle.com>
To:     Matthew Wilcox <willy@...radead.org>
Cc:     Sidhartha Kumar <sidhartha.kumar@...cle.com>,
        linux-kernel@...r.kernel.org, linux-mm@...ck.org,
        akpm@...ux-foundation.org, songmuchun@...edance.com,
        almasrymina@...gle.com, linmiaohe@...wei.com,
        minhquangbui99@...il.com, aneesh.kumar@...ux.ibm.com
Subject: Re: [PATCH v2 5/9] mm/hugetlb: convert isolate_or_dissolve_huge_page
 to folios

On 06/12/23 16:34, Mike Kravetz wrote:
> On 06/12/23 18:41, Matthew Wilcox wrote:
> > On Tue, Nov 01, 2022 at 03:30:55PM -0700, Sidhartha Kumar wrote:
> > > +++ b/mm/hugetlb.c
> > > @@ -2815,7 +2815,7 @@ static int alloc_and_dissolve_huge_page(struct hstate *h, struct page *old_page,
> > >  int isolate_or_dissolve_huge_page(struct page *page, struct list_head *list)
> > >  {
> > >  	struct hstate *h;
> > > -	struct page *head;
> > > +	struct folio *folio = page_folio(page);
> > 
> > Is this safe?  I was reviewing a different patch today, and I spotted
> > this.  With THP, we can relatively easily hit this case:
> > 
> > struct page points to a page with pfn 0x40305, in a folio of order 2.
> > We call page_folio() on it and the resulting pointer is for the folio
> > with pfn 0x40304.
> > If we don't have our own refcount (or some other protection ...) against
> > freeing, the folio can now be freed and reallocated.  Say it's now part
> > of an order-3 folio.
> > Our 'folio' pointer is now actually a pointer to a tail page, and we
> > have various assertions that a folio pointer doesn't point to a tail
> > page, so they trigger.
> > 
> > It seems to me that this ...
> > 
> >         /*
> >          * The page might have been dissolved from under our feet, so make sure
> >          * to carefully check the state under the lock.
> >          * Return success when racing as if we dissolved the page ourselves.
> >          */
> >         spin_lock_irq(&hugetlb_lock);
> >         if (folio_test_hugetlb(folio)) {
> >                 h = folio_hstate(folio);
> >         } else {
> >                 spin_unlock_irq(&hugetlb_lock);
> >                 return 0;
> >         }
> > 
> > implies that we don't have our own reference on the folio, so we might
> > find a situation where the folio pointer we have is no longer a folio
> > pointer.
> 
> Your analysis is correct.
> 
> This is not safe because we hold no locks or references.  The folio
> pointer obtained via page_folio(page) may not be valid when calling
> folio_test_hugetlb(folio) and later.
> 
> My bad for the Reviewed-by: :(
> 

I was looking at this more closely and need a bit of clarification.  As
mentioned, your analysis is correct.  However, it appears that there is
other code doing:

folio = page_folio(page);
...
if (folio_test_hugetlb(folio))

without holding a folio ref or some type of lock.  split_huge_pages_all()
is one such example.

So, either this code has the same issue or there are folio routines that
can be called without holding a ref/lock.  The kerneldoc for
folio_test_hugetlb says "Caller should have a reference on the folio to
prevent it from being turned into a tail page.".  However, is that mostly
to make sure the returned value is consistent/valid?  Can it really lead
to an assert if folio pointer is changed to point to something else?

> > Maybe the page_folio() call should be moved inside the hugetlb_lock
> > protection?  Is that enough?  I don't know enough about how hugetlb
> > pages are split, freed & allocated to know what's going on.

Upon further thought, I think we should move the page_folio() inside the
lock just to be more correct.

> > 
> > But then we _drop_ the lock, and keep referring to ...
> > 
> > > @@ -2841,10 +2840,10 @@ int isolate_or_dissolve_huge_page(struct page *page, struct list_head *list)
> > >  	if (hstate_is_gigantic(h))
> > >  		return -ENOMEM;
> > >  
> > > -	if (page_count(head) && !isolate_hugetlb(head, list))
> > > +	if (folio_ref_count(folio) && !isolate_hugetlb(&folio->page, list))
> > >  		ret = 0;
> > > -	else if (!page_count(head))
> > > -		ret = alloc_and_dissolve_huge_page(h, head, list);
> > > +	else if (!folio_ref_count(folio))
> > > +		ret = alloc_and_dissolve_huge_page(h, &folio->page, list);
> 
> The above was OK when using struct page instead of folio.  The 'racy'
> part was getting the ref count on the head page.  It was OK because this
> was only a check to see if we should TRY to isolate or dissolve.  The
> code to actually isolate or dissolve would take the appropriate locks.

page_count() is doing 'folio_ref_count(page_folio(page));' and there I suspect
there are many places doing page_count without taking a page ref or locking.
So, it seems like this would also be safe?

> I'm afraid the code is now making even more use of a potentially invalid
> folio.  Here is how the above now looks in v6.3:
> 
> 	spin_unlock_irq(&hugetlb_lock);
> 
> 	/*
> 	 * Fence off gigantic pages as there is a cyclic dependency between
> 	 * alloc_contig_range and them. Return -ENOMEM as this has the effect
> 	 * of bailing out right away without further retrying.
> 	 */
> 	if (hstate_is_gigantic(h))
> 		return -ENOMEM;
> 
> 	if (folio_ref_count(folio) && isolate_hugetlb(folio, list))
> 		ret = 0;
> 	else if (!folio_ref_count(folio))
> 		ret = alloc_and_dissolve_hugetlb_folio(h, folio, list);
> 
> Looks like that potentially invalid folio is being passed to other
> routines.  Previous code would take lock and revalidate that struct page
> was still a hugetlb page.  We can not do the same with a folio.

Perhaps I spoke too soon.  Yes, we pass a potentially invalid folio
pointer to isolate_hugetlb() and alloc_and_dissolve_hugetlb_folio().
However, it seems the validation they perform should be sufficient.

bool isolate_hugetlb(struct folio *folio, struct list_head *list)
{
	bool ret = true;

	spin_lock_irq(&hugetlb_lock);
	if (!folio_test_hugetlb(folio) ||
	    !folio_test_hugetlb_migratable(folio) ||
	    !folio_try_get(folio)) {
		ret = false;
		goto unlock;


static int alloc_and_dissolve_hugetlb_folio(struct hstate *h,
			struct folio *old_folio, struct list_head *list)
{
	...
retry:
	spin_lock_irq(&hugetlb_lock);
	if (!folio_test_hugetlb(old_folio)) {
		...
	} else if (folio_ref_count(old_folio)) {
		...
	} else if (!folio_test_hugetlb_freed(old_folio)) {
		...
		goto retry;
	} else {
		/*
		 * Ok, old_folio is still a genuine free hugepage.

Upon further consideration, I do not see an issue with the existing
code.  If there are issues with calling folio_test_hugetlb() or
folio_ref_count() on a potentially invalid folio pointer, then we do
have issues here.  However, such an issue would be more widespread as
there is more code doing the same. 
-- 
Mike Kravetz