[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <20230613232940.GA3898@monkey>
Date: Tue, 13 Jun 2023 16:29:40 -0700
From: Mike Kravetz <mike.kravetz@...cle.com>
To: Matthew Wilcox <willy@...radead.org>
Cc: Sidhartha Kumar <sidhartha.kumar@...cle.com>,
linux-kernel@...r.kernel.org, linux-mm@...ck.org,
akpm@...ux-foundation.org, songmuchun@...edance.com,
almasrymina@...gle.com, linmiaohe@...wei.com,
minhquangbui99@...il.com, aneesh.kumar@...ux.ibm.com
Subject: Re: [PATCH v2 5/9] mm/hugetlb: convert isolate_or_dissolve_huge_page
to folios
On 06/12/23 16:34, Mike Kravetz wrote:
> On 06/12/23 18:41, Matthew Wilcox wrote:
> > On Tue, Nov 01, 2022 at 03:30:55PM -0700, Sidhartha Kumar wrote:
> > > +++ b/mm/hugetlb.c
> > > @@ -2815,7 +2815,7 @@ static int alloc_and_dissolve_huge_page(struct hstate *h, struct page *old_page,
> > > int isolate_or_dissolve_huge_page(struct page *page, struct list_head *list)
> > > {
> > > struct hstate *h;
> > > - struct page *head;
> > > + struct folio *folio = page_folio(page);
> >
> > Is this safe? I was reviewing a different patch today, and I spotted
> > this. With THP, we can relatively easily hit this case:
> >
> > struct page points to a page with pfn 0x40305, in a folio of order 2.
> > We call page_folio() on it and the resulting pointer is for the folio
> > with pfn 0x40304.
> > If we don't have our own refcount (or some other protection ...) against
> > freeing, the folio can now be freed and reallocated. Say it's now part
> > of an order-3 folio.
> > Our 'folio' pointer is now actually a pointer to a tail page, and we
> > have various assertions that a folio pointer doesn't point to a tail
> > page, so they trigger.
> >
> > It seems to me that this ...
> >
> > /*
> > * The page might have been dissolved from under our feet, so make sure
> > * to carefully check the state under the lock.
> > * Return success when racing as if we dissolved the page ourselves.
> > */
> > spin_lock_irq(&hugetlb_lock);
> > if (folio_test_hugetlb(folio)) {
> > h = folio_hstate(folio);
> > } else {
> > spin_unlock_irq(&hugetlb_lock);
> > return 0;
> > }
> >
> > implies that we don't have our own reference on the folio, so we might
> > find a situation where the folio pointer we have is no longer a folio
> > pointer.
>
> Your analysis is correct.
>
> This is not safe because we hold no locks or references. The folio
> pointer obtained via page_folio(page) may not be valid when calling
> folio_test_hugetlb(folio) and later.
>
> My bad for the Reviewed-by: :(
>
I was looking at this more closely and need a bit of clarification. As
mentioned, your analysis is correct. However, it appears that there is
other code doing:
folio = page_folio(page);
...
if (folio_test_hugetlb(folio))
without holding a folio ref or some type of lock. split_huge_pages_all()
is one such example.
So, either this code has the same issue or there are folio routines that
can be called without holding a ref/lock. The kerneldoc for
folio_test_hugetlb says "Caller should have a reference on the folio to
prevent it from being turned into a tail page.". However, is that mostly
to make sure the returned value is consistent/valid? Can it really lead
to an assert if folio pointer is changed to point to something else?
> > Maybe the page_folio() call should be moved inside the hugetlb_lock
> > protection? Is that enough? I don't know enough about how hugetlb
> > pages are split, freed & allocated to know what's going on.
Upon further thought, I think we should move the page_folio() inside the
lock just to be more correct.
> >
> > But then we _drop_ the lock, and keep referring to ...
> >
> > > @@ -2841,10 +2840,10 @@ int isolate_or_dissolve_huge_page(struct page *page, struct list_head *list)
> > > if (hstate_is_gigantic(h))
> > > return -ENOMEM;
> > >
> > > - if (page_count(head) && !isolate_hugetlb(head, list))
> > > + if (folio_ref_count(folio) && !isolate_hugetlb(&folio->page, list))
> > > ret = 0;
> > > - else if (!page_count(head))
> > > - ret = alloc_and_dissolve_huge_page(h, head, list);
> > > + else if (!folio_ref_count(folio))
> > > + ret = alloc_and_dissolve_huge_page(h, &folio->page, list);
>
> The above was OK when using struct page instead of folio. The 'racy'
> part was getting the ref count on the head page. It was OK because this
> was only a check to see if we should TRY to isolate or dissolve. The
> code to actually isolate or dissolve would take the appropriate locks.
page_count() is doing 'folio_ref_count(page_folio(page));' and there I suspect
there are many places doing page_count without taking a page ref or locking.
So, it seems like this would also be safe?
> I'm afraid the code is now making even more use of a potentially invalid
> folio. Here is how the above now looks in v6.3:
>
> spin_unlock_irq(&hugetlb_lock);
>
> /*
> * Fence off gigantic pages as there is a cyclic dependency between
> * alloc_contig_range and them. Return -ENOMEM as this has the effect
> * of bailing out right away without further retrying.
> */
> if (hstate_is_gigantic(h))
> return -ENOMEM;
>
> if (folio_ref_count(folio) && isolate_hugetlb(folio, list))
> ret = 0;
> else if (!folio_ref_count(folio))
> ret = alloc_and_dissolve_hugetlb_folio(h, folio, list);
>
> Looks like that potentially invalid folio is being passed to other
> routines. Previous code would take lock and revalidate that struct page
> was still a hugetlb page. We can not do the same with a folio.
Perhaps I spoke too soon. Yes, we pass a potentially invalid folio
pointer to isolate_hugetlb() and alloc_and_dissolve_hugetlb_folio().
However, it seems the validation they perform should be sufficient.
bool isolate_hugetlb(struct folio *folio, struct list_head *list)
{
bool ret = true;
spin_lock_irq(&hugetlb_lock);
if (!folio_test_hugetlb(folio) ||
!folio_test_hugetlb_migratable(folio) ||
!folio_try_get(folio)) {
ret = false;
goto unlock;
static int alloc_and_dissolve_hugetlb_folio(struct hstate *h,
struct folio *old_folio, struct list_head *list)
{
...
retry:
spin_lock_irq(&hugetlb_lock);
if (!folio_test_hugetlb(old_folio)) {
...
} else if (folio_ref_count(old_folio)) {
...
} else if (!folio_test_hugetlb_freed(old_folio)) {
...
goto retry;
} else {
/*
* Ok, old_folio is still a genuine free hugepage.
Upon further consideration, I do not see an issue with the existing
code. If there are issues with calling folio_test_hugetlb() or
folio_ref_count() on a potentially invalid folio pointer, then we do
have issues here. However, such an issue would be more widespread as
there is more code doing the same.
--
Mike Kravetz
Powered by blists - more mailing lists