linux-kernel - Re: [PATCH v2 2/3] mm/page_alloc: only free healthy pages in high-order HWPoison folio

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CACw3F52hZp+xC51bK=UfrfdAWdFEEqhsiJY2+0fzCLzB_3=XYg@mail.gmail.com>
Date: Fri, 26 Dec 2025 17:50:59 -0800
From: Jiaqi Yan <jiaqiyan@...gle.com>
To: Harry Yoo <harry.yoo@...cle.com>
Cc: jackmanb@...gle.com, hannes@...xchg.org, linmiaohe@...wei.com, 
	ziy@...dia.com, willy@...radead.org, nao.horiguchi@...il.com, 
	david@...hat.com, lorenzo.stoakes@...cle.com, william.roche@...cle.com, 
	tony.luck@...el.com, wangkefeng.wang@...wei.com, jane.chu@...cle.com, 
	akpm@...ux-foundation.org, osalvador@...e.de, muchun.song@...ux.dev, 
	rientjes@...gle.com, duenwen@...gle.com, jthoughton@...gle.com, 
	linux-mm@...ck.org, linux-kernel@...r.kernel.org, Liam.Howlett@...cle.com, 
	vbabka@...e.cz, rppt@...nel.org, surenb@...gle.com, mhocko@...e.com
Subject: Re: [PATCH v2 2/3] mm/page_alloc: only free healthy pages in
 high-order HWPoison folio

On Mon, Dec 22, 2025 at 9:14 PM Harry Yoo <harry.yoo@...cle.com> wrote:
>
> On Fri, Dec 19, 2025 at 06:33:45PM +0000, Jiaqi Yan wrote:
> > At the end of dissolve_free_hugetlb_folio that a free HugeTLB
> > folio becomes non-HugeTLB, it is released to buddy allocator
> > as a high-order folio, e.g. a folio that contains 262144 pages
> > if the folio was a 1G HugeTLB hugepage.
> >
> > This is problematic if the HugeTLB hugepage contained HWPoison
> > subpages. In that case, since buddy allocator does not check
> > HWPoison for non-zero-order folio, the raw HWPoison page can
> > be given out with its buddy page and be re-used by either
> > kernel or userspace.
> >
> > Memory failure recovery (MFR) in kernel does attempt to take
> > raw HWPoison page off buddy allocator after
> > dissolve_free_hugetlb_folio. However, there is always a time
> > window between dissolve_free_hugetlb_folio frees a HWPoison
> > high-order folio to buddy allocator and MFR takes HWPoison
> > raw page off buddy allocator.
> >
> > One obvious way to avoid this problem is to add page sanity
> > checks in page allocate or free path. However, it is against
> > the past efforts to reduce sanity check overhead [1,2,3].
> >
> > Introduce free_has_hwpoison_pages to only free the healthy
> > pages and excludes the HWPoison ones in the high-order folio.
> > The idea is to iterate through the sub-pages of the folio to
> > identify contiguous ranges of healthy pages. Instead of freeing
> > pages one by one, decompose healthy ranges into the largest
> > possible blocks. Each block meets the requirements to be freed
> > to buddy allocator (__free_frozen_pages).
> >
> > free_has_hwpoison_pages has linear time complexity O(N) wrt the
> > number of pages in the folio. While the power-of-two decomposition
> > ensures that the number of calls to the buddy allocator is
> > logarithmic for each contiguous healthy range, the mandatory
> > linear scan of pages to identify PageHWPoison defines the
> > overall time complexity.
>
> Hi Jiaqi, thanks for the patch!

Thanks for your review/comments!

>
> Have you tried measuring the latency of free_has_hwpoison_pages() when
> a few pages in a 1GB folio are hwpoisoned?
>
> Just wanted to make sure we don't introduce a possible soft lockup...
> Or am I worrying too much?

In my local tests, freeing a 1GB folio with 1 / 3 / 8 HWPoison pages,
I never run into a soft lockup. The 8-HWPoison-page case takes more
time than other cases, meaning that handling the additional HWPoison
page adds to the time cost.

After adding some instrument code, 10 sample runs of
free_has_hwpoison_pages with 8 HWPoison pages:
- observed mean is 7.03 ms (5.97 ms when 3 HWPoison pages)
- observed standard deviation is 0.76 ms (0.18 ms when 3 HWPoison pages)

In comparison, freeing a 1G folio without any HWPoison pages 10 times
(with same kernel config):
- observed mean is 3.39 ms
- observed standard deviation is 0.16ms

So it's around twice the baseline. It should be far from triggering a
soft lockup, and the cost seems fair for handling exceptional hardware
memory errors.

I can add these measurements in future revisions.

>
> > [1] https://lore.kernel.org/linux-mm/1460711275-1130-15-git-send-email-mgorman@techsingularity.net/
> > [2] https://lore.kernel.org/linux-mm/1460711275-1130-16-git-send-email-mgorman@techsingularity.net/
> > [3] https://lore.kernel.org/all/20230216095131.17336-1-vbabka@suse.cz
> >
> > Signed-off-by: Jiaqi Yan <jiaqiyan@...gle.com>
> > ---
> >  mm/page_alloc.c | 101 ++++++++++++++++++++++++++++++++++++++++++++++++
> >  1 file changed, 101 insertions(+)
> >
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 822e05f1a9646..20c8862ce594e 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -2976,8 +2976,109 @@ static void __free_frozen_pages(struct page *page, unsigned int order,
> >       }
> >  }
> >
> > +static void prepare_compound_page_to_free(struct page *new_head,
> > +                                       unsigned int order,
> > +                                       unsigned long flags)
> > +{
> > +     new_head->flags.f = flags & (~PAGE_FLAGS_CHECK_AT_FREE);
> > +     new_head->mapping = NULL;
> > +     new_head->private = 0;
> > +
> > +     clear_compound_head(new_head);
> > +     if (order)
> > +             prep_compound_page(new_head, order);
> > +}
>
> Not sure why it's building compound pages, just to decompose them
> when freeing via __free_frozen_pages()?

prepare_compound_page_to_free() borrowed the idea from
__split_folio_to_order(). Conceptually the original folio is split
into new compound pages with different orders; here this is done on
the fly in free_contiguous_pages() when order is decided.
__free_frozen_pages() is also happy with a compound page when order >
0, as I tested with free_pages_prepare before calling
__free_frozen_pages().

>
> If you intended to reset compound head & tails, I think it's more
> readable to decompose the whole compound page at once and not build
> compound pages when freeing it?

I don't think prepare_compound_page_to_free() is that hard to read,
but I'm open to more opinions.

>
> > +/*
> > + * Given a range of pages physically contiguous physical, efficiently
> > + * free them in blocks that meet __free_frozen_pages's requirements.
> > + */
> > +static void free_contiguous_pages(struct page *curr, struct page *next,
> > +                               unsigned long flags)
> > +{
> > +     unsigned int order;
> > +     unsigned int align_order;
> > +     unsigned int size_order;
> > +     unsigned long pfn;
> > +     unsigned long end_pfn = page_to_pfn(next);
> > +     unsigned long remaining;
> > +
> > +     /*
> > +      * This decomposition algorithm at every iteration chooses the
> > +      * order to be the minimum of two constraints:
> > +      * - Alignment: the largest power-of-two that divides current pfn.
> > +      * - Size: the largest power-of-two that fits in the
> > +      *   current remaining number of pages.
> > +      */
> > +     while (curr < next) {
> > +             pfn = page_to_pfn(curr);
> > +             remaining = end_pfn - pfn;
> > +
> > +             align_order = ffs(pfn) - 1;
> > +             size_order = fls_long(remaining) - 1;
> > +             order = min(align_order, size_order);
> > +
> > +             prepare_compound_page_to_free(curr, order, flags);
> > +             __free_frozen_pages(curr, order, FPI_NONE);
> > +             curr += (1UL << order);
> > +     }
> > +
> > +     VM_WARN_ON(curr != next);
> > +}
> > +
> > +/*
> > + * Given a high-order compound page containing certain number of HWPoison
> > + * pages, free only the healthy ones to buddy allocator.
> > + *
> > + * It calls __free_frozen_pages O(2^order) times and cause nontrivial
> > + * overhead. So only use this when compound page really contains HWPoison.
> > + *
> > + * This implementation doesn't work in memdesc world.
> > + */
> > +static void free_has_hwpoison_pages(struct page *page, unsigned int order)
> > +{
> > +     struct page *curr = page;
> > +     struct page *end = page + (1 << order);
> > +     struct page *next;
> > +     unsigned long flags = page->flags.f;
> > +     unsigned long nr_pages;
> > +     unsigned long total_freed = 0;
> > +     unsigned long total_hwp = 0;
> > +
> > +     VM_WARN_ON(flags & PAGE_FLAGS_CHECK_AT_FREE);
> > +
> > +     while (curr < end) {
> > +             next = curr;
> > +             nr_pages = 0;
> > +
> > +             while (next < end && !PageHWPoison(next)) {
> > +                     ++next;
> > +                     ++nr_pages;
> > +             }
> > +
> > +             if (PageHWPoison(next))
> > +                     ++total_hwp;
> > +
> > +             free_contiguous_pages(curr, next, flags);
>
> page_owner, memory profiling (anything else?) will be confused
> because it was allocated as a larger size, but we're freeing only
> some portion of it.

I am not sure, but looking at __split_unmapped_folio, it calls
pgalloc_tag_split(folio, old_order, split_order) when splitting an
old_order-order folio into a new split_order.

Maybe prepare_compound_page_to_free really should
update_page_tag_ref(), I need to take a closer look at this with
CONFIG_MEM_ALLOC_PROFILING (not something I usually enable).

>
> Perhaps we need to run some portion of this code snippet
> (from free_pages_prepare()), before freeing portions of it:
>
>         page_cpupid_reset_last(page);
>         page->flags.f &= ~PAGE_FLAGS_CHECK_AT_PREP;
>         reset_page_owner(page, order);
>         page_table_check_free(page, order);
>         pgalloc_tag_sub(page, 1 << order);

Since they come from free_pages_prepare, I believe these lines are
already executed via free_contiguous_pages()=> __free_frozen_pages()=>
free_pages_prepare(), right? Or am I missing something?

>
> > +             total_freed += nr_pages;
> > +             curr = PageHWPoison(next) ? next + 1 : next;
> > +     }
> > +
> > +     pr_info("Excluded %lu hwpoison pages from folio\n", total_hwp);
> > +     pr_info("Freed %#lx pages from folio\n", total_freed);
> > +}
> > +
> >  void free_frozen_pages(struct page *page, unsigned int order)
> >  {
> > +     struct folio *folio = page_folio(page);
> > +
> > +     if (order > 0 && unlikely(folio_test_has_hwpoisoned(folio))) {
> > +             folio_clear_has_hwpoisoned(folio);
> > +             free_has_hwpoison_pages(page, order);
> > +             return;
> > +     }
> > +
>
> It feels like it's a bit random place to do has_hwpoisoned check.
> Can we move this to free_pages_prepare() where we have some
> sanity checks (and also order-0 hwpoison page handling)?

While free_pages_prepare() seems to be a better place to do the
has_hwpoisoned check, it is not a good place to do
free_has_hwpoison_pages().


>
> >       __free_frozen_pages(page, order, FPI_NONE);
> >  }
> >
> > --
> > 2.52.0.322.g1dd061c0dc-goog
>
> --
> Cheers,
> Harry / Hyeonggon