Message-ID: <aR257PivQXpEGbKb@hyeyoo>
Date: Wed, 19 Nov 2025 21:37:00 +0900
From: Harry Yoo <harry.yoo@...cle.com>
To: Zi Yan <ziy@...dia.com>
Cc: Jiaqi Yan <jiaqiyan@...gle.com>, Matthew Wilcox <willy@...radead.org>,
        david@...hat.com, Vlastimil Babka <vbabka@...e.cz>,
        nao.horiguchi@...il.com, linmiaohe@...wei.com,
        lorenzo.stoakes@...cle.com, william.roche@...cle.com,
        tony.luck@...el.com, wangkefeng.wang@...wei.com, jane.chu@...cle.com,
        akpm@...ux-foundation.org, osalvador@...e.de, muchun.song@...ux.dev,
        linux-mm@...ck.org, linux-kernel@...r.kernel.org,
        linux-fsdevel@...r.kernel.org, Michal Hocko <mhocko@...e.com>,
        Suren Baghdasaryan <surenb@...gle.com>,
        Brendan Jackman <jackmanb@...gle.com>,
        Johannes Weiner <hannes@...xchg.org>
Subject: Re: [PATCH v1 1/2] mm/huge_memory: introduce
 uniform_split_unmapped_folio_to_zero_order

On Tue, Nov 18, 2025 at 04:54:31PM -0500, Zi Yan wrote:
> On 18 Nov 2025, at 14:26, Jiaqi Yan wrote:
> 
> > On Tue, Nov 18, 2025 at 2:20 AM Harry Yoo <harry.yoo@...cle.com> wrote:
> >>
> >> On Mon, Nov 17, 2025 at 10:24:27PM -0800, Jiaqi Yan wrote:
> >>> On Mon, Nov 17, 2025 at 5:43 AM Matthew Wilcox <willy@...radead.org> wrote:
> >>>>
> >>>> On Mon, Nov 17, 2025 at 12:15:23PM +0900, Harry Yoo wrote:
> >>>>> On Sun, Nov 16, 2025 at 11:51:14AM +0000, Matthew Wilcox wrote:
> >>>>>> But since we're only doing this on free, we won't need to do folio
> >>>>>> allocations at all; we'll just be able to release the good pages to the
> >>>>>> page allocator and sequester the hwpoison pages.
> >>>>>
> >>>>> [+Cc PAGE ALLOCATOR folks]
> >>>>>
> >>>>> So we need an interface to free only the healthy portion of a hwpoison folio.
> >>>
> >>> +1, with some of my own thoughts below.
> >>>
> >>>>>
> >>>>> I think a proper approach to this should be to "free a hwpoison folio
> >>>>> just like freeing a normal folio via folio_put() or free_frozen_pages(),
> >>>>> then the page allocator will add only healthy pages to the freelist and
> >>>>> isolate the hwpoison pages". Otherwise we'll end up open coding a lot,
> >>>>> which is too fragile.
> >>>>
> >>>> Yes, I think it should be handled by the page allocator.  There may be
> >>>
> >>> I agree with Matthew, Harry, and David. The page allocator seems best
> >>> suited to handle HWPoison subpages without any new folio allocations.
> >>
> >> Sorry, I should have been clearer. I don't think adding an **explicit**
> >> interface to free an hwpoison folio is worth it; instead, handling it
> >> implicitly during freeing of a folio seems more feasible.
> >
> > That's fine with me, just more to be taken care of by page allocator.
> >
> >>
> >>>> some complexity to this that I've missed, e.g. if hugetlb wants to
> >>>> retain the good 2MB chunks of a 1GB allocation.  I'm not sure whether
> >>>> that's a useful thing to do or not.
> >>>>
> >>>>> In fact, that can be done by teaching free_pages_prepare() how to handle
> >>>>> the case where one or more subpages of a folio are hwpoison pages.
> >>>>>
> >>>>> How should this be implemented in the page allocator in the memdescs world?
> >>>>> Hmm, we'll want to do some kind of non-uniform split, without actually
> >>>>> splitting the folio but allocating struct buddy?
> >>>>
> >>>> Let me sketch that out, realising that it's subject to change.
> >>>>
> >>>> A page in buddy state can't need a memdesc allocated.  Otherwise we're
> >>>> allocating memory to free memory, and that way lies madness.  We can't
> >>>> do the hack of "embed struct buddy in the page that we're freeing"
> >>>> because HIGHMEM.  So we'll never shrink struct page smaller than struct
> >>>> buddy (which is fine because I've laid out how to get to a 64 bit struct
> >>>> buddy, and we're probably two years from getting there anyway).
> >>>>
> >>>> My design for handling hwpoison is that we do allocate a struct hwpoison
> >>>> for a page.  It looks like this (for now, in my head):
> >>>>
> >>>> struct hwpoison {
> >>>>         memdesc_t original;
> >>>>         ... other things ...
> >>>> };
> >>>>
> >>>> So we can replace the memdesc in a page with a hwpoison memdesc when we
> >>>> encounter the error.  We still need a folio flag to indicate that "this
> >>>> folio contains a page with hwpoison".  I haven't put much thought yet
> >>>> into interaction with HUGETLB_PAGE_OPTIMIZE_VMEMMAP; maybe "other things"
> >>>> includes an index of where the actually poisoned page is in the folio,
> >>>> so it doesn't matter if the pages alias with each other as we can recover
> >>>> the information when it becomes useful to do so.
> >>>>
> >>>>> But... for now I think hiding this complexity inside the page allocator
> >>>>> is good enough. For now this would just mean splitting a frozen page
> >>>
> >>> I want to add one more thing. For HugeTLB, the kernel clears the HWPoison
> >>> flag on the folio and moves it to each raw page in the raw_hwp_page list
> >>> (see folio_clear_hugetlb_hwpoison). So the page allocator has no hint that
> >>> some pages passed into free_frozen_pages have HWPoison. It has to
> >>> traverse 2^order pages to tell, if I am not mistaken, which goes
> >>> against the past effort to reduce sanity checks. I believe this is one
> >>> reason I chose to handle the problem in hugetlb / memory-failure.
> >>
> >> I think we can skip calling folio_clear_hugetlb_hwpoison() and teach the
> >
> > Nit: also skip folio_free_raw_hwp so the hugetlb-specific llist
> > containing the raw pages and owned by memory-failure is preserved? And
> > expect the page allocator to use it for whatever purpose and then free
> > the llist? That doesn't seem to follow the correct ownership rules.
> >
> >> buddy allocator to handle this. free_pages_prepare() already handles
> >> (PageHWPoison(page) && !order) case, we just need to extend that to
> >> support hugetlb folios as well.
> >>
> >>> For the new interface Harry requested, is it the caller's
> >>> responsibility to ensure that the folio contains HWPoison pages (or,
> >>> even better, to point out the exact ones?), so that the page
> >>> allocator at least doesn't waste cycles searching for non-existent
> >>> HWPoison pages in the set?
> >>
> >> With implicit handling it would be the page allocator's responsibility
> >> to check and handle hwpoison hugetlb folios.
> >
> > Does this mean we must bake hugetlb-specific logic into the page
> > allocator's freeing path? AFAICT today the contract of
> > free_frozen_pages doesn't involve much hugetlb info.
> > 
> > I saw there is already some hugetlb-specific logic in page_alloc.c,
> > but perhaps that doesn't justify adding more.
> >
> >>
> >>> Or do the caller and the page allocator need to agree on some
> >>> contract? Say, the caller has to set the has_hwpoisoned flag on a
> >>> non-zero-order folio before freeing it. This gives the old
> >>> free_frozen_pages interface an easy check via the has_hwpoisoned
> >>> flag on the second page. I know has_hwpoisoned is "#if defined"
> >>> for THP and using it for hugetlb is probably not very clean, but
> >>> are there other concerns?
> >>
> >> As you mentioned, has_hwpoisoned is used for THPs; for a hugetlb
> >> folio, on the other hand, folio_test_hwpoison() returns true if it
> >> has at least one hwpoison page (assuming that we don't clear it
> >> before freeing).
> >>
> >> So in free_pages_prepare():
> >>
> >> if (folio_test_hugetlb(folio) && folio_test_hwpoison(folio)) {
> >>   /*
> >>    * Handle hwpoison hugetlb folios; transfer the error information
> >>    * to individual pages, clear hwpoison flag of the folio,
> >>    * perform non-uniform split on the frozen folio.
> >>    */
> >> } else if (PageHWPoison(page) && !order) {
> >>   /* We already handle this in the allocator. */
> >> }
> >>
> >> Would this be sufficient?
> >
> > Wouldn't this confuse the page allocator into thinking the healthy
> > head page is HWPoison (when it actually isn't)? I thought that was one
> > of the reasons has_hwpoisoned exists.

AFAICT in the current code we don't set PG_hwpoison on individual
pages for hugetlb folios, so it won't confuse the page allocator.
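
(Instead, the error info is kept on a per-folio llist of raw pages;
IIRC the entry in mm/memory-failure.c is just:

struct raw_hwp_page {
	struct llist_node node;
	struct page *page;	/* the raw hwpoisoned base page */
};

so there is nothing on the individual struct pages for the allocator
to trip over.)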

> Is there a reason why hugetlb does not use has_hwpoison flag?

But yeah sounds like hugetlb is quite special here :)

I don't see why we shouldn't use has_hwpoisoned. I think it's fine to
set it on hwpoison hugetlb folios in folio_clear_hugetlb_hwpoison()
and check the flag in the page allocator!
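
A minimal sketch (untested; the exact spot inside
folio_clear_hugetlb_hwpoison() is just my assumption):

	/* keep transferring PG_hwpoison to the raw pages as today */
	folio_free_raw_hwp(folio, true);
	folio_clear_hwpoison(folio);
	/*
	 * ...but leave a hint so that free_pages_prepare() knows at
	 * least one base page in this folio is poisoned.
	 */
	folio_set_has_hwpoisoned(folio);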

And since the split code has to scan base pages anyway to check whether
there is an hwpoison page in each new folio created by the split (as
Zi Yan mentioned), I think it's fine to keep calling
folio_free_raw_hwp() in folio_clear_hugetlb_hwpoison() (rather than
preserving the raw hwp llist), set has_hwpoisoned as well, and then
scan pages in free_pages_prepare() when we know has_hwpoisoned is set.

That should address Jiaqi's concern about adding hugetlb-specific code
to the page allocator.

So summing up:

1. Transfer raw hwp list to individual pages by setting PG_hwpoison
   (that's done in folio_clear_hugetlb_hwpoison()->folio_free_raw_hwp()!)

2. Set has_hwpoisoned in folio_clear_hugetlb_hwpoison()

3. Check has_hwpoisoned in free_pages_prepare() and if it's set,
   iterate over all base pages and do non-uniform split by calling
   __split_unmapped_folio() at each hwpoisoned page (a rough sketch
   follows after this list).

   I think it's fine to iterate over base pages and check hwpoison flag
   as long as we do that only when we know there's an hwpoison page.

   But maybe we need to dispatch the job to a workqueue as Zi Yan said,
   because it'll take a while to iterate 512 * 512 pages when we're freeing
   1GB hugetlb folios.

4. Optimize __split_unmapped_folio() as suggested by Zi Yan below.
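
To make steps 2 and 3 concrete, roughly (an untested sketch;
split_frozen_folio_at() is a made-up placeholder for however we end up
invoking __split_unmapped_folio() on a frozen folio):

	/* in free_pages_prepare(): */
	struct folio *folio = page_folio(page);

	if (unlikely(order && folio_test_has_hwpoisoned(folio))) {
		long i;

		folio_clear_has_hwpoisoned(folio);
		/*
		 * For a 1GB hugetlb folio this loop is 512 * 512
		 * iterations, hence the workqueue idea above.
		 */
		for (i = 0; i < (1L << order); i++) {
			struct page *p = folio_page(folio, i);

			/* PG_hwpoison was set per base page in step 1 */
			if (PageHWPoison(p))
				split_frozen_folio_at(folio, p);
		}
		/*
		 * Healthy after-split folios go to the freelist; the
		 * poisoned order-0 pages are caught by the existing
		 * (PageHWPoison(page) && !order) path when freed.
		 */
	}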

BTW I think we have to discard folios that have hwpoison pages
when we fail to split some parts? (we don't have to discard all of them,
but we may have managed to split some parts while other parts failed)
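
For example (again just a sketch, reusing the placeholder helper from
the sketch above):

	/*
	 * The chunks already split off are fine to free, but if a split
	 * fails, the remaining chunk still contains a poisoned page and
	 * must never reach the freelist; sequester/leak it instead.
	 */
	if (split_frozen_folio_at(folio, p))
		return false;	/* don't free this chunk */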

-- 
Cheers,
Harry / Hyeonggon

> BTW, __split_unmapped_folio() currently sets has_hwpoisoned on the
> after-split folios by scanning every single page in the to-be-split folio.
>
> The related code is in __split_folio_to_order(). But the code is not
> efficient for non-uniform split, since it calls __split_folio_to_order()
> multiple times, meaning when non-uniformly splitting order-N to order-0,
> 2^(N-1) pages are scanned once, 2^(N-2) pages are scanned twice,
> 2^(N-3) pages are scanned 3 times, ..., 4 pages are scanned N-2 times.
> It can be optimized with some additional code in __split_folio_to_order().
> 
> Something like the patch below, it assumes PageHWPoison(split_at) == true:
> 
> From 219466f5d5edc4e8bf0e5402c5deffb584c6a2a0 Mon Sep 17 00:00:00 2001
> From: Zi Yan <ziy@...dia.com>
> Date: Tue, 18 Nov 2025 14:55:36 -0500
> Subject: [PATCH] mm/huge_memory: optimize hwpoison page scan
> 
> Signed-off-by: Zi Yan <ziy@...dia.com>
> ---
>  mm/huge_memory.c | 13 ++++++++-----
>  1 file changed, 8 insertions(+), 5 deletions(-)
> 
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index d716c6965e27..54a933a20f1b 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -3233,8 +3233,11 @@ bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins)
>  					caller_pins;
>  }
> 
> -static bool page_range_has_hwpoisoned(struct page *page, long nr_pages)
> +static bool page_range_has_hwpoisoned(struct page *page, long nr_pages, struct page *donot_scan)
>  {
> +	if (donot_scan && donot_scan >= page && donot_scan < page + nr_pages)
> +		return false;
> +
>  	for (; nr_pages; page++, nr_pages--)
>  		if (PageHWPoison(page))
>  			return true;
> @@ -3246,7 +3249,7 @@ static bool page_range_has_hwpoisoned(struct page *page, long nr_pages)
>   * all the resulting folios.
>   */
>  static void __split_folio_to_order(struct folio *folio, int old_order,
> -		int new_order)
> +		int new_order, struct page *donot_scan)
>  {
>  	/* Scan poisoned pages when split a poisoned folio to large folios */
>  	const bool handle_hwpoison = folio_test_has_hwpoisoned(folio) && new_order;
> @@ -3258,7 +3261,7 @@ static void __split_folio_to_order(struct folio *folio, int old_order,
> 
>  	/* Check first new_nr_pages since the loop below skips them */
>  	if (handle_hwpoison &&
> -	    page_range_has_hwpoisoned(folio_page(folio, 0), new_nr_pages))
> +	    page_range_has_hwpoisoned(folio_page(folio, 0), new_nr_pages, donot_scan))
>  		folio_set_has_hwpoisoned(folio);
>  	/*
>  	 * Skip the first new_nr_pages, since the new folio from them have all
> @@ -3308,7 +3311,7 @@ static void __split_folio_to_order(struct folio *folio, int old_order,
>  				 LRU_GEN_MASK | LRU_REFS_MASK));
> 
>  		if (handle_hwpoison &&
> -		    page_range_has_hwpoisoned(new_head, new_nr_pages))
> +		    page_range_has_hwpoisoned(new_head, new_nr_pages, donot_scan))
>  			folio_set_has_hwpoisoned(new_folio);
> 
>  		new_folio->mapping = folio->mapping;
> @@ -3438,7 +3441,7 @@ static int __split_unmapped_folio(struct folio *folio, int new_order,
>  		folio_split_memcg_refs(folio, old_order, split_order);
>  		split_page_owner(&folio->page, old_order, split_order);
>  		pgalloc_tag_split(folio, old_order, split_order);
> -		__split_folio_to_order(folio, old_order, split_order);
> +		__split_folio_to_order(folio, old_order, split_order, uniform_split ? NULL : split_at);
> 
>  		if (is_anon) {
>  			mod_mthp_stat(old_order, MTHP_STAT_NR_ANON, -1);
> -- 
> 2.51.0
> 
> >> Or do we want to handle THPs as well, in case of split failure in
> >> memory_failure()? If so, we need to handle the
> >> folio_test_has_hwpoisoned() case as well...
> >
> > Yeah, I think this is another good use case for our request to the page allocator.
> >
> >>
> >>>>> inside the page allocator (probably non-uniform?). We can later re-implement
> >>>>> this to provide better support for memdescs.
> >>>>
> >>>> Yes, I like this approach.  But then I'm not the page allocator
> >>>> maintainer ;-)
> >>>
> >>> If page allocator maintainers can weigh in here, that will be very helpful!
> >>
> >> Yeah, I'm not a maintainer either ;) it'll be great to get opinions
> >> from page allocator folks!
> 
> I think this is a good approach as long as it does not add too much overhead
> on the page freeing path. Otherwise dispatch the job to a workqueue?
> 
> Best Regards,
> Yan, Zi
