Message-ID: <4391e3f5-e0a5-4920-bd50-05337b7764e7@gmail.com>
Date: Fri, 22 Aug 2025 17:50:47 +0400
From: Giorgi Tchankvetadze <giorgitchankvetadze1997@...il.com>
To: lirongqing@...du.com
Cc: akpm@...ux-foundation.org, david@...hat.com,
linux-kernel@...r.kernel.org, linux-mm@...ck.org, muchun.song@...ux.dev,
osalvador@...e.de, xuwenjie04@...du.com
Subject: Re: [PATCH] mm/hugetlb: two-phase hugepage allocation when
reservation is high
Hi there. The 90% split is solid. Would it make sense to (a) log a
one-time warning when the second pass is triggered, so operators know why
boot slowed, and (b) make the 90% cap a Kconfig default ratio, so
distros can lower it without patching? Both are low-risk and don't
change the ABI.
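
For concreteness, a rough sketch of what (a) and (b) could look like on
top of your hunk; the CONFIG_HUGETLB_ALLOC_SPLIT_RATIO symbol and its
Kconfig entry are made-up names for illustration, not existing options:

	/*
	 * Hypothetical Kconfig entry, e.g. in mm/Kconfig:
	 *
	 *   config HUGETLB_ALLOC_SPLIT_RATIO
	 *           int "Percent of RAM above which boot-time hugepage allocation is split"
	 *           range 50 100
	 *           default 90
	 */
	total_pages = totalram_pages() * CONFIG_HUGETLB_ALLOC_SPLIT_RATIO / 100;
	if (huge_reserved_pages > total_pages) {
		/* (a) one-time hint so operators know why early boot slowed down */
		pr_warn_once("HugeTLB: reservation exceeds %d%% of RAM, splitting boot-time allocation into two passes\n",
			     CONFIG_HUGETLB_ALLOC_SPLIT_RATIO);
		huge_pages = h->max_huge_pages * CONFIG_HUGETLB_ALLOC_SPLIT_RATIO / 100;
		remaining = h->max_huge_pages - huge_pages;
	} else {
		huge_pages = h->max_huge_pages;
		remaining = 0;
	}

Keeping the ratio a plain Kconfig integer leaves the threshold a
boot-time constant with no new ABI; a cmdline override could be added
later if anyone asks for it.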
Thanks
On 8/22/2025 3:28 PM, lirongqing wrote:
> From: Li RongQing <lirongqing@...du.com>
>
> When the total reserved hugepages account for 95% or more of system RAM
> (common in cloud computing on physical servers), allocating them all in one
> go can lead to OOM or failure to allocate huge pages during early boot.
>
> The previous hugetlb vmemmap batching change (91f386bf0772) can worsen
> peak memory pressure under these conditions by deferring page frees,
> exacerbating allocation failures. To prevent this, split the allocation
> into two batches (90% first, then the remaining 10%) whenever
> huge_reserved_pages > totalram_pages() * 90 / 100.
>
> This change does not alter the number of padata worker threads per batch;
> it merely introduces a second round of padata_do_multithreaded(). The added
> overhead of restarting the worker threads is minimal.
>
> Before:
> [ 8.423187] HugeTLB: allocation took 1584ms with hugepage_allocation_threads=48
> [ 8.431189] HugeTLB: allocating 385920 of page size 2.00 MiB failed. Only allocated 385296 hugepages.
>
> After:
> [ 8.740201] HugeTLB: allocation took 1900ms with hugepage_allocation_threads=48
> [ 8.748266] HugeTLB: registered 2.00 MiB page size, pre-allocated 385920 pages
>
> Fixes: 91f386bf0772 ("hugetlb: batch freeing of vmemmap pages")
>
> Co-developed-by: Wenjie Xu <xuwenjie04@...du.com>
> Signed-off-by: Wenjie Xu <xuwenjie04@...du.com>
> Signed-off-by: Li RongQing <lirongqing@...du.com>
> ---
> mm/hugetlb.c | 21 +++++++++++++++++++--
>  1 file changed, 19 insertions(+), 2 deletions(-)
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 753f99b..a86d3a0 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -3587,12 +3587,23 @@ static unsigned long __init hugetlb_pages_alloc_boot(struct hstate *h)
>  		.numa_aware	= true
>  	};
>  
> +	unsigned long huge_reserved_pages = h->max_huge_pages << h->order;
> +	unsigned long huge_pages, remaining, total_pages;
>  	unsigned long jiffies_start;
>  	unsigned long jiffies_end;
>  
> +	total_pages = totalram_pages() * 90 / 100;
> +	if (huge_reserved_pages > total_pages) {
> +		huge_pages = h->max_huge_pages * 90 / 100;
> +		remaining = h->max_huge_pages - huge_pages;
> +	} else {
> +		huge_pages = h->max_huge_pages;
> +		remaining = 0;
> +	}
> +
>  	job.thread_fn	= hugetlb_pages_alloc_boot_node;
>  	job.start	= 0;
> -	job.size	= h->max_huge_pages;
> +	job.size	= huge_pages;
>  
>  	/*
>  	 * job.max_threads is 25% of the available cpu threads by default.
> @@ -3616,10 +3627,16 @@ static unsigned long __init hugetlb_pages_alloc_boot(struct hstate *h)
>  	}
>  
>  	job.max_threads	= hugepage_allocation_threads;
> -	job.min_chunk	= h->max_huge_pages / hugepage_allocation_threads;
> +	job.min_chunk	= huge_pages / hugepage_allocation_threads;
>  
>  	jiffies_start = jiffies;
>  	padata_do_multithreaded(&job);
> +	if (remaining) {
> +		job.start	= huge_pages;
> +		job.size	= remaining;
> +		job.min_chunk	= remaining / hugepage_allocation_threads;
> +		padata_do_multithreaded(&job);
> +	}
>  	jiffies_end = jiffies;
>  
>  	pr_info("HugeTLB: allocation took %dms with hugepage_allocation_threads=%ld\n",
> --
> 2.9.4
>
>