Message-ID: <Z_Uqiu75bXhqpwm4@localhost.localdomain>
Date: Tue, 8 Apr 2025 15:54:18 +0200
From: Oscar Salvador <osalvador@...e.de>
To: Frank van der Linden <fvdl@...gle.com>
Cc: akpm@...ux-foundation.org, muchun.song@...ux.dev, linux-mm@...ck.org,
linux-kernel@...r.kernel.org, david@...hat.com, luizcap@...hat.com
Subject: Re: [PATCH] mm/hugetlb: use separate nodemask for bootmem allocations
On Wed, Apr 02, 2025 at 08:56:13PM +0000, Frank van der Linden wrote:
> Hugetlb boot allocation has used online nodes for allocation since
> commit de55996d7188 ("mm/hugetlb: use online nodes for bootmem
> allocation"). This was needed to be able to do the allocations
> earlier in boot, before N_MEMORY was set.
>
> This might lead to a different distribution of gigantic hugepages
> across NUMA nodes if there are memoryless nodes in the system.
>
> What happens is that the memoryless nodes are tried, but then
> the memblock allocation fails and falls back, which usually means
> that the node that has the highest physical address available
> will be used (top-down allocation). While this will end up
> getting the same number of hugetlb pages, they might not be
> distributed the same way. The fallback for each memoryless
> node might not end up coming from the same node as the
> successful round-robin allocation from N_MEMORY nodes.
>
> While administrators that rely on having a specific number of
> hugepages per node should use the hugepages=N:X syntax, it's
> better not to change the old behavior for the plain hugepages=N
> case.
>
> To do this, construct a nodemask for hugetlb bootmem purposes
> only, containing nodes that have memory. Then use that
> for round-robin bootmem allocations.
>
> This saves some cycles, and the added advantage here is that
> hugetlb_cma can use it too, avoiding the older issue of
> pointless attempts to create a CMA area for memoryless nodes
> (which will also cause the per-node CMA area size to be too
> small).
Hi Frank,
Makes sense.
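Just to make sure I am reading the approach right, I picture the
construction of that nodemask roughly like the sketch below.
hugetlb_bootmem_nodes is the nodemask the patch introduces; the helper
name and the memblock walk are my own guesses, not necessarily what the
patch actually does:

/* assumes <linux/memblock.h>, <linux/nodemask.h>, <linux/numa.h> */
static nodemask_t hugetlb_bootmem_nodes __initdata;

/* sketch only: collect every node that memblock reports memory on */
static void __init hugetlb_bootmem_set_nodes(void)
{
        struct memblock_region *r;
        int nid;

        nodes_clear(hugetlb_bootmem_nodes);
        for_each_mem_region(r) {
                nid = memblock_get_region_node(r);
                if (nid != NUMA_NO_NODE)
                        node_set(nid, hugetlb_bootmem_nodes);
        }
}

That way the round-robin can never land on a memoryless node and fall
back top-down in the first place.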
There is something I do not quite understand, though:
> @@ -5012,7 +5039,6 @@ void __init hugetlb_bootmem_alloc(void)
>
> for_each_hstate(h) {
> h->next_nid_to_alloc = first_online_node;
> - h->next_nid_to_free = first_online_node;
Why are you dropping the initialization of next_nid_to_free here? I guess
it is because we do not use it during boot time, and you already set it to
first_memory_node further down the road in hugetlb_init_hstates.
And is the reason you keep initializing next_nid_to_alloc here that
first_online_node may still be part of hugetlb_bootmem_nodes?
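In other words, restating my understanding of where the two cursors end
up being initialized, heavily simplified (only the two assignments,
everything else elided):

        struct hstate *h;

        /* early, in hugetlb_bootmem_alloc(): N_MEMORY is not populated yet */
        for_each_hstate(h)
                h->next_nid_to_alloc = first_online_node;

        /* later, on the hugetlb_init_hstates() path: N_MEMORY is populated */
        for_each_hstate(h)
                h->next_nid_to_free = first_memory_node;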
--
Oscar Salvador
SUSE Labs