Message-ID: <96f7d3a2-2d85-442c-a9f7-e558d4a2ba06@redhat.com>
Date: Wed, 4 Jun 2025 11:38:34 +0200
From: David Hildenbrand <david@...hat.com>
To: Donet Tom <donettom@...ux.ibm.com>,
Andrew Morton <akpm@...ux-foundation.org>, Mike Rapoport <rppt@...nel.org>,
Oscar Salvador <osalvador@...e.de>, Zi Yan <ziy@...dia.com>,
Greg Kroah-Hartman <gregkh@...uxfoundation.org>
Cc: Ritesh Harjani <ritesh.list@...il.com>, linux-mm@...ck.org,
linux-kernel@...r.kernel.org, "Rafael J . Wysocki" <rafael@...nel.org>,
Danilo Krummrich <dakr@...nel.org>,
Jonathan Cameron <Jonathan.Cameron@...wei.com>,
Alison Schofield <alison.schofield@...el.com>,
Yury Norov <yury.norov@...il.com>, Dave Jiang <dave.jiang@...el.com>,
Madhavan Srinivasan <maddy@...ux.ibm.com>, Nilay Shroff
<nilay@...ux.ibm.com>, linuxppc-dev@...ts.ozlabs.org
Subject: Re: [PATCH v7 1/5] drivers/base/node: Optimize memory block
registration to reduce boot time
On 28.05.25 19:18, Donet Tom wrote:
> During node device initialization, `memory blocks` are registered under
> each NUMA node. The `memory blocks` to be registered are identified using
> the node’s start and end PFNs, which are obtained from the node's pg_data.
>
> However, not all PFNs within this range necessarily belong to the same
> node—some may belong to other nodes. Additionally, due to the
> discontiguous nature of physical memory, certain sections within a
> `memory block` may be absent.
>
> As a result, `memory blocks` that fall between a node’s start and end
> PFNs may span across multiple nodes, and some sections within those blocks
> may be missing. `Memory blocks` have a fixed size, which is architecture
> dependent.
>
> Due to these considerations, the memory block registration is currently
> performed as follows:
>
> for_each_online_node(nid):
>     start_pfn = pgdat->node_start_pfn;
>     end_pfn = pgdat->node_start_pfn + node_spanned_pages;
>     for_each_memory_block_between(PFN_PHYS(start_pfn), PFN_PHYS(end_pfn))
>         mem_blk = memory_block_id(pfn_to_section_nr(pfn));
>         pfn_mb_start = section_nr_to_pfn(mem_blk->start_section_nr)
>         pfn_mb_end = pfn_mb_start + memory_block_pfns - 1
>         for (pfn = pfn_mb_start; pfn <= pfn_mb_end; pfn++):
>             if (get_nid_for_pfn(pfn) != nid):
>                 continue;
>             else
>                 do_register_memory_block_under_node(nid, mem_blk,
>                                                     MEMINIT_EARLY);
>
> Here, we derive the start and end PFNs from the node's pg_data, then
> determine the memory blocks that may belong to the node. For each
> `memory block` in this range, we inspect all PFNs it contains and check
> their associated NUMA node ID. If a PFN within the block matches the
> current node, the memory block is registered under that node.
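
A rough user-space sketch of the per-PFN scan described above, for illustration only. Every name here (toy_pfn_nid, TOY_BLOCK_PFNS, toy_register_by_pfn_scan, ...) is hypothetical and not a kernel API. The static table stands in for get_nid_for_pfn(), a bitmask stands in for registration, and the lookup counter only exists to make the cost visible.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model only: none of these names exist in the kernel. */
#define TOY_BLOCK_PFNS 4UL   /* assumed PFNs per memory block */
#define TOY_MAX_PFN    16UL  /* assumed size of the toy PFN space */

/* PFN -> node ID; -1 marks a hole in physical memory. */
static const int toy_pfn_nid[TOY_MAX_PFN] = {
     0,  0,  0,  0,   /* block 0: node 0 only */
     0,  0,  1,  1,   /* block 1: shared by nodes 0 and 1 */
    -1, -1, -1, -1,   /* block 2: hole */
     1,  1,  1,  1,   /* block 3: node 1 only */
};

/* Return 1 as soon as one PFN of the block belongs to nid, else 0. */
static int toy_block_has_nid(unsigned long block, int nid,
                             unsigned long *lookups)
{
    unsigned long start = block * TOY_BLOCK_PFNS;
    unsigned long end = start + TOY_BLOCK_PFNS - 1; /* inclusive */

    for (unsigned long pfn = start; pfn <= end; pfn++) {
        (*lookups)++;
        if (toy_pfn_nid[pfn] == nid)
            return 1;
    }
    return 0;
}

/*
 * Walk every block overlapping the node span [start_pfn, end_pfn) and
 * return a bitmask of the blocks that would be registered under nid.
 */
static unsigned toy_register_by_pfn_scan(int nid, unsigned long start_pfn,
                                         unsigned long end_pfn,
                                         unsigned long *lookups)
{
    unsigned mask = 0;

    assert(start_pfn < end_pfn && end_pfn <= TOY_MAX_PFN);
    for (unsigned long b = start_pfn / TOY_BLOCK_PFNS;
         b <= (end_pfn - 1) / TOY_BLOCK_PFNS; b++) {
        if (toy_block_has_nid(b, nid, lookups))
            mask |= 1u << b;
    }
    return mask;
}
```

With this toy layout, scanning node 1 over the span [6, 16) performs eight lookups to register two blocks; the hole in block 2 is walked in full without a match.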
>
> If CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled, get_nid_for_pfn() performs
> a binary search in the `memblock regions` to determine the NUMA node ID
> for a given PFN. If it is not enabled, the node ID is retrieved directly
> from the struct page.
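
To make the lookup cost concrete, here is a minimal sketch of a binary search over sorted, non-overlapping regions. It is an assumption-laden stand-in, not the kernel's implementation: struct toy_region and toy_search_pfn_nid() are hypothetical names, and ranges are expressed directly in PFNs with an exclusive end.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical region descriptor; not the kernel's memblock structure. */
struct toy_region {
    unsigned long start_pfn; /* first PFN in the region */
    unsigned long end_pfn;   /* one past the last PFN (exclusive) */
    int nid;                 /* node that owns the whole region */
};

/*
 * Binary search over regions sorted by start_pfn and not overlapping.
 * Returns the owning node ID, or -1 if the PFN falls into a hole.
 */
static int toy_search_pfn_nid(const struct toy_region *regions, size_t count,
                              unsigned long pfn)
{
    size_t lo = 0, hi = count;

    assert(regions != NULL || count == 0);
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (pfn < regions[mid].start_pfn)
            hi = mid;          /* PFN lies before this region */
        else if (pfn >= regions[mid].end_pfn)
            lo = mid + 1;      /* PFN lies after this region */
        else
            return regions[mid].nid;
    }
    return -1;
}
```

Each call costs O(log n) in the number of regions, and the per-PFN scan issues one such call for every PFN it inspects.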
>
> On large systems, this process can become time-consuming, especially since
> we iterate over each `memory block` and all PFNs within it until a match is
> found. When CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled, the additional
> overhead of the binary search increases the execution time significantly,
> potentially leading to soft lockups during boot.
>
> In this patch, we iterate over the `memblock regions` to identify the
> `memory blocks` that belong to the current NUMA node. `memblock regions`
> are contiguous memory ranges, each associated with a single NUMA node, and
> they do not span across multiple nodes.
>
> for_each_memory_region(r): // r => region
>     if (!node_online(r->nid)):
>         continue;
>     else
>         for_each_memory_block_between(r->base, r->base + r->size - 1):
>             do_register_memory_block_under_node(r->nid, mem_blk, MEMINIT_EARLY);
>
> We iterate over all memblock regions, and if the node associated with the
> region is online, we calculate the start and end memory blocks based on the
> region's start and end PFNs. We then register all the memory blocks within
> that range under the region node.
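
The region-based walk can be sketched in user space as follows. This is a toy model under stated assumptions, not the patch itself: struct toy_memregion, TOY_BLOCK_BYTES, TOY_MAX_NODES and toy_register_by_region() are hypothetical, the online nodes are passed in as a bitmask, and "registering" a block just sets a bit in a per-node mask.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model only: none of these names exist in the kernel. */
#define TOY_BLOCK_BYTES 0x1000UL /* assumed memory block size in bytes */
#define TOY_MAX_NODES   4        /* assumed upper bound on node IDs */

struct toy_memregion {
    unsigned long base; /* physical start address */
    unsigned long size; /* length in bytes, must be non-zero */
    int nid;            /* the single node owning this region */
};

/*
 * For each region whose node is online, mark every memory block that
 * overlaps [base, base + size - 1] in that node's bitmask. No per-PFN
 * node lookup is needed, because a region never spans nodes.
 */
static void toy_register_by_region(const struct toy_memregion *regs, size_t n,
                                   unsigned online_mask,
                                   unsigned node_blocks[TOY_MAX_NODES])
{
    for (size_t i = 0; i < n; i++) {
        const struct toy_memregion *r = &regs[i];
        unsigned long first, last;

        assert(r->size > 0 && r->nid >= 0 && r->nid < TOY_MAX_NODES);
        if (!(online_mask & (1u << r->nid)))
            continue; /* skip regions of offline nodes */

        first = r->base / TOY_BLOCK_BYTES;
        last = (r->base + r->size - 1) / TOY_BLOCK_BYTES;
        assert(last < 32); /* keep the shift below within the mask */

        for (unsigned long b = first; b <= last; b++)
            node_blocks[r->nid] |= 1u << b;
    }
}
```

The work here scales with the number of regions and the blocks they cover rather than with the number of PFNs, and a block shared by two nodes is simply marked once under each of them.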
>
> Test results on my system with 32TB RAM
> =======================================
> 1. Boot time with CONFIG_DEFERRED_STRUCT_PAGE_INIT enabled.
>
> Without this patch
> ------------------
> Startup finished in 1min 16.528s (kernel)
>
> With this patch
> ---------------
> Startup finished in 17.236s (kernel) - 78% Improvement
>
> 2. Boot time with CONFIG_DEFERRED_STRUCT_PAGE_INIT disabled.
>
> Without this patch
> ------------------
> Startup finished in 28.320s (kernel)
>
> With this patch
> ---------------
> Startup finished in 15.621s (kernel) - 46% Improvement
>
> Acked-by: David Hildenbrand <david@...hat.com>
> Acked-by: Oscar Salvador <osalvador@...e.de>
> Acked-by: Mike Rapoport (Microsoft) <rppt@...nel.org>
> Acked-by: Zi Yan <ziy@...dia.com>
> Signed-off-by: Donet Tom <donettom@...ux.ibm.com>
>
> ---
[...]
> #ifdef CONFIG_NUMA
> void memory_block_add_nid(struct memory_block *mem, int nid,
> enum meminit_context context);
> @@ -188,5 +206,4 @@ void memory_block_add_nid(struct memory_block *mem, int nid,
> * can sleep.
> */
> extern struct mutex text_mutex;
> -
^ Nit: unrelated change?
--
Cheers,
David / dhildenb