linux-kernel - Re: [PATCH v6 1/5] drivers/base/node: Optimize memory block registration to reduce boot time

Message-ID: <456ea698-aced-4f65-b5ba-37640632360e@linux.ibm.com>
Date: Mon, 26 May 2025 23:11:32 +0530
From: Donet Tom <donettom@...ux.ibm.com>
To: David Hildenbrand <david@...hat.com>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Mike Rapoport <rppt@...nel.org>, Oscar Salvador <osalvador@...e.de>,
        Zi Yan <ziy@...dia.com>,
        Greg Kroah-Hartman <gregkh@...uxfoundation.org>
Cc: Ritesh Harjani <ritesh.list@...il.com>, linux-mm@...ck.org,
        linux-kernel@...r.kernel.org, "Rafael J . Wysocki" <rafael@...nel.org>,
        Danilo Krummrich <dakr@...nel.org>,
        Jonathan Cameron <Jonathan.Cameron@...wei.com>,
        Alison Schofield <alison.schofield@...el.com>,
        Yury Norov <yury.norov@...il.com>, Dave Jiang <dave.jiang@...el.com>,
        Madhavan Srinivasan <maddy@...ux.ibm.com>,
        Nilay Shroff
 <nilay@...ux.ibm.com>, linuxppc-dev@...ts.ozlabs.org
Subject: Re: [PATCH v6 1/5] drivers/base/node: Optimize memory block
 registration to reduce boot time


On 5/26/25 10:47 PM, David Hildenbrand wrote:
> On 26.05.25 16:50, Donet Tom wrote:
>> During node device initialization, `memory blocks` are registered under
>> each NUMA node. The `memory blocks` to be registered are identified using
>> the node’s start and end PFNs, which are obtained from the node's pg_data.
>>
>> However, not all PFNs within this range necessarily belong to the same
>> node—some may belong to other nodes. Additionally, due to the
>> discontiguous nature of physical memory, certain sections within a
>> `memory block` may be absent.
>>
>> As a result, `memory blocks` that fall between a node’s start and end
>> PFNs may span across multiple nodes, and some sections within those blocks
>> may be missing. `Memory blocks` have a fixed size, which is architecture
>> dependent.
>>
>> Due to these considerations, the memory block registration is currently
>> performed as follows:
>>
>> for_each_online_node(nid):
>>      start_pfn = pgdat->node_start_pfn;
>>      end_pfn = pgdat->node_start_pfn + node_spanned_pages;
>>      for_each_memory_block_between(PFN_PHYS(start_pfn), PFN_PHYS(end_pfn))
>>          mem_blk = memory_block_id(pfn_to_section_nr(pfn));
>>          pfn_mb_start = section_nr_to_pfn(mem_blk->start_section_nr)
>>          pfn_mb_end = pfn_mb_start + memory_block_pfns - 1
>>          for (pfn = pfn_mb_start; pfn <= pfn_mb_end; pfn++):
>>              if (get_nid_for_pfn(pfn) != nid):
>>                  continue;
>>              else
>>                  do_register_memory_block_under_node(nid, mem_blk, MEMINIT_EARLY);
>>
>> Here, we derive the start and end PFNs from the node's pg_data, then
>> determine the memory blocks that may belong to the node. For each
>> `memory block` in this range, we inspect all PFNs it contains and check
>> their associated NUMA node ID. If a PFN within the block matches the
>> current node, the memory block is registered under that node.
>>
>> If CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled, get_nid_for_pfn() performs
>> a binary search in the `memblock regions` to determine the NUMA node ID
>> for a given PFN. If it is not enabled, the node ID is retrieved directly
>> from the struct page.
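
To make the lookup cost described above concrete, here is a minimal userspace sketch of a binary search over sorted, non-overlapping regions. It is not the kernel's implementation: `struct toy_region` and `toy_nid_for_addr()` are hypothetical names introduced only for illustration, and the sketch works on raw addresses rather than PFNs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a memblock region: [base, base + size) on node nid. */
struct toy_region {
	uint64_t base;
	uint64_t size;
	int nid;
};

/*
 * Binary search over regions sorted by base address.
 * Returns the node ID of the region containing addr, or -1 if addr
 * falls into a hole between regions.
 */
static int toy_nid_for_addr(const struct toy_region *r, size_t n, uint64_t addr)
{
	size_t lo = 0, hi = n;

	assert(r != NULL || n == 0);

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (addr < r[mid].base)
			hi = mid;		/* addr lies below this region */
		else if (addr >= r[mid].base + r[mid].size)
			lo = mid + 1;		/* addr lies above this region */
		else
			return r[mid].nid;	/* addr is inside this region */
	}
	return -1;
}
```

Each call costs O(log n) in the number of regions, which is cheap on its own but adds up when it is repeated for every PFN of every memory block.
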
>>
>> On large systems, this process can become time-consuming, especially since
>> we iterate over each `memory block` and all PFNs within it until a match is
>> found. When CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled, the additional
>> overhead of the binary search increases the execution time significantly,
>> potentially leading to soft lockups during boot.
>>
>> In this patch, we iterate over the `memblock regions` to identify the
>> `memory blocks` that belong to the current NUMA node. `memblock regions`
>> are contiguous memory ranges, each associated with a single NUMA node, and
>> they do not span across multiple nodes.
>>
>> for_each_memory_region(r): // r => region
>>    if (!node_online(r->nid)):
>>      continue;
>>    else
>>      for_each_memory_block_between(r->base, r->base + r->size - 1):
>>        do_register_memory_block_under_node(r->nid, mem_blk, MEMINIT_EARLY);
>>
>> We iterate over all memblock regions, and if the node associated with the
>> region is online, we calculate the start and end memory blocks based on the
>> region's start and end PFNs. We then register all the memory blocks within
>> that range under the region's node.
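
The region-based walk above can be modelled in userspace as follows. This is a sketch under stated assumptions, not the patch itself: the `toy_*` names, the 256 MiB block size and the fixed-size `registered` table are all hypothetical, and the block ID is computed by plain division instead of going through section numbers as the kernel helpers do.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative values only; the real block size is architecture dependent. */
#define TOY_BLOCK_SIZE	(256ULL << 20)	/* hypothetical 256 MiB memory block */
#define TOY_MAX_BLOCKS	64
#define TOY_MAX_NODES	4

/* Hypothetical stand-in for a memblock region. */
struct toy_region {
	uint64_t base;
	uint64_t size;
	int nid;
};

/* registered[nid][block_id] is true once the block is linked under nid. */
static bool registered[TOY_MAX_NODES][TOY_MAX_BLOCKS];

static uint64_t toy_phys_to_block_id(uint64_t phys)
{
	return phys / TOY_BLOCK_SIZE;
}

/* Register every block touched by each region under that region's node. */
static void toy_register_blocks(const struct toy_region *regions, int n,
				const bool *node_online)
{
	for (int i = 0; i < n; i++) {
		const struct toy_region *r = &regions[i];

		assert(r->size > 0);

		const uint64_t start_id = toy_phys_to_block_id(r->base);
		const uint64_t end_id = toy_phys_to_block_id(r->base + r->size - 1);

		if (!node_online[r->nid])
			continue;

		/* No per-PFN lookup: the region already tells us the node. */
		for (uint64_t id = start_id; id <= end_id && id < TOY_MAX_BLOCKS; id++)
			registered[r->nid][id] = true;
	}
}
```

With two adjacent regions on different nodes whose boundary falls in the middle of a block, that block ends up registered under both nodes, which matches the behaviour the commit message describes for blocks that span regions.
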
>>
>> Test results on my system with 32TB RAM
>> =======================================
>> 1. Boot time with CONFIG_DEFERRED_STRUCT_PAGE_INIT enabled.
>>
>> Without this patch
>> ------------------
>> Startup finished in 1min 16.528s (kernel)
>>
>> With this patch
>> ---------------
>> Startup finished in 17.236s (kernel) - 78% Improvement
>>
>> 2. Boot time with CONFIG_DEFERRED_STRUCT_PAGE_INIT disabled.
>>
>> Without this patch
>> ------------------
>> Startup finished in 28.320s (kernel)
>>
>> With this patch
>> ---------------
>> Startup finished in 15.621s (kernel) - 46% Improvement
>>
>> Acked-by: Oscar Salvador <osalvador@...e.de>
>> Acked-by: Mike Rapoport (Microsoft) <rppt@...nel.org>
>> Acked-by: Zi Yan <ziy@...dia.com>
>> Signed-off-by: Donet Tom <donettom@...ux.ibm.com>
>>
>
>
> Only a couple of nits:
>
>> diff --git a/drivers/base/node.c b/drivers/base/node.c
>> index cd13ef287011..20b6f4496e1b 100644
>> --- a/drivers/base/node.c
>> +++ b/drivers/base/node.c
>> @@ -20,6 +20,7 @@
>>   #include <linux/pm_runtime.h>
>>   #include <linux/swap.h>
>>   #include <linux/slab.h>
>> +#include <linux/memblock.h>
>>     static const struct bus_type node_subsys = {
>>       .name = "node",
>> @@ -850,6 +851,47 @@ void unregister_memory_block_under_nodes(struct memory_block *mem_blk)
>> kobject_name(&node_devices[mem_blk->nid]->dev.kobj));
>>   }
>>
>> +/*
>> + * register_memory_blocks_under_nodes : Register the memory blocks
>> + *                 under the nodes.
>> + *
>> + * This function registers all memory blocks to their corresponding nodes
>> + * based on the associated memory regions. Each memory region is tied to
>> + * a specific node and does not span multiple nodes. Therefore, all memory
>> + * blocks within a given region are considered to belong to that node. The
>> + * function iterates through each memory region and registers the memory
>> + * blocks contained within that region to the respective node. Since memory
>> + * blocks can span across multiple regions (and hence multiple nodes), a
>> + * single memory block may be registered under more than one node if it
>> + * overlaps with regions belonging to different nodes.
>
> a) Do we need excessive doc for that?
>
> b) It looks partially like kerneldoc, do we want to convert it to a
> proper one?
>
> /**
>  * register_memory_blocks_under_nodes - register all memory blocks
>  *                     under the corresponding nodes
>  *
>  ...
>
> c) Maybe add a line break .. or two to make it a bit more readable.


Sure David, I will change it to a proper comment.


>
>> + */
>> +static void register_memory_blocks_under_nodes(void)
>> +{
>> +    struct memblock_region *r;
>> +
>> +    for_each_mem_region(r) {
>> +        const unsigned long start_block_id = phys_to_block_id(r->base);
>> +        const unsigned long end_block_id = phys_to_block_id(r->base + r->size - 1);
>> +        unsigned long block_id;
>> +        int nid = memblock_get_region_node(r);
>
> const int nid = memblock_get_region_node(r);
> unsigned long block_id;


Sure. I will change it.


>
>> +
>> +        if (!node_online(nid))
>> +            continue;
>> +
>> +        for (block_id = start_block_id; block_id <= end_block_id; block_id++) {
>> +            struct memory_block *mem;
>> +
>> +            mem = find_memory_block_by_id(block_id);
>> +            if (!mem)
>> +                continue;
>> +
>> +            do_register_memory_block_under_node(nid, mem, MEMINIT_EARLY);
>> +            put_device(&mem->dev);
>> +        }
>> +
>> +    }
>> +}
>> +
>>   void register_memory_blocks_under_node(int nid, unsigned long start_pfn,
>>                          unsigned long end_pfn,
>>                          enum meminit_context context)
>> @@ -971,11 +1013,16 @@ void __init node_dev_init(void)
>>         /*
>>        * Create all node devices, which will properly link the node
>> -     * to applicable memory block devices and already created cpu devices.
>> +     * to already created cpu devices.
>>        */
>>       for_each_online_node(i) {
>> -        ret = register_one_node(i);
>> +        ret =  __register_one_node(i);
>>           if (ret)
>>               panic("%s() failed to add node: %d\n", __func__, ret);
>>       }
>> +
>> +    /*
>> +     * Link the node to memory block devices
>> +     */
>
>
> This comment is rather ... superfluous.  ... and it would fit into a
> single line.


I will remove it.


>
>> +    register_memory_blocks_under_nodes();
>>   }
>> diff --git a/include/linux/memory.h b/include/linux/memory.h
>> index 12daa6ec7d09..2a61088e17ad 100644
>> --- a/include/linux/memory.h
>> +++ b/include/linux/memory.h
>> @@ -171,12 +171,30 @@ struct memory_group *memory_group_find_by_id(int mgid);
>>   typedef int (*walk_memory_groups_func_t)(struct memory_group *, void *);
>>   int walk_dynamic_memory_groups(int nid, walk_memory_groups_func_t func,
>>                      struct memory_group *excluded, void *arg);
>> +struct memory_block *find_memory_block_by_id(unsigned long block_id);
>>   #define hotplug_memory_notifier(fn, pri) ({        \
>>       static __meminitdata struct notifier_block fn##_mem_nb =\
>>           { .notifier_call = fn, .priority = pri };\
>>       register_memory_notifier(&fn##_mem_nb);            \
>>   })
>>
>> +extern int sections_per_block;
>> +
>> +static inline unsigned long memory_block_id(unsigned long section_nr)
>> +{
>> +    return section_nr / sections_per_block;
>> +}
>> +
>> +static inline unsigned long pfn_to_block_id(unsigned long pfn)
>> +{
>> +    return memory_block_id(pfn_to_section_nr(pfn));
>> +}
>> +
>> +static inline unsigned long phys_to_block_id(unsigned long phys)
>> +{
>> +    return pfn_to_block_id(PFN_DOWN(phys));
>> +}
>> +
>>   #ifdef CONFIG_NUMA
>>   void memory_block_add_nid(struct memory_block *mem, int nid,
>>                 enum meminit_context context);
>> @@ -188,5 +206,4 @@ void memory_block_add_nid(struct memory_block *mem, int nid,
>>    * can sleep.
>>    */
>>   extern struct mutex text_mutex;
>> -
>
> Unrelated change.
>
>
> Thanks
>
> Acked-by: David Hildenbrand <david@...hat.com>


Thank you.