Message-ID: <b426c9f0-9d63-4764-825c-6d95e89353c0@linux.ibm.com>
Date: Tue, 11 Mar 2025 20:30:19 +0530
From: Donet Tom <donettom@...ux.ibm.com>
To: David Hildenbrand <david@...hat.com>,
Greg Kroah-Hartman <gregkh@...uxfoundation.org>
Cc: linux-kernel@...r.kernel.org, Ritesh Harjani <ritesh.list@...il.com>,
"Rafael J . Wysocki" <rafael@...nel.org>,
Danilo Krummrich <dakr@...nel.org>
Subject: Re: [PATCH] driver/base/node.c: Fix softlockups during the
initialization of large systems with interleaved memory blocks
On 3/11/25 2:59 PM, David Hildenbrand wrote:
> On 11.03.25 09:56, Donet Tom wrote:
>>
>> On 3/10/25 6:22 PM, Greg Kroah-Hartman wrote:
>>> On Mon, Mar 10, 2025 at 06:53:05AM -0500, Donet Tom wrote:
>>>> On large systems with more than 64TB of DRAM, if the memory blocks
>>>> are interleaved between nodes, node initialization (node_dev_init())
>>>> can take a long time, since it iterates over every memory block for
>>>> each node. If a memory block belongs to the node currently being
>>>> initialized, the nid lookup on the first PFN returns the matching
>>>> value and the block is registered right away. Otherwise, every PFN
>>>> of the block is checked for its nid. On non-preemptive kernels this
>>>> can result in a watchdog softlockup warning. Even though
>>>> CONFIG_PREEMPT_LAZY is enabled in kernels now [1], older stable
>>>> kernels may still need a fix to avoid these warnings during boot.
>>>>
>>>> This patch adds a cond_resched() call in node_dev_init() to avoid
>>>> this warning.
>>>>
>>>> node_dev_init()
>>>>   register_one_node
>>>>     register_memory_blocks_under_node
>>>>       walk_memory_blocks()
>>>>         register_mem_block_under_node_early
>>>>           get_nid_for_pfn
>>>>             early_pfn_to_nid
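[For reference, a rough sketch of the scan described above. It is
paraphrased from the upstream register_mem_block_under_node_early() /
get_nid_for_pfn() logic and is only illustrative, not the exact code of
any particular tree:]

    static int register_mem_block_under_node_early(struct memory_block *mem_blk,
                                                   void *arg)
    {
            unsigned long nr_pfns = memory_block_size_bytes() / PAGE_SIZE;
            unsigned long start_pfn = section_nr_to_pfn(mem_blk->start_section_nr);
            int nid = *(int *)arg;
            unsigned long pfn;

            for (pfn = start_pfn; pfn < start_pfn + nr_pfns; pfn++) {
                    /* nid lookup (early_pfn_to_nid()) for every PFN */
                    int page_nid = get_nid_for_pfn(pfn);

                    if (page_nid < 0 || page_nid != nid)
                            continue;   /* non-matching blocks end up scanning all PFNs */

                    /* first matching PFN: register the block and stop */
                    do_register_memory_block_under_node(nid, mem_blk, MEMINIT_EARLY);
                    return 0;
            }
            return 0;   /* block does not span this nid */
    }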
>>>>
>>>> On my system, node4 has memory blocks ranging from memory30351
>>>> to memory38524, plus memory128433. The memory blocks between
>>>> memory38524 and memory128433 do not belong to this node.
>>>>
>>>> walk_memory_blocks() therefore iterates over all memory blocks in
>>>> this node's span, from memory30351 up to memory128433.
>>>> In register_mem_block_under_node_early(), for the blocks up to
>>>> memory38524 the first PFN returns the expected nid and the function
>>>> returns right away. But for the blocks after memory38524 and up to
>>>> memory128433, the loop walks every PFN and checks its nid; since the
>>>> nid never matches the requested nid, the loop runs to completion for
>>>> each of these blocks. This is what causes the soft lockups.
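[For a rough sense of scale, assuming 256 MiB memory blocks and a 64 KiB
page size (illustrative numbers, not taken from the report): each block
has 256 MiB / 64 KiB = 4096 PFNs, and the ~90,000 non-matching blocks
between memory38524 and memory128433 amount to on the order of 3.7e8
early_pfn_to_nid() lookups for this one node, with no scheduling point
in between on a non-preemptive kernel.]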
>>>>
>>>> [1]:
>>>> https://lore.kernel.org/linuxppc-dev/20241116192306.88217-1-sshegde@linux.ibm.com/
>>>> Fixes: 2848a28b0a60 ("drivers/base/node: consolidate node device subsystem initialization in node_dev_init()")
>>>> Signed-off-by: Donet Tom <donettom@...ux.ibm.com>
>>>> ---
>>>> drivers/base/node.c | 1 +
>>>> 1 file changed, 1 insertion(+)
>>>>
>>>> diff --git a/drivers/base/node.c b/drivers/base/node.c
>>>> index 0ea653fa3433..107eb508e28e 100644
>>>> --- a/drivers/base/node.c
>>>> +++ b/drivers/base/node.c
>>>> @@ -975,5 +975,6 @@ void __init node_dev_init(void)
>>>>          ret = register_one_node(i);
>>>>          if (ret)
>>>>                  panic("%s() failed to add node: %d\n", __func__, ret);
>>>> +        cond_resched();
>>> That's a horrible hack, sorry, but no, we can't sprinkle this around in
>>> random locations, especially as this is actually fixed by using a
>>> different scheduler model as you say.
>>>
>>> Why not just make the code faster so as to avoid the long time this
>>> takes?
>>
>>
>> Thanks Greg
>>
>> I have been looking at the code to see how it could be made faster
>> and the long delay avoided.
>>
>> In the code path below,
>>
>> register_one_node()
>>   register_memory_blocks_under_node()
>>     walk_memory_blocks()
>>       register_mem_block_under_node_early()
>>
>> walk_memory_blocks() iterates over all memblocks, and
>> register_mem_block_under_node_early() iterates over each block's PFNs
>> to find the page_nid.
>>
>> If the page_nid and the requested nid are the same, we will register
>> the memblock under the node and return.
>>
>> But if get_nid_for_pfn() returns a different nid (meaning the
>> memblock belongs to a different node), the loop iterates over all
>> PFNs of the memblock, checking the page_nid returned by
>> get_nid_for_pfn() against the requested nid each time.
>>
>> IIUC, since all PFNs of a memblock return the same page_nid, we
>> should stop the loop if the page_nid returned does not match the
>> requested nid.
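[A minimal sketch of that idea, purely illustrative, and resting on the
assumption (questioned just below) that a memory block can belong to
only one node: inside the PFN loop, bail out as soon as a PFN resolves
to some other node.]

        page_nid = get_nid_for_pfn(pfn);
        if (page_nid < 0)
                continue;
        if (page_nid != nid)
                return 0;   /* assume the whole block is on another node */

        do_register_memory_block_under_node(nid, mem_blk, MEMINIT_EARLY);
        return 0;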
>
> Nodes can easily partially span "memory blocks". So your patch would
> break these setups?
Does this mean one memory block can be part of two or more nodes,
with some PFNs belonging to one node and the remaining PFNs
belonging to another?
In that case, the current implementation adds the memory block to
only one node. In register_mem_block_under_node_early(), if the
first PFN returns the expected nid, the memory block will
be registered under that node. It does not iterate over the other
PFNs. Is this because of the assumption that one memory block
cannot be part of multiple nodes?
If one memory block cannot be part of multiple nodes, then we can
break if get_nid_for_pfn() returns the wrong nid, right?
>
> But I agree that iterating all pages is rather nasty. I wonder if we
> could just walk all memblocks in the range?
>
> early_pfn_to_nid()->__early_pfn_to_nid() would look up the memblock ...
> for each PFN. Testing a range instead could be better.
>
> Something like "early_pfn_range_spans_nid()" could be useful for that.
Do you mean we should do it section by section within a memory block?
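[Something along these lines, perhaps? A purely hypothetical sketch: the
helper name is taken from David's suggestion above, and the body only
assumes the existing for_each_mem_pfn_range() memblock iterator.]

    static bool __init early_pfn_range_spans_nid(unsigned long start_pfn,
                                                 unsigned long end_pfn, int nid)
    {
            unsigned long start, end;
            int i;

            /* Walk only the memblock ranges that belong to @nid ... */
            for_each_mem_pfn_range(i, nid, &start, &end, NULL) {
                    /* ... and test them for overlap with the block's PFN range. */
                    if (start < end_pfn && end > start_pfn)
                            return true;
            }
            return false;
    }

[That would replace the per-PFN early_pfn_to_nid() calls with a single
walk over the far fewer memblock ranges.]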
Thanks
Donet