Message-ID: <09f2a2bb-75ab-44e1-9e27-d4fa0c11a47b@redhat.com>
Date: Tue, 11 Mar 2025 10:29:27 +0100
From: David Hildenbrand <david@...hat.com>
To: Donet Tom <donettom@...ux.ibm.com>,
 Greg Kroah-Hartman <gregkh@...uxfoundation.org>
Cc: linux-kernel@...r.kernel.org, Ritesh Harjani <ritesh.list@...il.com>,
 "Rafael J . Wysocki" <rafael@...nel.org>, Danilo Krummrich <dakr@...nel.org>
Subject: Re: [PATCH] drivers/base/node.c: Fix softlockups during the
 initialization of large systems with interleaved memory blocks

On 11.03.25 09:56, Donet Tom wrote:
> 
> On 3/10/25 6:22 PM, Greg Kroah-Hartman wrote:
>> On Mon, Mar 10, 2025 at 06:53:05AM -0500, Donet Tom wrote:
>>> On large systems with more than 64TB of DRAM, if the memory blocks
>>> are interleaved, node initialization (node_dev_init()) can take a
>>> long time, since it iterates over each memory block. If a memory
>>> block belongs to the node currently being iterated, the first
>>> pfn_to_nid() call provides the correct value and we return early.
>>> Otherwise, we iterate over all PFNs of the block and check the nid
>>> of each. On non-preemptive kernels, this can result in a watchdog
>>> softlockup warning. Even though CONFIG_PREEMPT_LAZY is now enabled
>>> in kernels [1], older stable kernels may still need a fix to avoid
>>> these warnings during boot.
>>>
>>> This patch adds a cond_resched() call in node_dev_init() to avoid
>>> this warning.
>>>
>>> node_dev_init()
>>>       register_one_node
>>>           register_memory_blocks_under_node
>>>               walk_memory_blocks()
>>>                   register_mem_block_under_node_early
>>>                       get_nid_for_pfn
>>>                           early_pfn_to_nid
>>>
>>> On my system, node4 has memory blocks ranging from memory30351
>>> to memory38524, plus memory128433. The memory blocks between
>>> memory38524 and memory128433 do not belong to this node.
>>>
>>> In walk_memory_blocks() we iterate over all memory blocks from
>>> memory30351 to memory128433.
>>> In register_mem_block_under_node_early(), for blocks up to
>>> memory38524 the first pfn returns the matching nid and the
>>> function returns right away. But for the blocks between
>>> memory38524 and memory128433, the loop iterates through every pfn
>>> and checks its nid. Since the nid never matches the required nid,
>>> the loop runs to the end of each block. This causes the soft
>>> lockups.
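>>>
>>> The loop in question looks roughly like this (simplified from
>>> register_mem_block_under_node_early(); exact code differs across
>>> kernel versions):
>>>
>>> 	for (pfn = start_pfn; pfn <= end_pfn; pfn++) {
>>> 		int page_nid;
>>>
>>> 		/* Each call resolves the nid for a single pfn. */
>>> 		page_nid = get_nid_for_pfn(pfn);
>>> 		if (page_nid < 0)
>>> 			continue;
>>> 		/* Mismatch: keep scanning every remaining pfn. */
>>> 		if (page_nid != nid)
>>> 			continue;
>>>
>>> 		do_register_memory_block_under_node(nid, mem_blk,
>>> 						    MEMINIT_EARLY);
>>> 		return 0;
>>> 	}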
>>>
>>> [1]: https://lore.kernel.org/linuxppc-dev/20241116192306.88217-1-sshegde@linux.ibm.com/
>>> Fixes: 2848a28b0a60 ("drivers/base/node: consolidate node device subsystem initialization in node_dev_init()")
>>> Signed-off-by: Donet Tom <donettom@...ux.ibm.com>
>>> ---
>>>    drivers/base/node.c | 1 +
>>>    1 file changed, 1 insertion(+)
>>>
>>> diff --git a/drivers/base/node.c b/drivers/base/node.c
>>> index 0ea653fa3433..107eb508e28e 100644
>>> --- a/drivers/base/node.c
>>> +++ b/drivers/base/node.c
>>> @@ -975,5 +975,6 @@ void __init node_dev_init(void)
>>>    		ret = register_one_node(i);
>>>    		if (ret)
>>>    			panic("%s() failed to add node: %d\n", __func__, ret);
>>> +		cond_resched();
>> That's a horrible hack, sorry, but no, we can't sprinkle this around in
>> random locations, especially as this is actually fixed by using a
>> different scheduler model as you say.
>>
>> Why not just make the code faster so as to avoid the long time this
>> takes?
> 
> 
> Thanks Greg
> 
> I was checking the code to see how to make it faster and avoid
> the long time it takes.
>
> In the code path below,
> 
> register_one_node()
>       register_memory_blocks_under_node()
>           walk_memory_blocks()
>               register_mem_block_under_node_early()
> 
> walk_memory_blocks() is iterating over all memblocks, and
> register_mem_block_under_node_early() is iterating over the PFNs
> to find the page_nid
> 
> If the page_nid and the requested nid are the same, we will register
> the memblock under the node and return.
> 
> But if get_nid_for_pfn() returns a different nid (meaning the
> memblock belongs to a different node), then the loop iterates
> over all PFNs of the memblock, checking whether the page_nid
> returned by get_nid_for_pfn() matches the requested nid.
> 
> IIUC, since all PFNs of a memblock return the same page_nid, we
> should stop the loop if the page_nid returned does not match the
> requested nid.
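>
> Roughly, in register_mem_block_under_node_early() (untested sketch
> of the idea):
>
> 	page_nid = get_nid_for_pfn(pfn);
> 	if (page_nid < 0)
> 		continue;
> 	if (page_nid != nid)
> 		break;	/* was: continue */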

Nodes can easily partially span "memory blocks". So your patch would 
break these setups?

But I agree that iterating all pages is rather nasty. I wonder if we 
could just walk all memblocks in the range?

early_pfn_to_nid()->__early_pfn_to_nid() would look up the memblock ... 
for each PFN. Testing a whole range instead could be better.

Something like "early_pfn_range_spans_nid()" could be useful for that.
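
Completely untested, but something along these lines, walking the
memblock regions instead of individual PFNs (the helper does not
exist yet, so the name and details are made up):

/* Returns true if any memory on @nid falls in [@start_pfn, @end_pfn). */
static bool __init early_pfn_range_spans_nid(int nid,
		unsigned long start_pfn, unsigned long end_pfn)
{
	unsigned long r_start, r_end;
	int i;

	/*
	 * Walk only the memblock.memory regions that belong to @nid:
	 * one iteration per region instead of one nid lookup per PFN.
	 */
	for_each_mem_pfn_range(i, nid, &r_start, &r_end, NULL) {
		if (r_start >= end_pfn)
			break;		/* regions are sorted by base */
		if (r_end > start_pfn)
			return true;	/* region overlaps the range */
	}
	return false;
}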

-- 
Cheers,

David / dhildenb

