Message-ID: <a30e4786-ed0e-4460-8b95-c56ab1d790ea@linux.ibm.com>
Date: Tue, 11 Mar 2025 14:26:36 +0530
From: Donet Tom <donettom@...ux.ibm.com>
To: Greg Kroah-Hartman <gregkh@...uxfoundation.org>,
David Hildenbrand <david@...hat.com>
Cc: linux-kernel@...r.kernel.org, Ritesh Harjani <ritesh.list@...il.com>,
"Rafael J . Wysocki" <rafael@...nel.org>,
Danilo Krummrich <dakr@...nel.org>
Subject: Re: [PATCH] driver/base/node.c: Fix softlockups during the
initialization of large systems with interleaved memory blocks
On 3/10/25 6:22 PM, Greg Kroah-Hartman wrote:
> On Mon, Mar 10, 2025 at 06:53:05AM -0500, Donet Tom wrote:
>> On large systems with more than 64TB of DRAM, if the memory blocks
>> are interleaved, node initialization (node_dev_init()) could take
>> a long time since it iterates over each memory block. If the memory
>> block belongs to the current iterating node, the first pfn_to_nid
>> will provide the correct value. Otherwise, it will iterate over all
>> PFNs and check the nid. On non-preemptive kernels, this can result
>> in a watchdog softlockup warning. Even though CONFIG_PREEMPT_LAZY
>> is enabled in kernels now [1], we may still need to fix older
>> stable kernels to avoid encountering these kernel warnings during
>> boot.
>>
>> This patch adds a cond_resched() call in node_dev_init() to avoid
>> this warning.
>>
>> node_dev_init()
>>   register_one_node
>>     register_memory_blocks_under_node
>>       walk_memory_blocks()
>>         register_mem_block_under_node_early
>>           get_nid_for_pfn
>>             early_pfn_to_nid
>>
>> On my system, node4 has memory blocks ranging from memory30351
>> to memory38524, plus memory128433. The memory blocks between
>> memory38524 and memory128433 do not belong to this node.
>>
>> In walk_memory_blocks() we iterate over all memblocks starting
>> from memory38524 to memory128433.
>> In register_mem_block_under_node_early(), up to memory38524, the
>> first pfn correctly returns the corresponding nid and the function
>> returns from there. But after memory38524 and until memory128433,
>> the loop iterates through each pfn and checks the nid. Since the nid
>> does not match the required nid, the loop continues. This causes
>> the soft lockups.
>>
>> [1]: https://lore.kernel.org/linuxppc-dev/20241116192306.88217-1-sshegde@linux.ibm.com/
>> Fixes: 2848a28b0a60 ("drivers/base/node: consolidate node device subsystem initialization in node_dev_init()")
>> Signed-off-by: Donet Tom <donettom@...ux.ibm.com>
>> ---
>> drivers/base/node.c | 1 +
>> 1 file changed, 1 insertion(+)
>>
>> diff --git a/drivers/base/node.c b/drivers/base/node.c
>> index 0ea653fa3433..107eb508e28e 100644
>> --- a/drivers/base/node.c
>> +++ b/drivers/base/node.c
>> @@ -975,5 +975,6 @@ void __init node_dev_init(void)
>>  		ret = register_one_node(i);
>>  		if (ret)
>>  			panic("%s() failed to add node: %d\n", __func__, ret);
>> +		cond_resched();
> That's a horrible hack, sorry, but no, we can't sprinkle this around in
> random locations, especially as this is actually fixed by using a
> different scheduler model as you say.
>
> Why not just make the code faster so as to avoid the long time this
> takes?
Thanks, Greg.

I looked at the code to see how it could be made faster, so that
this path no longer takes so long.

In the code path below:

register_one_node()
  register_memory_blocks_under_node()
    walk_memory_blocks()
      register_mem_block_under_node_early()

walk_memory_blocks() iterates over all memory blocks, and
register_mem_block_under_node_early() iterates over the PFNs of each
memory block to find the page_nid.

If the page_nid and the requested nid are the same, we register
the memory block under the node and return.

But if get_nid_for_pfn() returns a different nid (meaning the memory
block belongs to a different node), the loop keeps iterating over
all remaining PFNs of the memory block, checking each time whether
the page_nid returned by get_nid_for_pfn() matches the requested nid.

IIUC, since all PFNs of a memory block return the same page_nid, we
should stop the loop as soon as the returned page_nid does not match
the requested nid.

With the change below, the softlockups are no longer observed.

What are your thoughts on this?

diff --git a/drivers/base/node.c b/drivers/base/node.c
index 0ea653fa3433..5ca417e8672e 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -811,7 +811,7 @@ static int register_mem_block_under_node_early(struct memory_block *mem_blk,
 		if (page_nid < 0)
 			continue;
 		if (page_nid != nid)
-			continue;
+			break;
 		do_register_memory_block_under_node(nid, mem_blk, MEMINIT_EARLY);
 		return 0;
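
To make the cost difference concrete, here is a small userspace sketch of
the loop above. It is not kernel code: PFNS_PER_BLOCK, fake_nid_for_pfn()
and scan_block() are hypothetical names invented for illustration, and the
stub assumes that every PFN of a memory block reports the same nid, which
is exactly the assumption marked IIUC above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical number of PFNs per memory block; the real value depends
 * on the memory block size and the page size of the system. */
#define PFNS_PER_BLOCK 4096UL

/* Hypothetical stand-in for get_nid_for_pfn(): assumes every PFN of the
 * block belongs to block_nid. */
static int fake_nid_for_pfn(unsigned long pfn, int block_nid)
{
	(void)pfn;
	return block_nid;
}

/*
 * Model of the PFN loop in register_mem_block_under_node_early().
 * Returns the number of nid lookups performed for one memory block.
 * stop_on_mismatch == 0 models the existing "continue";
 * stop_on_mismatch != 0 models the proposed "break".
 * *registered is set to 1 if the block would be registered under nid.
 */
static unsigned long scan_block(unsigned long start_pfn, int block_nid,
				int nid, int stop_on_mismatch,
				int *registered)
{
	unsigned long pfn, lookups = 0;

	*registered = 0;
	for (pfn = start_pfn; pfn < start_pfn + PFNS_PER_BLOCK; pfn++) {
		int page_nid = fake_nid_for_pfn(pfn, block_nid);

		lookups++;
		if (page_nid < 0)
			continue;
		if (page_nid != nid) {
			if (stop_on_mismatch)
				break;
			continue;
		}
		/* First matching PFN: register the block and stop. */
		*registered = 1;
		return lookups;
	}
	/* Block does not belong to this node. */
	return lookups;
}
```

Under these assumptions, a block that belongs to the requested node costs
one lookup in both variants, while a block that belongs to another node
costs PFNS_PER_BLOCK lookups with "continue" and a single lookup with
"break". The sketch also shows the limit of the change: if a memory block
could contain PFNs from more than one node, "break" would stop before
reaching a later matching PFN.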
> thanks,
>
> greg k-h