Message-ID: <2a3683e5-0812-4333-b793-3180c61b8bc7@redhat.com>
Date: Tue, 11 Mar 2025 20:39:05 +0100
From: David Hildenbrand <david@...hat.com>
To: Donet Tom <donettom@...ux.ibm.com>,
Greg Kroah-Hartman <gregkh@...uxfoundation.org>
Cc: linux-kernel@...r.kernel.org, Ritesh Harjani <ritesh.list@...il.com>,
"Rafael J . Wysocki" <rafael@...nel.org>, Danilo Krummrich <dakr@...nel.org>
Subject: Re: [PATCH] driver/base/node.c: Fix softlockups during the
initialization of large systems with interleaved memory blocks
On 11.03.25 16:00, Donet Tom wrote:
>
> On 3/11/25 2:59 PM, David Hildenbrand wrote:
>> On 11.03.25 09:56, Donet Tom wrote:
>>>
>>> On 3/10/25 6:22 PM, Greg Kroah-Hartman wrote:
>>>> On Mon, Mar 10, 2025 at 06:53:05AM -0500, Donet Tom wrote:
>>>>> On large systems with more than 64TB of DRAM, if the memory blocks
>>>>> are interleaved, node initialization (node_dev_init()) can take a
>>>>> long time, since it iterates over each memory block. If the memory
>>>>> block belongs to the node currently being iterated, the first
>>>>> pfn_to_nid lookup returns the correct value and the scan stops.
>>>>> Otherwise, the loop iterates over all PFNs of the block, checking
>>>>> the nid of each. On non-preemptive kernels, this can result in a
>>>>> watchdog softlockup warning. Even though CONFIG_PREEMPT_LAZY is
>>>>> enabled in current kernels [1], we may still need to fix older
>>>>> stable kernels to avoid these warnings during boot.
>>>>>
>>>>> This patch adds a cond_resched() call in node_dev_init() to avoid
>>>>> this warning.
>>>>>
>>>>> node_dev_init()
>>>>> register_one_node
>>>>> register_memory_blocks_under_node
>>>>> walk_memory_blocks()
>>>>> register_mem_block_under_node_early
>>>>> get_nid_for_pfn
>>>>> early_pfn_to_nid
>>>>>
>>>>> In my system, node4 has memory blocks memory30351 through
>>>>> memory38524, plus memory128433. The memory blocks between
>>>>> memory38524 and memory128433 do not belong to this node.
>>>>>
>>>>> In walk_memory_blocks() we iterate over all memblocks starting
>>>>> from memory38524 to memory128433.
>>>>> In register_mem_block_under_node_early(), up to memory38524, the
>>>>> first pfn returns the correct nid and the function returns early.
>>>>> But from memory38524 up to memory128433, the loop iterates through
>>>>> every pfn and checks its nid. Since the nid never matches the
>>>>> requested one, the loop runs to completion for each block. This
>>>>> causes the soft lockups.
>>>>>
>>>>> [1]:
>>>>> https://lore.kernel.org/linuxppc-dev/20241116192306.88217-1-sshegde@linux.ibm.com/
>>>>> Fixes: 2848a28b0a60 ("drivers/base/node: consolidate node device
>>>>> subsystem initialization in node_dev_init()")
>>>>> Signed-off-by: Donet Tom <donettom@...ux.ibm.com>
>>>>> ---
>>>>> drivers/base/node.c | 1 +
>>>>> 1 file changed, 1 insertion(+)
>>>>>
>>>>> diff --git a/drivers/base/node.c b/drivers/base/node.c
>>>>> index 0ea653fa3433..107eb508e28e 100644
>>>>> --- a/drivers/base/node.c
>>>>> +++ b/drivers/base/node.c
>>>>> @@ -975,5 +975,6 @@ void __init node_dev_init(void)
>>>>> ret = register_one_node(i);
>>>>> if (ret)
>>>>> panic("%s() failed to add node: %d\n", __func__, ret);
>>>>> + cond_resched();
>>>> That's a horrible hack, sorry, but no, we can't sprinkle this around in
>>>> random locations, especially as this is actually fixed by using a
>>>> different scheduler model as you say.
>>>>
>>>> Why not just make the code faster so as to avoid the long time this
>>>> takes?
>>>
>>>
>>> Thanks Greg
>>>
>>> I was checking the code to see how to make it faster in order to
>>> avoid the long time it takes.
>>>
>>> In below code path
>>>
>>> register_one_node()
>>> register_memory_blocks_under_node()
>>> walk_memory_blocks()
>>> register_mem_block_under_node_early()
>>>
>>> walk_memory_blocks() is iterating over all memblocks, and
>>> register_mem_block_under_node_early() is iterating over the PFNs
>>> to find the page_nid
>>>
>>> If the page_nid and the requested nid are the same, we will register
>>> the memblock under the node and return.
>>>
>>> But if get_nid_for_pfn() returns a different nid (meaning the
>>> memblock belongs to a different node), then the loop iterates
>>> over all PFNs of the memblock, checking whether the page_nid
>>> returned by get_nid_for_pfn() matches the requested nid.
>>>
>>> IIUC, since all PFNs of a memblock return the same page_nid, we
>>> should stop the loop if the page_nid returned does not match the
>>> requested nid.
>>
>> Nodes can easily partially span "memory blocks". So your patch would
>> break these setups?
>
>
> Does this mean one memory block can be part of two or
> more nodes? Some PFNs belong to one node, and the remaining
> PFNs belong to another node?
Exactly.
Consider the following qemu cmdline as one example:
qemu-system-x86_64 --enable-kvm -smp 10 -M q35 -m 4G -hda
Fedora-Server-KVM-40-1.14.x86_64.qcow2 -nographic -object
memory-backend-ram,size=2000M,id=mem0 -object
memory-backend-ram,size=2096M,id=mem1 -numa node,cpus=0-4,memdev=mem0
-numa node,cpus=5-9,memdev=mem1
Inside the VM:
[root@...alhost ~]# ls /sys/devices/system/node/node0/
compact cpu4 meminfo memory12 memory3 memory8 subsystem
cpu0 cpulist memory0 memory13 memory4 memory9 uevent
cpu1 cpumap memory1 memory14 memory5 memory_failure vmstat
cpu2 distance memory10 memory15 memory6 numastat x86
cpu3 hugepages memory11 memory2 memory7 power
[root@...alhost ~]# ls /sys/devices/system/node/node1/
compact cpu9 meminfo memory35 memory40 memory45 power
cpu5 cpulist memory15 memory36 memory41 memory46 subsystem
cpu6 cpumap memory32 memory37 memory42 memory47 uevent
cpu7 distance memory33 memory38 memory43 memory_failure vmstat
cpu8 hugepages memory34 memory39 memory44 numastat x86
Observe how memory15 shows up for both nodes.
[root@...alhost ~]# ls /sys/devices/system/memory/memory15/
node0 online phys_index removable subsystem valid_zones
node1 phys_device power state uevent
Observe how both nodes are listed.
[root@...alhost ~]# cat /sys/devices/system/memory/memory15/valid_zones
none
And "valid_zones = none" indicates that this memory block cannot be
offlined, because it spans multiple zones (here: zones from multiple
nodes).
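To make the failure mode concrete, here is a hedged userspace model of such a shared memory block (the layout and function names are invented for illustration): breaking out of the scan at the first mismatching pfn would never register the block under the second node.

```c
#include <assert.h>
#include <stdbool.h>

/* Mock layout matching the VM example above: one memory block whose
 * first half belongs to node 0 and second half to node 1 (hypothetical
 * pfn numbers, for illustration only). */
static int mock_pfn_to_nid(unsigned long pfn)
{
	return pfn < 500 ? 0 : 1;
}

/* Proposed shortcut: give up as soon as one pfn reports another nid. */
static bool block_has_nid_break_early(unsigned long nr_pfns, int nid)
{
	for (unsigned long pfn = 0; pfn < nr_pfns; pfn++) {
		if (mock_pfn_to_nid(pfn) == nid)
			return true;
		break;	/* proposed early exit: stop on first mismatch */
	}
	return false;
}

/* Current behaviour: keep scanning every pfn until a match is found. */
static bool block_has_nid_full_scan(unsigned long nr_pfns, int nid)
{
	for (unsigned long pfn = 0; pfn < nr_pfns; pfn++)
		if (mock_pfn_to_nid(pfn) == nid)
			return true;
	return false;
}
```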
>
> In that case, the current implementation adds the memory block to
> only one node. In register_mem_block_under_node_early(), if the
> first PFN returns the correct expected nid, the memory block will
> be registered under that node. It does not iterate over the other
> PFNs. Is this because of the assumption that one memory block
> cannot be part of multiple nodes?
See my example above. But note that my test VM has
]# uname -a
Linux localhost.localdomain 6.11.10-200.fc40.x86_64 #1 SMP
PREEMPT_DYNAMIC Sat Nov 23 00:53:13 UTC 2024 x86_64 GNU/Linux
>
> If one memory block cannot be part of multiple nodes, then we can
> break if get_nid_for_pfn() returns the wrong nid, right?
Again, see above. Hopefully that makes it clearer.
>
>
>>
>> But I agree that iterating all pages is rather nasty. I wonder if we
>> could just walk all memblocks in the range?
>>
>> early_pfn_to_nid()->__early_pfn_to_nid() would lookup the memblock ...
>> for each PFN. Testing a range instead could be better.
>>
>> Something like "early_pfn_range_spans_nid()" could be useful for that.
>
> Do you mean we should do it section by section within a memory block?
All we want to know is if the memblock allocator ("early") thinks that
any part of the memory block device (memory_block_size_bytes()) belongs
to that node.
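A userspace sketch of what such an "early_pfn_range_spans_nid()" check could look like, assuming a table of (pfn range, nid) regions stands in for the memblock allocator's data (hypothetical data and a simplified API, not the kernel implementation):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Minimal stand-in for the memblock region table (invented data; the
 * kernel would walk its own memblock regions instead). */
struct mock_region {
	unsigned long start_pfn, end_pfn;	/* end exclusive */
	int nid;
};

static const struct mock_region regions[] = {
	{ 0,    500,  0 },	/* node 0 */
	{ 500,  1000, 1 },	/* node 1 shares the same memory block */
	{ 1000, 4000, 1 },
};

/* Sketch of the suggested helper: instead of asking pfn_to_nid() for
 * every pfn, test whether any region of the given nid overlaps the
 * memory block's pfn range. Cost is O(regions), not O(pfns). */
static bool early_pfn_range_spans_nid(unsigned long start_pfn,
				      unsigned long end_pfn, int nid)
{
	for (size_t i = 0; i < sizeof(regions) / sizeof(regions[0]); i++) {
		if (regions[i].nid == nid &&
		    regions[i].start_pfn < end_pfn &&
		    regions[i].end_pfn > start_pfn)
			return true;
	}
	return false;
}
```

An overlap test like this also handles the shared-block case from the VM example naturally: a block spanning two nodes simply overlaps a region of each.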
--
Cheers,
David / dhildenb