Message-ID: <2a3683e5-0812-4333-b793-3180c61b8bc7@redhat.com>
Date: Tue, 11 Mar 2025 20:39:05 +0100
From: David Hildenbrand <david@...hat.com>
To: Donet Tom <donettom@...ux.ibm.com>,
Greg Kroah-Hartman <gregkh@...uxfoundation.org>
Cc: linux-kernel@...r.kernel.org, Ritesh Harjani <ritesh.list@...il.com>,
"Rafael J . Wysocki" <rafael@...nel.org>, Danilo Krummrich <dakr@...nel.org>
Subject: Re: [PATCH] driver/base/node.c: Fix softlockups during the
initialization of large systems with interleaved memory blocks
On 11.03.25 16:00, Donet Tom wrote:
>
> On 3/11/25 2:59 PM, David Hildenbrand wrote:
>> On 11.03.25 09:56, Donet Tom wrote:
>>>
>>> On 3/10/25 6:22 PM, Greg Kroah-Hartman wrote:
>>>> On Mon, Mar 10, 2025 at 06:53:05AM -0500, Donet Tom wrote:
>>>>> On large systems with more than 64TB of DRAM, if the memory blocks
>>>>> are interleaved, node initialization (node_dev_init()) can take a
>>>>> long time, since it iterates over each memory block. If the memory
>>>>> block belongs to the node currently being iterated, the first
>>>>> pfn_to_nid lookup returns the correct value and the scan stops.
>>>>> Otherwise, the loop iterates over all PFNs of the block, checking
>>>>> the nid of each. On non-preemptive kernels, this can result in a
>>>>> watchdog softlockup warning. Even though CONFIG_PREEMPT_LAZY is
>>>>> enabled in current kernels [1], we may still need to fix older
>>>>> stable kernels to avoid these warnings during boot.
>>>>>
>>>>> This patch adds a cond_resched() call in node_dev_init() to avoid
>>>>> this warning.
>>>>>
>>>>> node_dev_init()
>>>>> register_one_node
>>>>> register_memory_blocks_under_node
>>>>> walk_memory_blocks()
>>>>> register_mem_block_under_node_early
>>>>> get_nid_for_pfn
>>>>> early_pfn_to_nid
>>>>>
>>>>> In my system, node4 has memory blocks memory30351 through
>>>>> memory38524, plus memory128433. The memory blocks between
>>>>> memory38524 and memory128433 do not belong to this node.
>>>>>
>>>>> In walk_memory_blocks() we iterate over all memblocks starting
>>>>> from memory38524 to memory128433.
>>>>> In register_mem_block_under_node_early(), up to memory38524, the
>>>>> first pfn returns the correct nid and the function returns early.
>>>>> But from memory38524 up to memory128433, the loop iterates through
>>>>> every pfn and checks its nid. Since the nid never matches the
>>>>> requested one, the loop runs to completion for each block. This
>>>>> causes the soft lockups.
>>>>>
>>>>> [1]:
>>>>> https://lore.kernel.org/linuxppc-dev/20241116192306.88217-1-sshegde@linux.ibm.com/
>>>>> Fixes: 2848a28b0a60 ("drivers/base/node: consolidate node device
>>>>> subsystem initialization in node_dev_init()")
>>>>> Signed-off-by: Donet Tom <donettom@...ux.ibm.com>
>>>>> ---
>>>>> drivers/base/node.c | 1 +
>>>>> 1 file changed, 1 insertion(+)
>>>>>
>>>>> diff --git a/drivers/base/node.c b/drivers/base/node.c
>>>>> index 0ea653fa3433..107eb508e28e 100644
>>>>> --- a/drivers/base/node.c
>>>>> +++ b/drivers/base/node.c
>>>>> @@ -975,5 +975,6 @@ void __init node_dev_init(void)
>>>>> ret = register_one_node(i);
>>>>> if (ret)
>>>>> panic("%s() failed to add node: %d\n", __func__, ret);
>>>>> + cond_resched();
>>>> That's a horrible hack, sorry, but no, we can't sprinkle this around in
>>>> random locations, especially as this is actually fixed by using a
>>>> different scheduler model as you say.
>>>>
>>>> Why not just make the code faster so as to avoid the long time this
>>>> takes?
>>>
>>>
>>> Thanks Greg
>>>
>>> I was checking the code to see how to make it faster in order to
>>> avoid the long time it takes.
>>>
>>> In below code path
>>>
>>> register_one_node()
>>> register_memory_blocks_under_node()
>>> walk_memory_blocks()
>>> register_mem_block_under_node_early()
>>>
>>> walk_memory_blocks() is iterating over all memblocks, and
>>> register_mem_block_under_node_early() is iterating over the PFNs
>>> to find the page_nid
>>>
>>> If the page_nid and the requested nid are the same, we will register
>>> the memblock under the node and return.
>>>
>>> But if get_nid_for_pfn() returns a different nid (meaning the
>>> memblock belongs to a different node), then the loop iterates
>>> over all PFNs of the memblock, checking whether the page_nid
>>> returned by get_nid_for_pfn() matches the requested nid.
>>>
>>> IIUC, since all PFNs of a memblock return the same page_nid, we
>>> should stop the loop if the page_nid returned does not match the
>>> requested nid.
>>
>> Nodes can easily partially span "memory blocks". So your patch would
>> break these setups?
>
>
> Does this mean one memory block can be part of two or
> more nodes? Some PFNs belong to one node, and the remaining
> PFNs belong to another node?
Exactly.
Consider the following qemu cmdline as one example:
qemu-system-x86_64 --enable-kvm -smp 10 -M q35 -m 4G -hda
Fedora-Server-KVM-40-1.14.x86_64.qcow2 -nographic -object
memory-backend-ram,size=2000M,id=mem0 -object
memory-backend-ram,size=2096M,id=mem1 -numa node,cpus=0-4,memdev=mem0
-numa node,cpus=5-9,memdev=mem1
Inside the VM:
[root@...alhost ~]# ls /sys/devices/system/node/node0/
compact cpu4 meminfo memory12 memory3 memory8 subsystem
cpu0 cpulist memory0 memory13 memory4 memory9 uevent
cpu1 cpumap memory1 memory14 memory5 memory_failure vmstat
cpu2 distance memory10 memory15 memory6 numastat x86
cpu3 hugepages memory11 memory2 memory7 power
[root@...alhost ~]# ls /sys/devices/system/node/node1/
compact cpu9 meminfo memory35 memory40 memory45 power
cpu5 cpulist memory15 memory36 memory41 memory46 subsystem
cpu6 cpumap memory32 memory37 memory42 memory47 uevent
cpu7 distance memory33 memory38 memory43 memory_failure vmstat
cpu8 hugepages memory34 memory39 memory44 numastat x86
Observe how memory15 shows up for both nodes.
[root@...alhost ~]# ls /sys/devices/system/memory/memory15/
node0 online phys_index removable subsystem valid_zones
node1 phys_device power state uevent
Observe how both nodes are listed.
[root@...alhost ~]# cat /sys/devices/system/memory/memory15/valid_zones
none
And "valid_zones = none" indicates that this memory block cannot be
offlined, because it spans multiple zones (here: zones from multiple
nodes).
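To make the failure mode concrete, here is a hedged userspace model of such a shared memory block (the layout and function names are invented for illustration): breaking out of the scan at the first mismatching pfn would never register the block under the second node.

```c
#include <assert.h>
#include <stdbool.h>

/* Mock layout matching the VM example above: one memory block whose
 * first half belongs to node 0 and second half to node 1 (hypothetical
 * pfn numbers, for illustration only). */
static int mock_pfn_to_nid(unsigned long pfn)
{
	return pfn < 500 ? 0 : 1;
}

/* Proposed shortcut: give up as soon as one pfn reports another nid. */
static bool block_has_nid_break_early(unsigned long nr_pfns, int nid)
{
	for (unsigned long pfn = 0; pfn < nr_pfns; pfn++) {
		if (mock_pfn_to_nid(pfn) == nid)
			return true;
		break;	/* proposed early exit: stop on first mismatch */
	}
	return false;
}

/* Current behaviour: keep scanning every pfn until a match is found. */
static bool block_has_nid_full_scan(unsigned long nr_pfns, int nid)
{
	for (unsigned long pfn = 0; pfn < nr_pfns; pfn++)
		if (mock_pfn_to_nid(pfn) == nid)
			return true;
	return false;
}
```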
>
> In that case, the current implementation adds the memory block to
> only one node. In register_mem_block_under_node_early(), if the
> first PFN returns the correct expected nid, the memory block will
> be registered under that node. It does not iterate over the other
> PFNs. Is this because of the assumption that one memory block
> cannot be part of multiple nodes?
See my example above. But note that my test VM has
]# uname -a
Linux localhost.localdomain 6.11.10-200.fc40.x86_64 #1 SMP
PREEMPT_DYNAMIC Sat Nov 23 00:53:13 UTC 2024 x86_64 GNU/Linux
>
> If one memory block cannot be part of multiple nodes, then we can
> break if get_nid_for_pfn() returns the wrong nid, right?
Again, see above. Hopefully that makes it clearer.
>
>
>>
>> But I agree that iterating all pages is rather nasty. I wonder if we
>> could just walk all memblocks in the range?
>>
>> early_pfn_to_nid()->__early_pfn_to_nid() would lookup the memblock ...
>> for each PFN. Testing a range instead could be better.
>>
>> Something like "early_pfn_range_spans_nid()" could be useful for that.
>
> Do you mean we should do it section by section within a memory block?
All we want to know is if the memblock allocator ("early") thinks that
any part of the memory block device (memory_block_size_bytes()) belongs
to that node.
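A userspace sketch of what such an "early_pfn_range_spans_nid()" check could look like, assuming a table of (pfn range, nid) regions stands in for the memblock allocator's data (hypothetical data and a simplified API, not the kernel implementation):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Minimal stand-in for the memblock region table (invented data; the
 * kernel would walk its own memblock regions instead). */
struct mock_region {
	unsigned long start_pfn, end_pfn;	/* end exclusive */
	int nid;
};

static const struct mock_region regions[] = {
	{ 0,    500,  0 },	/* node 0 */
	{ 500,  1000, 1 },	/* node 1 shares the same memory block */
	{ 1000, 4000, 1 },
};

/* Sketch of the suggested helper: instead of asking pfn_to_nid() for
 * every pfn, test whether any region of the given nid overlaps the
 * memory block's pfn range. Cost is O(regions), not O(pfns). */
static bool early_pfn_range_spans_nid(unsigned long start_pfn,
				      unsigned long end_pfn, int nid)
{
	for (size_t i = 0; i < sizeof(regions) / sizeof(regions[0]); i++) {
		if (regions[i].nid == nid &&
		    regions[i].start_pfn < end_pfn &&
		    regions[i].end_pfn > start_pfn)
			return true;
	}
	return false;
}
```

An overlap test like this also handles the shared-block case from the VM example naturally: a block spanning two nodes simply overlaps a region of each.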
--
Cheers,
David / dhildenb