linux-kernel - Re: [PATCH] driver/base/node.c: Fix softlockups during the initialization of large systems with interleaved memory blocks

Message-ID: <2025031051-gab-viability-e288@gregkh>
Date: Mon, 10 Mar 2025 13:52:28 +0100
From: Greg Kroah-Hartman <gregkh@...uxfoundation.org>
To: Donet Tom <donettom@...ux.ibm.com>
Cc: linux-kernel@...r.kernel.org, David Hildenbrand <david@...hat.com>,
	Ritesh Harjani <ritesh.list@...il.com>,
	"Rafael J . Wysocki" <rafael@...nel.org>,
	Danilo Krummrich <dakr@...nel.org>
Subject: Re: [PATCH] driver/base/node.c: Fix softlockups during the
 initialization of large systems with interleaved memory blocks

On Mon, Mar 10, 2025 at 06:53:05AM -0500, Donet Tom wrote:
> On large systems with more than 64TB of DRAM, if the memory blocks
> are interleaved, node initialization (node_dev_init()) could take
> a long time since it iterates over each memory block. If the memory
> block belongs to the current iterating node, the first pfn_to_nid
> will provide the correct value. Otherwise, it will iterate over all
> PFNs and check the nid. On non-preemptive kernels, this can result
> in a watchdog softlockup warning. Even though CONFIG_PREEMPT_LAZY
> is enabled in kernels now [1], we may still need to fix older
> stable kernels to avoid encountering these kernel warnings during
> boot.
> 
> This patch adds a cond_resched() call in node_dev_init() to avoid
> this warning.
> 
> node_dev_init()
>     register_one_node
>         register_memory_blocks_under_node
>             walk_memory_blocks()
>                 register_mem_block_under_node_early
>                     get_nid_for_pfn
>                         early_pfn_to_nid
> 
> On my system, node4 has memory blocks ranging from memory30351
> to memory38524, plus memory128433. The memory blocks between
> memory38524 and memory128433 do not belong to this node.
> 
> In walk_memory_blocks() we iterate over all memory blocks
> from memory38524 to memory128433.
> In register_mem_block_under_node_early(), up to memory38524, the
> first pfn correctly returns the corresponding nid and the function
> returns from there. But after memory38524 and until memory128433,
> the loop iterates through each pfn and checks the nid. Since the nid
> does not match the required nid, the loop continues. This causes
> the soft lockups.
> 
> [1]: https://lore.kernel.org/linuxppc-dev/20241116192306.88217-1-sshegde@linux.ibm.com/
> Fixes: 2848a28b0a60 ("drivers/base/node: consolidate node device subsystem initialization in node_dev_init()")
> Signed-off-by: Donet Tom <donettom@...ux.ibm.com>
> ---
>  drivers/base/node.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/base/node.c b/drivers/base/node.c
> index 0ea653fa3433..107eb508e28e 100644
> --- a/drivers/base/node.c
> +++ b/drivers/base/node.c
> @@ -975,5 +975,6 @@ void __init node_dev_init(void)
>  		ret = register_one_node(i);
>  		if (ret)
>  			panic("%s() failed to add node: %d\n", __func__, ret);
> +		cond_resched();

That's a horrible hack, sorry, but no, we can't sprinkle this around in
random locations, especially as this is actually fixed by using a
different scheduler model as you say.

Why not just make the code faster so as to avoid the long time this
takes?

thanks,

greg k-h
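
The cost described in the quoted patch can be sketched with a small
userspace model. This is a simplified illustration, not the kernel code:
PFNS_PER_BLOCK, scan_block(), walk_blocks() and the nid_of_pfn[] array are
hypothetical names, with the array standing in for early_pfn_to_nid(). A
block owned by the node costs one lookup, while a block owned by another
node costs one lookup per PFN.

```c
#include <stddef.h>

/* Hypothetical block size for the model; real memory blocks are far larger. */
#define PFNS_PER_BLOCK 8

/*
 * Count the PFN lookups needed to check one block against nid.
 * The scan stops at the first PFN that belongs to nid, mirroring the
 * early return described in the patch; a block owned by another node
 * is scanned to the end.
 */
static size_t scan_block(const int *nid_of_pfn, size_t block, int nid)
{
	size_t lookups = 0;

	for (size_t i = 0; i < PFNS_PER_BLOCK; i++) {
		lookups++;
		if (nid_of_pfn[block * PFNS_PER_BLOCK + i] == nid)
			break;
	}
	return lookups;
}

/* Sum the lookups over the inclusive block range [first, last]. */
static size_t walk_blocks(const int *nid_of_pfn, size_t first, size_t last,
			  int nid)
{
	size_t total = 0;

	for (size_t b = first; b <= last; b++)
		total += scan_block(nid_of_pfn, b, nid);
	return total;
}
```

In this model, every foreign block between two blocks owned by the node
adds PFNS_PER_BLOCK lookups, so the total grows with the size of the
interleaved gap rather than with the amount of memory the node owns.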