Message-ID: <dc4c0d4e-a9a5-4fa5-b39d-4248fba26043@redhat.com>
Date: Thu, 10 Apr 2025 15:35:19 +1000
From: Gavin Shan <gshan@...hat.com>
To: Oscar Salvador <osalvador@...e.de>, Aditya Gupta <adityag@...ux.ibm.com>
Cc: linux-mm@...ck.org, Andrew Morton <akpm@...ux-foundation.org>,
Danilo Krummrich <dakr@...nel.org>, David Hildenbrand <david@...hat.com>,
Greg Kroah-Hartman <gregkh@...uxfoundation.org>,
Mahesh J Salgaonkar <mahesh@...ux.ibm.com>,
"Rafael J. Wysocki" <rafael@...nel.org>,
Sourabh Jain <sourabhjain@...ux.ibm.com>, linux-kernel@...r.kernel.org
Subject: Re: [REPORT] Softlockups on PowerNV with upstream
On 4/10/25 3:25 PM, Oscar Salvador wrote:
> On Wed, Apr 09, 2025 at 11:33:44PM +0530, Aditya Gupta wrote:
>> Hi,
>>
>> While booting the current upstream kernel, I consistently get softlockups on an IBM PowerNV system.
>>
>> I have tested it only on PowerNV systems, but other architectures/platforms
>> might be affected as well. PSeries systems don't have this issue, though.
>>
>> Bisect points to the following commit:
>>
>> commit 61659efdb35ce6c6ac7639342098f3c4548b794b
>> Author: Gavin Shan <gshan@...hat.com>
>> Date: Wed Mar 12 09:30:43 2025 +1000
>>
>> drivers/base/memory: improve add_boot_memory_block()
>>
> ...
>> Console log
>> -----------
>>
>> [ 2.783371] smp: Brought up 4 nodes, 256 CPUs
>> [ 2.783475] numa: Node 0 CPUs: 0-63
>> [ 2.783537] numa: Node 2 CPUs: 64-127
>> [ 2.783591] numa: Node 4 CPUs: 128-191
>> [ 2.783653] numa: Node 6 CPUs: 192-255
>> [ 2.804945] Memory: 735777792K/738197504K available (17536K kernel code, 5760K rwdata, 15232K rodata, 6528K init, 2517K bss, 1369664K reserved, 0K cma-reserved)
>
> If I am not mistaken this is ~700GB, and PowerNV uses 16MB as the section
> size, with sections_per_block == 1 (I think). That is roughly
> 738197504K / 16384K = 45056 sections, and as many potential memory blocks.
>
> The code before the mentioned commit was something like:
>
> 	for (nr = base_section_nr; nr < base_section_nr + sections_per_block; nr++)
> 		if (present_section_nr(nr))
> 			section_count++;
>
> 	if (section_count == 0)
> 		return 0;
> 	return add_memory_block(...);
>
> So, in the case of PowerNV, we just check one section at a time and either
> return or call add_memory_block() depending on whether it is present.
>
> Now, with the current code, things are different. We now have,
> in pseudocode:
>
> memory_dev_init:
> 	for (nr = 0; nr <= __highest_present_section_nr; nr += sections_per_block)
> 		ret = add_boot_memory_block(nr)
>
> add_boot_memory_block:
> 	for_each_present_section_nr(base_section_nr, nr) {
> 		if (nr >= (base_section_nr + sections_per_block))
> 			break;
>
> 		return add_memory_block();
> 	}
> 	return 0;
>
> The thing is that next_present_section_nr() (which is called in
> for_each_present_section_nr()) will loop until we find a present
> section.
> And then we will check whether the found section is beyond
> base_section_nr + sections_per_block (where sections_per_block = 1).
> If so, we skip add_memory_block.
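
For reference, the helpers involved look roughly like this (paraphrased
from mm/sparse.c; the exact kernel definitions may differ slightly):

	static inline unsigned long next_present_section_nr(unsigned long section_nr)
	{
		while (++section_nr <= __highest_present_section_nr) {
			if (present_section_nr(section_nr))
				return section_nr;
		}

		return -1;
	}

	#define for_each_present_section_nr(start, section_nr)			\
		for (section_nr = next_present_section_nr((start) - 1);	\
		     section_nr != -1;						\
		     section_nr = next_present_section_nr(section_nr))

The while loop in next_present_section_nr() is the linear scan in question.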
>
> Now, I think that the issue comes from for_each_present_section_nr()
> having to loop a long way until it finds a present section.
> The loop in memory_dev_init() then increments nr by only 1, which means
> that on the next iteration we may have to loop a long way all over again
> to find the next present section, and so on and so forth.
>
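To make the overhead concrete, here is a small standalone userspace model
of the two walks (all numbers and helper names are made up for
illustration; this is not kernel code):

	/* Standalone model of the scan overhead; all numbers are made up.
	 * section_present() stands in for present_section_nr(), and
	 * next_present() mimics next_present_section_nr()'s linear scan. */
	#include <stdio.h>
	#include <stdbool.h>

	#define NR_SECTIONS	16384
	#define HOLE_START	128
	#define HOLE_END	(NR_SECTIONS - 128)

	static bool section_present(long nr)
	{
		return nr < HOLE_START || nr >= HOLE_END;	/* one big hole */
	}

	static long next_present(long nr, long *steps)
	{
		for (nr++; nr < NR_SECTIONS; nr++) {
			(*steps)++;
			if (section_present(nr))
				return nr;
		}
		return -1;
	}

	int main(void)
	{
		long nr, found, steps;

		/* Current code: the outer loop advances one section at a
		 * time (sections_per_block == 1) and restarts the scan every
		 * iteration, so each section inside a hole rescans the rest
		 * of the hole: quadratic in the hole size. */
		steps = 0;
		for (nr = 0; nr < NR_SECTIONS; nr++) {
			found = next_present(nr - 1, &steps);
			if (found == -1 || found >= nr + 1)
				continue;	/* no present section in this block */
			/* add_memory_block() would be called here */
		}
		printf("restart each iteration: %ld scan steps\n", steps);

		/* Suggested fix: remember where the scan stopped, so all
		 * sections are visited once in total: linear. */
		steps = 0;
		for (nr = next_present(-1, &steps); nr != -1;
		     nr = next_present(nr, &steps))
			;	/* add_memory_block() once per present section */
		printf("remember position:      %ld scan steps\n", steps);
		return 0;
	}

With ~45,000 sections on this machine and potentially large holes between
the nodes' address ranges, the restart-per-iteration walk can reach
billions of present_section_nr() checks, which is presumably what trips
the soft-lockup watchdog.
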
> Maybe we can fix this by making add_boot_memory_block() report back to
> memory_dev_init() the section at which its scan stopped.
> Something like the following (only compile-tested):
>
Thanks, Oscar. You're correct that the overhead is introduced by for_each_present_section_nr().
I already have a fix, verified on an IBM Power9 machine where the issue can be
reproduced. Please see the attached patch; the idea is sketched below.
I'm running more tests of the fix on an ARM64 machine.
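
The idea is to let for_each_present_section_nr() drive the whole walk, so
that each present section is scanned exactly once. Roughly this shape (a
hypothetical sketch from memory; the attached patch is the authoritative
version):

	/* Hypothetical sketch only -- see the attached patch for the real
	 * fix. Walk the present sections once and create one device per
	 * memory block. */
	void __init memory_dev_init(void)
	{
		unsigned long block_id, prev_block_id = ULONG_MAX;
		unsigned long nr;
		int ret;

		/* ... bus registration etc. unchanged ... */

		for_each_present_section_nr(0, nr) {
			block_id = memory_block_id(nr);
			if (block_id == prev_block_id)
				continue;	/* section in an already-added block */

			prev_block_id = block_id;
			ret = add_memory_block(block_id, MEM_ONLINE, NULL, NULL);
			if (ret)
				panic("%s() failed to add memory block: %d\n",
				      __func__, ret);
		}
	}

The total cost is then linear in the number of present sections, no matter
how sparse the section layout is.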
Thanks,
Gavin
> diff --git a/drivers/base/memory.c b/drivers/base/memory.c
> index 8f3a41d9bfaa..d97635cbfd1d 100644
> --- a/drivers/base/memory.c
> +++ b/drivers/base/memory.c
> @@ -816,18 +816,25 @@ static int add_memory_block(unsigned long block_id, unsigned long state,
>  	return 0;
>  }
>  
> -static int __init add_boot_memory_block(unsigned long base_section_nr)
> +static int __init add_boot_memory_block(unsigned long *base_section_nr)
>  {
> +	int ret;
>  	unsigned long nr;
>  
> -	for_each_present_section_nr(base_section_nr, nr) {
> -		if (nr >= (base_section_nr + sections_per_block))
> +	for_each_present_section_nr(*base_section_nr, nr) {
> +		if (nr >= (*base_section_nr + sections_per_block))
>  			break;
>  
> -		return add_memory_block(memory_block_id(base_section_nr),
> -					MEM_ONLINE, NULL, NULL);
> +		ret = add_memory_block(memory_block_id(*base_section_nr),
> +				       MEM_ONLINE, NULL, NULL);
> +		*base_section_nr = nr;
> +		return ret;
>  	}
>  
> +	if (nr == -1)
> +		*base_section_nr = __highest_present_section_nr + 1;
> +	else
> +		*base_section_nr = nr - sections_per_block; /* outer loop adds it back */
>  	return 0;
>  }
>  
> @@ -973,9 +980,9 @@ void __init memory_dev_init(void)
>  	 * Create entries for memory sections that were found
>  	 * during boot and have been initialized
>  	 */
> -	for (nr = 0; nr <= __highest_present_section_nr;
> -	     nr += sections_per_block) {
> -		ret = add_boot_memory_block(nr);
> +	nr = first_present_section_nr();
> +	for (; nr <= __highest_present_section_nr; nr += sections_per_block) {
> +		ret = add_boot_memory_block(&nr);
>  		if (ret)
>  			panic("%s() failed to add memory block: %d\n", __func__,
>  			      ret);
>
>
> @Aditya: can you please give it a try?
>
Attachment: 0001-drivers-base-memory-Avoid-overhead-for_each_present_.patch (text/x-patch, 4736 bytes)