Message-ID: <0f89dd7e-937b-4575-855b-561ff6e932e5@redhat.com>
Date: Thu, 10 Apr 2025 11:35:30 +1000
From: Gavin Shan <gshan@...hat.com>
To: Aditya Gupta <adityag@...ux.ibm.com>, linux-mm@...ck.org
Cc: Andrew Morton <akpm@...ux-foundation.org>,
 Danilo Krummrich <dakr@...nel.org>, David Hildenbrand <david@...hat.com>,
 Greg Kroah-Hartman <gregkh@...uxfoundation.org>,
 Mahesh J Salgaonkar <mahesh@...ux.ibm.com>,
 Oscar Salvador <osalvador@...e.de>, "Rafael J. Wysocki" <rafael@...nel.org>,
 Sourabh Jain <sourabhjain@...ux.ibm.com>, linux-kernel@...r.kernel.org,
 Gavin Shan <gshan@...hat.com>, Gavin Shan <shan.gavin@...il.com>
Subject: Re: [REPORT] Softlockups on PowerNV with upstream

Hi Aditya,

On 4/10/25 4:03 AM, Aditya Gupta wrote:
> 
> While booting the current upstream kernel, I consistently get soft lockups on an IBM PowerNV system.
> 
> I have tested it only on PowerNV systems, but other architectures/platforms
> might also be affected. PSeries systems don't have this issue, though.
> 
> Bisect points to the following commit:
> 
>      commit 61659efdb35ce6c6ac7639342098f3c4548b794b
>      Author: Gavin Shan <gshan@...hat.com>
>      Date:   Wed Mar 12 09:30:43 2025 +1000
> 
>          drivers/base/memory: improve add_boot_memory_block()
> 
>          Patch series "drivers/base/memory: Two cleanups", v3.
> 
>          Two cleanups to drivers/base/memory.
> 
> 
>          This patch (of 2):
> 
>          It's unnecessary to count the present sections for the specified block
>          since the block will be added if any section in the block is present.
>          Besides, for_each_present_section_nr() can be reused as Andrew Morton
>          suggested.
> 
>          Improve by using for_each_present_section_nr() and dropping the
>          unnecessary @section_count.
> 
>          No functional changes intended.
> 
>          ...
> 
> Pasted the console log, bisect log, and the kernel config, below.
> 

I don't see how 61659efdb35ce ("drivers/base/memory: improve add_boot_memory_block()")
introduces any functional change. Could you revert it on top of v6.15-rc1 and confirm
whether the RCU stall and soft lockup still occur?
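
In case it helps, the revert could be done roughly as sketched below. The helper
name revert_on_base and the tree path in the usage comment are only placeholders
for illustration, not anything that exists in the kernel tree.

```shell
# Hypothetical helper: check out a base ref (detached) and revert one commit on top.
# Arguments: <path to git tree> <base ref> <commit to revert>
revert_on_base() {
    tree=$1 base=$2 commit=$3
    # Detach HEAD at the base so no existing branch is modified.
    git -C "$tree" checkout --quiet --detach "$base" || return 1
    # Create the revert commit without opening an editor.
    git -C "$tree" revert --no-edit "$commit" || return 1
    # Show the resulting top commit for a quick sanity check.
    git -C "$tree" log --oneline -1
}

# Example invocation (the path is an assumption):
# revert_on_base /path/to/linux v6.15-rc1 61659efdb35ce
```

After that, rebuild with the same config and boot the result on the affected machine.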

At present, I don't have access to a Power10 machine, but I will check around.

> Thanks,
> - Aditya G
> 
> Console log
> -----------
> 
>      [    2.783371] smp: Brought up 4 nodes, 256 CPUs
>      [    2.783475] numa: Node 0 CPUs: 0-63
>      [    2.783537] numa: Node 2 CPUs: 64-127
>      [    2.783591] numa: Node 4 CPUs: 128-191
>      [    2.783653] numa: Node 6 CPUs: 192-255
>      [    2.804945] Memory: 735777792K/738197504K available (17536K kernel code, 5760K rwdata, 15232K rodata, 6528K init, 2517K bss, 1369664K reserved, 0K cma-reserved)

The NUMA node numbers skip by one (0, 2, 4, 6). Judging by the "Memory:" line, the
machine has about 704GB of memory (738197504K), if I'm correct.
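
To double-check the total, the figure in the "Memory:" line can be converted from
KiB to GiB. This is only a quick sketch; the function name kib_to_gib is made up
for illustration.

```shell
# Convert a size in KiB (as printed in the "Memory:" boot line) to whole GiB.
kib_to_gib() {
    echo $(( $1 / 1024 / 1024 ))
}

# Total memory reported in the console log quoted above.
kib_to_gib 738197504   # prints 704
```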

>      [    2.892969] devtmpfs: initialized
>      [   24.057853] watchdog: BUG: soft lockup - CPU#248 stuck for 22s! [swapper/248:1]
>      [   24.057861] Modules linked in:
>      [   24.057872] CPU: 248 UID: 0 PID: 1 Comm: swapper/248 Not tainted 6.15.0-rc1-next-20250408 #1 VOLUNTARY
>      [   24.057879] Hardware name: 9105-22A POWER10 (raw) 0x800200 opal:v7.1-107-gfda75d121942 PowerNV
>      [   24.057883] NIP:  c00000000209218c LR: c000000002092204 CTR: 0000000000000000
>      [   24.057886] REGS: c00040000418fa30 TRAP: 0900   Not tainted  (6.15.0-rc1-next-20250408)
>      [   24.057891] MSR:  9000000002009033 <SF,HV,VEC,EE,ME,IR,DR,RI,LE>  CR: 28000428  XER: 00000000
>      [   24.057904] CFAR: 0000000000000000 IRQMASK: 0
>      [   24.057904] GPR00: c000000002092204 c00040000418fcd0 c000000001b08100 0000000000000040
>      [   24.057904] GPR04: 0000000000013e00 c000c03ffebabb00 0000000000c03fff c000400fff587f80
>      [   24.057904] GPR08: 0000000000000000 00000000001196f7 0000000000000000 0000000028000428
>      [   24.057904] GPR12: 0000000000000000 c000000002e80000 c00000000001007c 0000000000000000
>      [   24.057904] GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
>      [   24.057904] GPR20: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
>      [   24.057904] GPR24: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
>      [   24.057904] GPR28: c000000002df7f70 0000000000013dc0 c0000000011dd898 0000000008000000
>      [   24.057948] NIP [c00000000209218c] memory_dev_init+0x114/0x1e0
>      [   24.057963] LR [c000000002092204] memory_dev_init+0x18c/0x1e0
>      [   24.057968] Call Trace:
>      [   24.057970] [c00040000418fcd0] [c000000002092204] memory_dev_init+0x18c/0x1e0 (unreliable)
>      [   24.057976] [c00040000418fd50] [c000000002091348] driver_init+0x78/0xa4
>      [   24.057981] [c00040000418fd70] [c0000000020063ac] kernel_init_freeable+0x22c/0x370
>      [   24.057989] [c00040000418fde0] [c0000000000100a8] kernel_init+0x34/0x25c
>      [   24.057996] [c00040000418fe50] [c00000000000cd94] ret_from_kernel_user_thread+0x14/0x1c
>      [   24.058004] --- interrupt: 0 at 0x0
>      [   24.058010] Code: 7fa9eb78 e8aa0000 2fa50000 60000000 60420000 7c29f840 792aaac2 40800034 419e0030 794a1f24 7d45502a 2c2a0000 <41820020> 79282c28 7cea4214 2c270000
>      ...
>      [   62.952729] rcu: INFO: rcu_sched self-detected stall on CPU
>      [   62.952782] rcu:     248-....: (5999 ticks this GP) idle=5884/1/0x4000000000000002 softirq=81/81 fqs=1997
>      [   62.952965] rcu:     (t=6000 jiffies g=-1015 q=1 ncpus=256)
>      [   62.953050] CPU: 248 UID: 0 PID: 1 Comm: swapper/248 Tainted: G             L      6.15.0-rc1-next-20250408 #1 VOLUNTARY
>      [   62.953055] Tainted: [L]=SOFTLOCKUP
>      [   62.953057] Hardware name: 9105-22A POWER10 (raw) 0x800200 opal:v7.1-107-gfda75d121942 PowerNV
>      [   62.953059] NIP:  c000000002092180 LR: c000000002092204 CTR: 0000000000000000
>      [   62.953062] REGS: c00040000418fa30 TRAP: 0900   Tainted: G             L       (6.15.0-rc1-next-20250408)
>      [   62.953065] MSR:  9000000002009033 <SF,HV,VEC,EE,ME,IR,DR,RI,LE>  CR: 88000428  XER: 00000000
>      [   62.953076] CFAR: 0000000000000000 IRQMASK: 0
>      [   62.953076] GPR00: c000000002092204 c00040000418fcd0 c000000001b08100 0000000000000040
>      [   62.953076] GPR04: 0000000000035940 c000c03ffebabb00 0000000000c03fff c000400fff587f80
>      [   62.953076] GPR08: 0000000000000000 00000000002c390b 0000000000000587 0000000028000428
>      [   62.953076] GPR12: 0000000000000000 c000000002e80000 c00000000001007c 0000000000000000
>      [   62.953076] GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
>      [   62.953076] GPR20: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
>      [   62.953076] GPR24: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
>      [   62.953076] GPR28: c000000002df7f70 0000000000035900 c0000000011dd898 0000000008000000
>      [   62.953117] NIP [c000000002092180] memory_dev_init+0x108/0x1e0
>      [   62.953121] LR [c000000002092204] memory_dev_init+0x18c/0x1e0
>      [   62.953125] Call Trace:
>      [   62.953126] [c00040000418fcd0] [c000000002092204] memory_dev_init+0x18c/0x1e0 (unreliable)
>      [   62.953131] [c00040000418fd50] [c000000002091348] driver_init+0x78/0xa4
>      [   62.953135] [c00040000418fd70] [c0000000020063ac] kernel_init_freeable+0x22c/0x370
>      [   62.953141] [c00040000418fde0] [c0000000000100a8] kernel_init+0x34/0x25c
>      [   62.953146] [c00040000418fe50] [c00000000000cd94] ret_from_kernel_user_thread+0x14/0x1c
>      [   62.953152] --- interrupt: 0 at 0x0
>      [   62.953155] Code: 4181ffe8 3d22012f 3949fe68 7fa9eb78 e8aa0000 2fa50000 60000000 60420000 7c29f840 792aaac2 40800034 419e0030 <794a1f24> 7d45502a 2c2a0000 41820020
> 
> Bisect Log
> ----------
> 
>      git bisect start
>      # status: waiting for both good and bad commits
>      # good: [38fec10eb60d687e30c8c6b5420d86e8149f7557] Linux 6.14
>      git bisect good 38fec10eb60d687e30c8c6b5420d86e8149f7557
>      # status: waiting for bad commit, 1 good commit known
>      # bad: [7702d0130dc002bab2c3571ddb6ff68f82d99aea] Add linux-next specific files for 20250408
>      git bisect bad 7702d0130dc002bab2c3571ddb6ff68f82d99aea
>      # good: [390513642ee6763c7ada07f0a1470474986e6c1c] io_uring: always do atomic put from iowq
>      git bisect good 390513642ee6763c7ada07f0a1470474986e6c1c
>      # bad: [eb0ece16027f8223d5dc9aaf90124f70577bd22a] Merge tag 'mm-stable-2025-03-30-16-52' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
>      git bisect bad eb0ece16027f8223d5dc9aaf90124f70577bd22a
>      # good: [7d06015d936c861160803e020f68f413b5c3cd9d] Merge tag 'pci-v6.15-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci
>      git bisect good 7d06015d936c861160803e020f68f413b5c3cd9d
>      # good: [fa593d0f969dcfa41d390822fdf1a0ab48cd882c] Merge tag 'bpf-next-6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
>      git bisect good fa593d0f969dcfa41d390822fdf1a0ab48cd882c
>      # good: [f64a72bc767f6e9ddb18fdacaeb99708c4810ada] Merge tag 'v6.15rc-part1-ksmbd-server-fixes' of git://git.samba.org/ksmbd
>      git bisect good f64a72bc767f6e9ddb18fdacaeb99708c4810ada
>      # good: [a14efee04796dd3f614eaf5348ca1ac099c21349] mm/page_alloc: clarify should_claim_block() commentary
>      git bisect good a14efee04796dd3f614eaf5348ca1ac099c21349
>      # good: [f0e11a997ab438ce91a7dc9a6dd64c0c4a6af112] mm/vmalloc: refactor __vmalloc_node_range_noprof()
>      git bisect good f0e11a997ab438ce91a7dc9a6dd64c0c4a6af112
>      # bad: [735b3f7e773bd09d459537562754debd1f8e816b] selftests/mm: uffd-unit-tests support for hugepages > 2M
>      git bisect bad 735b3f7e773bd09d459537562754debd1f8e816b
>      # bad: [d2734f044f84833b2c9ec1b71b542d299d35202b] mm: memory-failure: enhance comments for return value of memory_failure()
>      git bisect bad d2734f044f84833b2c9ec1b71b542d299d35202b
>      # bad: [61659efdb35ce6c6ac7639342098f3c4548b794b] drivers/base/memory: improve add_boot_memory_block()
>      git bisect bad 61659efdb35ce6c6ac7639342098f3c4548b794b
>      # good: [58729c04cf1092b87aeef0bf0998c9e2e4771133] mm/huge_memory: add buddy allocator like (non-uniform) folio_split()
>      git bisect good 58729c04cf1092b87aeef0bf0998c9e2e4771133
>      # good: [80a5c494c89f73907ed659a9233a70253774cdae] selftests/mm: add tests for folio_split(), buddy allocator like split
>      git bisect good 80a5c494c89f73907ed659a9233a70253774cdae
>      # good: [d53c78fffe7ad364397c693522ceb4d152c2aacd] mm/shmem: use xas_try_split() in shmem_split_large_entry()
>      git bisect good d53c78fffe7ad364397c693522ceb4d152c2aacd
>      # good: [c637c61c9ed0203d9a1f2ba21fb7a49ddca3ef8f] mm/damon/sysfs-schemes: avoid Wformat-security warning on damon_sysfs_access_pattern_add_range_dir()
>      git bisect good c637c61c9ed0203d9a1f2ba21fb7a49ddca3ef8f
>      # first bad commit: [61659efdb35ce6c6ac7639342098f3c4548b794b] drivers/base/memory: improve add_boot_memory_block()
> 
> To Reproduce the issue
> ----------------------
> 
> Build the upstream kernel and boot it on PowerNV Power10 hardware.
> 
> Kernel config
> -------------
> 
> This should occur with any default config you may have, or you can use the following:
> 
> https://gist.github.com/adi-g15-ibm/6eb03cea2c6202e5eb017abd3819a491
> 
> CC list
> -------
> 
> Cc: Andrew Morton <akpm@...ux-foundation.org>
> Cc: Danilo Krummrich <dakr@...nel.org>
> Cc: David Hildenbrand <david@...hat.com>
> Cc: Greg Kroah-Hartman <gregkh@...uxfoundation.org>
> Cc: Mahesh J Salgaonkar <mahesh@...ux.ibm.com>
> Cc: Oscar Salvador <osalvador@...e.de>
> Cc: "Rafael J. Wysocki" <rafael@...nel.org>
> Cc: Sourabh Jain <sourabhjain@...ux.ibm.com>
> Cc: linux-kernel@...r.kernel.org
> To: linux-mm@...ck.org
> 

