Message-ID: <0f9a6f66-7cbc-4c0d-b12e-9eaacdf1bda8@amd.com>
Date: Thu, 13 Feb 2025 11:09:37 +0530
From: Raghavendra K T <raghavendra.kt@....com>
To: linux-mm@...ck.org, linux-kernel@...r.kernel.org, gourry@...rry.net,
nehagholkar@...a.com, abhishekd@...a.com, david@...hat.com,
ying.huang@...el.com, nphamcs@...il.com, akpm@...ux-foundation.org,
hannes@...xchg.org, feng.tang@...el.com, kbusch@...a.com, bharata@....com,
Hasan.Maruf@....com, sj@...nel.org, willy@...radead.org,
kirill.shutemov@...ux.intel.com, mgorman@...hsingularity.net,
vbabka@...e.cz, hughd@...gle.com, rientjes@...gle.com, shy828301@...il.com,
Liam.Howlett@...cle.com, peterz@...radead.org, mingo@...hat.com
Subject: Re: [RFC PATCH V0 0/10] mm: slowtier page promotion based on PTE A bit
On 2/12/2025 10:32 PM, Davidlohr Bueso wrote:
> On Sun, 01 Dec 2024, Raghavendra K T wrote:
>
>> 6. Holding PTE lock before migration.
>
> fyi I tried testing this series with 'perf-bench numa mem' and got a
> soft lockup,
> unable to take the PTL (and lost the machine to debug further atm), ie:
>
> [ 3852.217675] CPU: 127 UID: 0 PID: 12537 Comm: watch-numa-sche Tainted:
> G D L 6.14.0-rc2-kmmscand-v1+ #3
> [ 3852.217677] Tainted: [D]=DIE, [L]=SOFTLOCKUP
> [ 3852.217678] RIP: 0010:native_queued_spin_lock_slowpath+0x64/0x290
> [ 3852.217683] Code: 77 7b f0 0f ba 2b 08 0f 92 c2 8b 03 0f b6 d2 c1 e2
> 08 30 e4 09 d0 3d ff 00 00 00 77 57 85 c0 74 10 0f b6 03 84 c0 74 09 f3
> 90 <0f> b6 03 84 c0 75 f7 b8 01 00 00 00 66 89 03 5b 5d 41 5c 41 5d c3
> [ 3852.217684] RSP: 0018:ff274259b3c9f988 EFLAGS: 00000202
> [ 3852.217685] RAX: 0000000000000001 RBX: ffbd2efd8c08c9a8 RCX:
> 000ffffffffff000
> [ 3852.217686] RDX: 0000000000000000 RSI: 0000000000000001 RDI:
> ffbd2efd8c08c9a8
> [ 3852.217687] RBP: ff161328422c1328 R08: ff274259b3c9fb90 R09:
> ff161328422c1000
> [ 3852.217688] R10: 00000000ffffffff R11: 0000000000000004 R12:
> 00007f52cca00000
> [ 3852.217688] R13: ff274259b3c9fa00 R14: ff16132842326000 R15:
> ff161328422c1328
> [ 3852.217689] FS: 00007f32b6f92b80(0000) GS:ff161423bfd80000(0000)
> knlGS:0000000000000000
> [ 3852.217691] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 3852.217692] CR2: 0000564ddbf68008 CR3: 00000080a81cc005 CR4:
> 0000000000773ef0
> [ 3852.217693] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> 0000000000000000
> [ 3852.217694] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7:
> 0000000000000400
> [ 3852.217694] PKRU: 55555554
> [ 3852.217695] Call Trace:
> [ 3852.217696] <IRQ>
> [ 3852.217697] ? watchdog_timer_fn+0x21b/0x2a0
> [ 3852.217699] ? __pfx_watchdog_timer_fn+0x10/0x10
> [ 3852.217702] ? __hrtimer_run_queues+0x10f/0x2a0
> [ 3852.217704] ? hrtimer_interrupt+0xfb/0x240
> [ 3852.217706] ? __sysvec_apic_timer_interrupt+0x4e/0x110
> [ 3852.217709] ? sysvec_apic_timer_interrupt+0x68/0x90
> [ 3852.217712] </IRQ>
> [ 3852.217712] <TASK>
> [ 3852.217713] ? asm_sysvec_apic_timer_interrupt+0x16/0x20
> [ 3852.217717] ? native_queued_spin_lock_slowpath+0x64/0x290
> [ 3852.217720] _raw_spin_lock+0x25/0x30
> [ 3852.217723] __pte_offset_map_lock+0x9a/0x110
> [ 3852.217726] gather_pte_stats+0x1e3/0x2c0
> [ 3852.217730] walk_pgd_range+0x528/0xbb0
> [ 3852.217733] __walk_page_range+0x71/0x1d0
> [ 3852.217736] walk_page_vma+0x98/0xf0
> [ 3852.217738] show_numa_map+0x11a/0x3a0
> [ 3852.217741] seq_read_iter+0x2a6/0x470
> [ 3852.217745] seq_read+0x12b/0x170
> [ 3852.217748] vfs_read+0xe0/0x370
> [ 3852.217751] ? syscall_exit_to_user_mode+0x49/0x210
> [ 3852.217755] ? do_syscall_64+0x8a/0x190
> [ 3852.217758] ksys_read+0x6a/0xe0
> [ 3852.217762] do_syscall_64+0x7e/0x190
> [ 3852.217765] ? __memcg_slab_free_hook+0xd4/0x120
> [ 3852.217768] ? __x64_sys_close+0x38/0x80
> [ 3852.217771] ? kmem_cache_free+0x3bf/0x3e0
> [ 3852.217774] ? syscall_exit_to_user_mode+0x49/0x210
> [ 3852.217777] ? do_syscall_64+0x8a/0x190
> [ 3852.217780] ? do_syscall_64+0x8a/0x190
> [ 3852.217783] ? __irq_exit_rcu+0x3e/0xe0
> [ 3852.217785] entry_SYSCALL_64_after_hwframe+0x76/0x7e
Hello David,
Thanks for the report and the details. Reproducer information helps me
stabilize the code quickly. The micro-benchmark I used did not show any
issues. I will add the PTL locking and also check the issue from my side.
(With multiple scanning threads, it could cause even more issues because
of the higher migration pressure. I am wondering if I should go ahead with
a more stable single-thread scanning version in the next posting.)
Thanks and Regards
- Raghu