Message-ID: <aLPI77sNG6IKlZjj@harry>
Date: Sun, 31 Aug 2025 13:00:47 +0900
From: Harry Yoo <harry.yoo@...cle.com>
To: Yunseong Kim <ysk@...lloc.com>
Cc: Vlastimil Babka <vbabka@...e.cz>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Mel Gorman <mgorman@...e.de>, Thomas Gleixner <tglx@...utronix.de>,
        Sebastian Andrzej Siewior <bigeasy@...utronix.de>,
        Clark Williams <clrkwllms@...nel.org>,
        Steven Rostedt <rostedt@...dmis.org>,
        Christoph Lameter <cl@...two.org>,
        David Rientjes <rientjes@...gle.com>,
        Roman Gushchin <roman.gushchin@...ux.dev>, linux-mm@...ck.org,
        linux-rt-devel@...ts.linux.dev, linux-kernel@...r.kernel.org,
        Yeoreum Yun <yeoreum.yun@....com>, Byungchul Park <byungchul@...com>,
        "max.byungchul.park@...il.com" <max.byungchul.park@...il.com>,
        vvghjk1234@...il.com
Subject: Re: [BUG] mm/slub: Contention caused by kmem_cache_node->list_lock
 on PREEMPT_RT

On Sun, Aug 31, 2025 at 10:28:26AM +0900, Yunseong Kim wrote:
> I've been analyzing a system-critical contention issue observed on a
> PREEMPT_RT-enabled kernel (based on v6.17-rc3). The issue stems from
> internal lock contention within the SLUB allocator, particularly under
> high memory pressure scenarios such as massive RCU callback processing.
> 
> In PREEMPT_RT configurations, spinlock_t is implemented as a sleepable
> RT-Mutex. The kmem_cache_node->list_lock in SLUB protects the node-level
> partial slab lists and is a very hot lock, frequently accessed during the
> memory freeing path.
> 
> When multiple CPUs contend heavily for this lock, the overhead associated
> with RT-Mutexes (context switching, priority inheritance, sleeping/waking)
> causes severe stagnation in the memory subsystem's progress.
> 
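To make the discussion easier to follow, here is a rough, simplified sketch
of the structure in question and of how the lock behaves on PREEMPT_RT (the
real definition in mm/slub.c has more fields and differs across versions):

  /* Simplified view of SLUB's per-node structure, not the full definition. */
  struct kmem_cache_node {
          spinlock_t list_lock;           /* protects the fields below */
          unsigned long nr_partial;       /* number of partial slabs */
          struct list_head partial;       /* node-level partial slab list */
          /* debug-only fields omitted */
  };

  /*
   * On !PREEMPT_RT, taking list_lock disables interrupts and spins.
   * On PREEMPT_RT, spinlock_t is backed by an rt_mutex, so the same
   * spin_lock_irqsave() call may sleep and go through the
   * rtlock_slowlock() path seen in the backtraces below.
   */
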
> I observed a scenario where a task performing memory compaction became
> indefinitely stuck waiting for a folio lock.

Waiting for 143 seconds for a folio lock sounds pretty wild.
(Yunseong told me in an off-list conversation that he had turned on a
couple of debug options, though.)

> It appears the owner of the folio lock was also stalled due to the contention
> on SLUB's list_lock, resulting in system-wide contention.

Based on the stack traces you provided, I don't think we can conclude
that "the folio lock was stalled because of the contention on SLUB's
list_lock".

It's both true that 1) the task is stuck waiting for the folio lock and
2) other CPUs were processing RCU callbacks, but there is no data that
indicates how much node->list_lock contention contributed to the stall.

...and since there are only 4 CPUs in the reported system, I don't think
that's enough to cause a stall this long.

> The task is stuck in migrate_pages_batch waiting for a folio lock during 
> memory compaction:
> 
>  INFO: task hung in migrate_pages_batch
>  INFO: task syz.7.2768:30677 blocked for more than 143 seconds.
>        Not tainted 6.17.0-rc3-00269-g11e7861d680c-dirty #73
>  "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>  task:syz.7.2768      state:D stack:0     pid:30677 tgid:30673 ppid:20684  task_flags:0x400040 flags:0x00000011
>  Call trace:
>   __switch_to+0x2b4/0x474 arch/arm64/kernel/process.c:741 (T)
>   context_switch kernel/sched/core.c:5357 [inline]
>   __schedule+0x6c4/0xe2c kernel/sched/core.c:6961
>   __schedule_loop kernel/sched/core.c:7043 [inline]
>   schedule+0x50/0xf0 kernel/sched/core.c:7058
>   io_schedule+0x38/0xa0 kernel/sched/core.c:7903
>   folio_wait_bit_common+0x360/0x698 mm/filemap.c:1317
>   __folio_lock+0x2c/0x3c mm/filemap.c:1675
>   folio_lock include/linux/pagemap.h:1133 [inline]
>   migrate_folio_unmap mm/migrate.c:1246 [inline]
>   migrate_pages_batch+0x448/0x1a80 mm/migrate.c:1873
>   migrate_pages_sync mm/migrate.c:2023 [inline]
>   migrate_pages+0x101c/0x14f4 mm/migrate.c:2105
>   compact_zone+0x1044/0x1ae8 mm/compaction.c:2647
>   compact_node mm/compaction.c:2916 [inline]
>   compact_nodes mm/compaction.c:2938 [inline]
>   sysctl_compaction_handler+0x244/0x3f4 mm/compaction.c:2989
>   proc_sys_call_handler+0x224/0x3fc fs/proc/proc_sysctl.c:600
>   proc_sys_write+0x2c/0x3c fs/proc/proc_sysctl.c:626
>   do_iter_readv_writev+0x314/0x3e0 fs/read_write.c:-1
>   vfs_writev+0x194/0x470 fs/read_write.c:1057
>   do_pwritev fs/read_write.c:1153 [inline]
>   __do_sys_pwritev2 fs/read_write.c:1211 [inline]
>   __se_sys_pwritev2 fs/read_write.c:1202 [inline]
>   __arm64_sys_pwritev2+0xf0/0x194 fs/read_write.c:1202
>   __invoke_syscall arch/arm64/kernel/syscall.c:35 [inline]
>   invoke_syscall+0x64/0x168 arch/arm64/kernel/syscall.c:49
>   el0_svc_common+0xb4/0x164 arch/arm64/kernel/syscall.c:132
>   do_el0_svc+0x2c/0x3c arch/arm64/kernel/syscall.c:151
>   el0_svc+0x40/0x144 arch/arm64/kernel/entry-common.c:879
>   el0t_64_sync_handler+0x84/0x12c arch/arm64/kernel/entry-common.c:898
>   el0t_64_sync+0x1b8/0x1bc arch/arm64/kernel/entry.S:596

This is stuck while waiting for the folio to be unlocked.

>  NMI backtrace for cpu 0
>  CPU: 0 UID: 0 PID: 55 Comm: khungtaskd Kdump: loaded Not tainted 6.17.0-rc3-00269-g11e7861d680c-dirty #73 PREEMPT_{RT,(full)} 
>  Hardware name: QEMU KVM Virtual Machine, BIOS 2025.02-8ubuntu1 06/11/2025
>  Call trace:
>   show_stack+0x2c/0x3c arch/arm64/kernel/stacktrace.c:499 (C)
>   __dump_stack+0x30/0x40 lib/dump_stack.c:94
>   dump_stack_lvl+0x148/0x1d8 lib/dump_stack.c:120
>   dump_stack+0x1c/0x3c lib/dump_stack.c:129
>   nmi_cpu_backtrace+0x278/0x31c lib/nmi_backtrace.c:113
>   nmi_trigger_cpumask_backtrace+0x134/0x2cc lib/nmi_backtrace.c:62
>   arch_trigger_cpumask_backtrace+0x30/0x40 arch/arm64/kernel/smp.c:936
>   trigger_all_cpu_backtrace include/linux/nmi.h:160 [inline]
>   check_hung_uninterruptible_tasks kernel/hung_task.c:328 [inline]
>   watchdog+0x858/0x890 kernel/hung_task.c:491
>   kthread+0x314/0x384 kernel/kthread.c:463
>   ret_from_fork+0x10/0x20 arch/arm64/kernel/entry.S:844

>  Sending NMI from CPU 0 to CPUs 1-3:
>  NMI backtrace for cpu 1
>  CPU: 1 UID: 0 PID: 28 Comm: rcuc/1 Kdump: loaded Not tainted 6.17.0-rc3-00269-g11e7861d680c-dirty #73 PREEMPT_{RT,(full)} 
>  Hardware name: QEMU KVM Virtual Machine, BIOS 2025.02-8ubuntu1 06/11/2025
>  pstate: 63400005 (nZCv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--)
>  pc : __raw_spin_unlock_irq include/linux/spinlock_api_smp.h:160 [inline]
>  pc : _raw_spin_unlock_irq+0x18/0x70 kernel/locking/spinlock.c:202
>  lr : __raw_spin_unlock_irq include/linux/spinlock_api_smp.h:158 [inline]
>  lr : _raw_spin_unlock_irq+0x10/0x70 kernel/locking/spinlock.c:202
>  sp : ffff800089f0fa40
>  x29: ffff800089f0fa40 x28: ffff0000c0004640 x27: 0000000000000000
>  x26: 0000000000000000 x25: 0000000000001000 x24: ffff8000855f169c
>  x23: 0000000000000001 x22: ffff800089f0fa68 x21: ffff800089f0fb48
>  x20: ffff0000c11e6440 x19: ffff0000c0004640 x18: 000000651c596d12
>  x17: 0000000000000000 x16: 0000000000000008 x15: 0000000000000000
>  x14: 0000000000000000 x13: 0000000000000010 x12: ffff80008b0cfa68
>  x11: 0000000000000008 x10: 00000000ffffffff x9 : ffffffffffffffff
>  x8 : 0000000000000000 x7 : bbbbbbbbbbbbbbbb x6 : 392e39383520205b
>  x5 : ffff8000806849f4 x4 : 0000000000000001 x3 : 0000000000000010
>  x2 : ffff800089f0fa68 x1 : ffff0000c11e6440 x0 : ffff0000c0004640
>  Call trace:
>   __daif_local_irq_enable arch/arm64/include/asm/irqflags.h:26 [inline] (P)
>   arch_local_irq_enable arch/arm64/include/asm/irqflags.h:48 [inline] (P)
>   __raw_spin_unlock_irq include/linux/spinlock_api_smp.h:159 [inline] (P)
>   _raw_spin_unlock_irq+0x18/0x70 kernel/locking/spinlock.c:202 (P)
>   raw_spin_unlock_irq_wake include/linux/sched/wake_q.h:82 [inline]
>   rtlock_slowlock_locked+0xcb0/0xe0c kernel/locking/rtmutex.c:1864
>   rtlock_slowlock kernel/locking/rtmutex.c:1895 [inline]

This is rtmutex slowpath.

>   rtlock_lock kernel/locking/spinlock_rt.c:43 [inline]
>   __rt_spin_lock kernel/locking/spinlock_rt.c:49 [inline]
>   rt_spin_lock+0x6c/0xe0 kernel/locking/spinlock_rt.c:57
>   spin_lock include/linux/spinlock_rt.h:44 [inline]
>   free_to_partial_list+0x74/0x5bc mm/slub.c:4427
>   __slab_free+0x208/0x254 mm/slub.c:4498
>   do_slab_free mm/slub.c:4632 [inline]
>   slab_free mm/slub.c:4681 [inline]
>   kmem_cache_free+0x320/0x5f0 mm/slub.c:4782
>   mem_pool_free mm/kmemleak.c:508 [inline]
>   free_object_rcu+0x104/0x11c mm/kmemleak.c:536
>   rcu_do_batch kernel/rcu/tree.c:2605 [inline]
>   rcu_core kernel/rcu/tree.c:2861 [inline]
>   rcu_cpu_kthread+0x404/0xcd0 kernel/rcu/tree.c:2949
>   smpboot_thread_fn+0x270/0x474 kernel/smpboot.c:160
>   kthread+0x314/0x384 kernel/kthread.c:463
>   ret_from_fork+0x10/0x20 arch/arm64/kernel/entry.S:844

>  NMI backtrace for cpu 3
>  CPU: 3 UID: 0 PID: 44 Comm: rcuc/3 Kdump: loaded Not tainted 6.17.0-rc3-00269-g11e7861d680c-dirty #73 PREEMPT_{RT,(full)} 
>  Hardware name: QEMU KVM Virtual Machine, BIOS 2025.02-8ubuntu1 06/11/2025
>  pstate: 03400005 (nzcv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--)
>  pc : finish_task_switch+0xb0/0x308 kernel/sched/core.c:5225
>  lr : raw_spin_rq_unlock kernel/sched/core.c:680 [inline]
>  lr : raw_spin_rq_unlock_irq kernel/sched/sched.h:1530 [inline]
>  lr : finish_lock_switch kernel/sched/core.c:5105 [inline]
>  lr : finish_task_switch+0xa8/0x308 kernel/sched/core.c:5223
>  sp : ffff80008b0cf940
>  x29: ffff80008b0cf950 x28: ffff0000c11fab28 x27: ffff800088169df0
>  x26: ffff0000c11fa440 x25: ffff8000855e93b4 x24: ffff800088169df0
>  x23: 0000000000001000 x22: 0000000000000000 x21: bcf48000855e8a30
>  x20: ffff0000c11fa440 x19: ffff0000c1301440 x18: 00000062aca2a5d2
>  x17: fffffffffffff63c x16: 0000000000200b20 x15: 0000000000135c81
>  x14: ffff800088169df0 x13: 000000000132d6aa x12: 00000000001f4245
>  x11: 0000000000000000 x10: 00000000ffffffff x9 : 0000000000000001
>  x8 : 0000000100000001 x7 : bbbbbbbbbbbbbbbb x6 : ffff8000a954fd48
>  x5 : 0000000000000001 x4 : 0000000000001000 x3 : ffff0000c11fa440
>  x2 : 0000000000000001 x1 : ffff80008741f318 x0 : 0000000000000001
>  Call trace:
>   __daif_local_irq_enable arch/arm64/include/asm/irqflags.h:26 [inline] (P)
>   arch_local_irq_enable arch/arm64/include/asm/irqflags.h:48 [inline] (P)
>   raw_spin_rq_unlock_irq kernel/sched/sched.h:1531 [inline] (P)
>   finish_lock_switch kernel/sched/core.c:5105 [inline] (P)
>   finish_task_switch+0xb0/0x308 kernel/sched/core.c:5223 (P)
>   context_switch kernel/sched/core.c:5360 [inline]
>   __schedule+0x6c8/0xe2c kernel/sched/core.c:6961
>   __schedule_loop kernel/sched/core.c:7043 [inline]
>   schedule_rtlock+0x24/0x44 kernel/sched/core.c:7122
>   rtlock_slowlock_locked+0xd20/0xe0c kernel/locking/rtmutex.c:1868
>   rtlock_slowlock kernel/locking/rtmutex.c:1895 [inline]
>   rtlock_lock kernel/locking/spinlock_rt.c:43 [inline]

This is also rtmutex slowpath.

>   __rt_spin_lock kernel/locking/spinlock_rt.c:49 [inline]
>   rt_spin_lock+0x6c/0xe0 kernel/locking/spinlock_rt.c:57
>   spin_lock include/linux/spinlock_rt.h:44 [inline]
>   free_to_partial_list+0x74/0x5bc mm/slub.c:4427
>   __slab_free+0x208/0x254 mm/slub.c:4498
>   do_slab_free mm/slub.c:4632 [inline]
>   slab_free mm/slub.c:4681 [inline]
>   kmem_cache_free+0x320/0x5f0 mm/slub.c:4782
>   mem_pool_free mm/kmemleak.c:508 [inline]
>   free_object_rcu+0x104/0x11c mm/kmemleak.c:536
>   rcu_do_batch kernel/rcu/tree.c:2605 [inline]
>   rcu_core kernel/rcu/tree.c:2861 [inline]
>   rcu_cpu_kthread+0x404/0xcd0 kernel/rcu/tree.c:2949
>   smpboot_thread_fn+0x270/0x474 kernel/smpboot.c:160
>   kthread+0x314/0x384 kernel/kthread.c:463
>   ret_from_fork+0x10/0x20 arch/arm64/kernel/entry.S:844
>  NMI backtrace for cpu 2
>  CPU: 2 UID: 0 PID: 36 Comm: rcuc/2 Kdump: loaded Not tainted 6.17.0-rc3-00269-g11e7861d680c-dirty #73 PREEMPT_{RT,(full)} 
>  Hardware name: QEMU KVM Virtual Machine, BIOS 2025.02-8ubuntu1 06/11/2025
>  pstate: 63400005 (nZCv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--)
>  pc : __raw_spin_unlock_irqrestore include/linux/spinlock_api_smp.h:152 [inline]
>  pc : _raw_spin_unlock_irqrestore+0x2c/0x80 kernel/locking/spinlock.c:194
>  lr : __raw_spin_unlock_irqrestore include/linux/spinlock_api_smp.h:150 [inline]
>  lr : _raw_spin_unlock_irqrestore+0x18/0x80 kernel/locking/spinlock.c:194
>  sp : ffff80008afafa80
>  x29: ffff80008afafa80 x28: ffff0000c0004640 x27: 0000000000000000
>  x26: ffff8000806849f4 x25: ffff8002cf620000 x24: ffff0000c11f0440
>  x23: 0000000000000000 x22: 0000000000000000 x21: ffff0000c11fad30
>  x20: 0000000000000008 x19: 0000000000000000 x18: ffff80008565bf38
>  x17: 0000000000000006 x16: 0000000000000010 x15: 0000000000000000
>  x14: ffff800088169df0 x13: 0000000000000130 x12: 0000000000000110
>  x11: 0000000000000000 x10: 00000000ffffffff x9 : ffffffffffffffff
>  x8 : 00000000000000c0 x7 : bbbbbbbbbbbbbbbb x6 : 000000000000003f
>  x5 : 0000000000000001 x4 : 000000895af9d556 x3 : 0000000000000004
>  x2 : 0000000000000001 x1 : 0000000000000000 x0 : ffff0000c11fad30
>  Call trace:
>   __daif_local_irq_restore arch/arm64/include/asm/irqflags.h:175 [inline] (P)
>   arch_local_irq_restore arch/arm64/include/asm/irqflags.h:195 [inline] (P)
>   __raw_spin_unlock_irqrestore include/linux/spinlock_api_smp.h:151 [inline] (P)
>   _raw_spin_unlock_irqrestore+0x2c/0x80 kernel/locking/spinlock.c:194 (P)
>   class_raw_spinlock_irqsave_destructor include/linux/spinlock.h:557 [inline]
>   try_to_wake_up+0x3b0/0x7e0 kernel/sched/core.c:4216
>   wake_up_state+0x14/0x20 kernel/sched/core.c:4465
>   rt_mutex_wake_up_q kernel/locking/rtmutex.c:566 [inline]
>   rt_mutex_slowunlock+0x16c/0x2ac kernel/locking/rtmutex.c:1469
>   rt_spin_unlock+0x24/0x34 kernel/locking/spinlock_rt.c:85
>   spin_unlock_irqrestore include/linux/spinlock_rt.h:122 [inline]
>   free_to_partial_list+0x2b8/0x5bc mm/slub.c:4466
>   __slab_free+0x208/0x254 mm/slub.c:4498
>   do_slab_free mm/slub.c:4632 [inline]
>   slab_free mm/slub.c:4681 [inline]
>   kmem_cache_free+0x320/0x5f0 mm/slub.c:4782
>   mem_pool_free mm/kmemleak.c:508 [inline]
>   free_object_rcu+0x104/0x11c mm/kmemleak.c:536
>   rcu_do_batch kernel/rcu/tree.c:2605 [inline]
>   rcu_core kernel/rcu/tree.c:2861 [inline]
>   rcu_cpu_kthread+0x404/0xcd0 kernel/rcu/tree.c:2949
>   smpboot_thread_fn+0x270/0x474 kernel/smpboot.c:160
>   kthread+0x314/0x384 kernel/kthread.c:463
>   ret_from_fork+0x10/0x20 arch/arm64/kernel/entry.S:844

This task is holding the rtmutex, and it's just about to release the lock.

So.. I think the right questions to ask for further investigation are:

1. Which task locked the folio and why is it holding the lock for so long?
2. How severe is the memory pressure?
3. How much do the debug options contribute to the stall?

> NMI Backtrace shows RCU threads (rcuc/X) on multiple CPUs are experiencing
> severe contention while trying to acquire the RT-Mutex in
> free_to_partial_list(). Similar contention traces were observed on other
> CPUs.

I wouldn't say it's severe lock contention. One task is holding the lock,
and two tasks are waiting for the lock. We don't know how long these
tasks have been waiting for the lock, but presumably not for too long, as
only three tasks are using the lock.

Cheers,
Harry / Hyeonggon

> The core issue is that kmem_cache_node->list_lock is too hot to operate as
> a sleeping lock (RT-Mutex) under high-contention scenarios. I'm seeking
> community feedback on the best approach to ensure system stability while
> maintaining RT guarantees. Here are my thoughts on possible directions:
> 
> 1. Convert list_lock to raw_spinlock_t (Immediate Fix)
> 
>  The most straightforward solution is to change kmem_cache_node->list_lock
>  from spinlock_t to raw_spinlock_t. This ensures the lock remains a
>  non-sleeping spinlock even on PREEMPT_RT.
> 
>    - Pros:
>      Reliably resolves the deadlock by eliminating the RT-Mutex overhead.
>      Minimal code changes required.
> 
>    - Cons:
>      Reintroduces a traditional spinlock, which could theoretically
>      slightly increase latency for high-priority RT tasks. However, given
>      the very short critical section protected by this lock, this may be a
>      reasonable trade-off for stability.
> 
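For concreteness, the change sketched in option 1 would look roughly like
this (an untested sketch, not a patch; every list_lock locking site in
mm/slub.c would need the same conversion):

  struct kmem_cache_node {
          raw_spinlock_t list_lock;       /* stays a spinning lock on RT */
          /* other fields unchanged */
  };

  /* ...and at each call site, e.g. in free_to_partial_list(): */
  raw_spin_lock_irqsave(&n->list_lock, flags);
  /* manipulate n->partial / n->nr_partial */
  raw_spin_unlock_irqrestore(&n->list_lock, flags);
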
> 2. Reducing Lock Contention
> 
>  Instead of changing the lock type, we could reduce the frequency of
>  acquiring the node-level list_lock. Contention primarily occurs when
>  per-CPU partial lists (kmem_cache_cpu->partial) are flushed to the node
>  list (kmem_cache_node->partial).
>  Tuning and Batching: We could adjust the thresholds in flush_cpu_slab()
>  or enhance batch processing when moving slabs to reduce the number of
>  lock acquisitions.
> 
>    - Pros:
>      Improves overall SLUB scalability while maintaining PREEMPT_RT locking
>      semantics (using RT-Mutexes).
> 
>    - Cons:
>      Implementation and tuning are complex. It may increase per-CPU memory
>      usage and might not entirely resolve contention under extreme loads.
> 
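Purely as an illustration of the batching idea in option 2 (a hypothetical
helper, not existing SLUB code; the real flush paths are more involved),
the goal would be to pay for list_lock once per batch of slabs rather than
once per slab:

  /* Hypothetical: splice a whole batch of slabs onto the node list. */
  static void flush_batch_to_node(struct kmem_cache_node *n,
                                  struct list_head *batch, unsigned long nr)
  {
          unsigned long flags;

          spin_lock_irqsave(&n->list_lock, flags);
          list_splice_tail_init(batch, &n->partial);
          n->nr_partial += nr;
          spin_unlock_irqrestore(&n->list_lock, flags);
  }
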
> 3. Deferring Node List Updates
> 
>  A structural change to avoid acquiring the list_lock (RT-Mutex) in the
>  memory-freeing fast path. Instead of moving slabs to the node list
>  immediately in the context of the task freeing the memory (especially
>  from RCU callback context), this work could be deferred to a dedicated
>  workqueue or kthread.
> 
>    - Pros:
>      Removes heavy RT-Mutex acquisition from the fast path, potentially
>      improving response times.
> 
>    - Cons:
>      Adds significant complexity to the SLUB architecture. It might impact
>      performance in non-RT environments or delay memory reclamation.
> 
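Likewise, only as a sketch of the structure option 3 implies (hypothetical
names, not existing SLUB code): the freeing path would queue slabs on a
lock-free list, and a work item running in process context would take
list_lock to drain them:

  /* Hypothetical per-node deferral machinery. */
  struct deferred_partials {
          struct llist_head pending;      /* slabs queued by the free path */
          struct work_struct work;        /* drains 'pending' into n->partial */
  };

  static void drain_deferred_partials(struct work_struct *work)
  {
          struct deferred_partials *dp =
                  container_of(work, struct deferred_partials, work);
          struct llist_node *first = llist_del_all(&dp->pending);

          /*
           * Walk 'first' and splice the slabs onto n->partial under
           * n->list_lock, now taken from process context instead of
           * from RCU callback context.
           */
  }
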
> Given the severity of the observed deadlock, Option 1 using raw_spinlock_t
> appears to be the most pragmatic and immediate solution to guarantee system
> stability. However, I would like to hear the community's opinion on whether
> this is the approach best aligned with PREEMPT_RT goals, or if the MM/RT
> community prefers a longer-term structural improvement like Option 2 or 3.
> 
> Please let me know if further adjustments, testing, or reproduction are
> needed. I welcome any feedback or suggestions regarding this issue.
> 
> Best regards.
> Yunseong Kim
