Message-ID: <23a82166-820b-4baa-90ee-2d6d1de4f4d3@efficios.com>
Date: Thu, 10 Jul 2025 10:18:07 -0400
From: Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
To: Gabriele Monaco <gmonaco@...hat.com>,
 kernel test robot <oliver.sang@...el.com>
Cc: oe-lkp@...ts.linux.dev, lkp@...el.com, linux-mm@...ck.org,
 linux-kernel@...r.kernel.org, aubrey.li@...ux.intel.com,
 yu.c.chen@...el.com, Andrew Morton <akpm@...ux-foundation.org>,
 David Hildenbrand <david@...hat.com>, Ingo Molnar <mingo@...hat.com>,
 Peter Zijlstra <peterz@...radead.org>, "Paul E. McKenney"
 <paulmck@...nel.org>, Ingo Molnar <mingo@...hat.org>
Subject: Re: [PATCH v14 2/3] sched: Move task_mm_cid_work to mm timer

On 2025-07-10 09:40, Gabriele Monaco wrote:
> 
> 
> On Thu, 2025-07-10 at 09:23 -0400, Mathieu Desnoyers wrote:
>> On 2025-07-10 00:56, kernel test robot wrote:
>>>
>>>
>>> Hello,
>>>
>>> kernel test robot noticed "WARNING:inconsistent_lock_state" on:
>>>
>>> commit: d06e66c6025e44136e6715d24c23fb821a415577 ("[PATCH v14 2/3]
>>> sched: Move task_mm_cid_work to mm timer")
>>> url:
>>> https://github.com/intel-lab-lkp/linux/commits/Gabriele-Monaco/sched-Add-prev_sum_exec_runtime-support-for-RT-DL-and-SCX-classes/20250707-224959
>>> patch link:
>>> https://lore.kernel.org/all/20250707144824.117014-3-gmonaco@redhat.com/
>>> patch subject: [PATCH v14 2/3] sched: Move task_mm_cid_work to mm
>>> timer
>>>
>>> in testcase: boot
>>>
>>> config: x86_64-randconfig-003-20250708
>>> compiler: gcc-11
>>> test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp
>>> 2 -m 16G
>>>
>>> (please refer to attached dmesg/kmsg for entire log/backtrace)
>>>
>>>
>>> +-------------------------------------------------+------------+------------+
>>> |                                                 | 50c1dc07ee | d06e66c602 |
>>> +-------------------------------------------------+------------+------------+
>>> | WARNING:inconsistent_lock_state                 | 0          | 12         |
>>> | inconsistent{SOFTIRQ-ON-W}->{IN-SOFTIRQ-W}usage | 0          | 12         |
>>> +-------------------------------------------------+------------+------------+
>>>
>>
>> I suspect the issue comes from calling mmdrop(mm) from timer context
>> in a scenario
>> where the mm_count can drop to 0.
>>
>> This causes calls to pgd_free() and such to take the pgd_lock in
>> softirq
>> context, when in other cases it's taken with softirqs enabled.
>>
>> See "mmdrop_sched()" for RT. I think we need something similar for
>> the
>> non-RT case, e.g. a:
>>
>> static inline void __mmdrop_delayed(struct rcu_head *rhp)
>> {
>>           struct mm_struct *mm = container_of(rhp, struct mm_struct, delayed_drop);
>>
>>           __mmdrop(mm);
>> }
>>
>> static inline void mmdrop_timer(struct mm_struct *mm)
>> {
>>           /* Provides a full memory barrier. See mmdrop() */
>>           if (atomic_dec_and_test(&mm->mm_count))
>>                   call_rcu(&mm->delayed_drop, __mmdrop_delayed);
>> }
>>
>> Thoughts ?
>>
> 
> Thanks for the suggestion.
> 
> I noticed the problem is in the mmdrop over there, but I'm seeing this
> is getting unnecessarily complicated.
> I'm not sure it's worth going down this path, also considering pushing
> the timer wheel like this might end up in unintended effects like it
> happened with the workqueue.
> 
> I am going to try the alternative approach of running the scan in
> batches [1] still using a task_work but triggering it from
> __rseq_handle_notify_resume like here.
> If that works in the original usecase, I guess it's better to keep it
> that way.
> 
> What do you think?

Yes, I think the batching approach makes sense considering the overhead
of worker threads when used periodically at 100ms intervals, the
complexity that arises from doing mmdrop() from timer context, and also
the fact that doing task_mm_cid_scan (iteration on all possible cpus)
from timer context may introduce latency on configurations that
implement timers with softirqs.
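
As an illustration of the mmdrop() concern discussed earlier in the thread, here is a minimal userspace sketch of the "defer the final drop" pattern. It is not kernel code: `struct obj`, `obj_put_deferred()` and `drain_deferred()` are hypothetical names, the model is single-threaded, and a plain list stands in for call_rcu(). The point is only that the last reference drop from a restricted context queues the object instead of tearing it down inline.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for a refcounted object such as an mm. */
struct obj {
	atomic_int refcount;
	int freed;                  /* set by the teardown, for illustration */
	struct obj *next_deferred;  /* link in the deferred-teardown list */
};

/* Single-threaded model: the deferred list needs no locking here. */
static struct obj *deferred_head;

/* Teardown that must not run in the restricted (timer-like) context. */
static void obj_teardown(struct obj *o)
{
	o->freed = 1;
}

/* Drop a reference from a restricted context: never tears down inline. */
static void obj_put_deferred(struct obj *o)
{
	/* atomic_fetch_sub() returns the value held before the decrement. */
	if (atomic_fetch_sub(&o->refcount, 1) == 1) {
		o->next_deferred = deferred_head;
		deferred_head = o;
	}
}

/*
 * Run later, from a context where teardown is allowed.
 * Returns the number of objects torn down.
 */
static int drain_deferred(void)
{
	int n = 0;

	while (deferred_head) {
		struct obj *o = deferred_head;

		deferred_head = o->next_deferred;
		obj_teardown(o);
		n++;
	}
	return n;
}
```

With a refcount of 2, the first obj_put_deferred() queues nothing, the second queues the object, and only drain_deferred() runs the teardown. In the kernel the equivalent deferral would need its own context and lifetime rules, which is the added complexity mentioned above.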

It will, however, increase the time it takes for cid compaction to react
to threads exiting (see "selftests/rseq: Add test for mm_cid
compaction"). We will probably want to update this test to take into
account that the time compaction takes to complete depends on the
number of possible cpus.
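
To make the dependency on the number of possible cpus concrete, here is a small userspace model of a batched scan. It is only a sketch under assumptions: `struct scan_state`, `scan_one_batch()` and `invocations_per_pass()` are hypothetical names, not taken from the patch series, and the batch size is an arbitrary parameter.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace model; none of these names exist in the kernel. */
struct scan_state {
	unsigned int nr_cpus;   /* number of possible CPUs */
	unsigned int batch;     /* CPUs visited per invocation */
	unsigned int next_cpu;  /* where the next invocation resumes */
};

/*
 * Visit up to st->batch CPUs starting at st->next_cpu.
 * Returns 1 when this call completed a full pass over all CPUs, else 0.
 */
static int scan_one_batch(struct scan_state *st,
			  void (*visit)(unsigned int cpu))
{
	unsigned int end;

	assert(st->batch > 0 && st->nr_cpus > 0);

	end = st->next_cpu + st->batch;
	if (end > st->nr_cpus)
		end = st->nr_cpus;

	for (unsigned int cpu = st->next_cpu; cpu < end; cpu++) {
		if (visit)
			visit(cpu);
	}

	if (end == st->nr_cpus) {
		st->next_cpu = 0;   /* wrap around for the next pass */
		return 1;
	}
	st->next_cpu = end;
	return 0;
}

/* Invocations needed for one full pass, i.e. ceil(nr_cpus / batch). */
static unsigned int invocations_per_pass(const struct scan_state *st)
{
	struct scan_state tmp = *st;
	unsigned int n = 0;

	tmp.next_cpu = 0;
	do {
		n++;
	} while (!scan_one_batch(&tmp, NULL));
	return n;
}
```

In this model, 10 possible cpus with a batch of 4 need 3 invocations per full pass, so a test waiting for compaction would have to allow for that many trigger events rather than one.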

Thanks,

Mathieu

> 
> Thanks,
> Gabriele
> 
> [1] -
> https://lore.kernel.org/lkml/20250217112317.258716-1-gmonaco@redhat.com
> 
>> Thanks,
>>
>> Mathieu
>>
>>>
>>> If you fix the issue in a separate patch/commit (i.e. not just a
>>> new version of
>>> the same patch/commit), kindly add following tags
>>> | Reported-by: kernel test robot <oliver.sang@...el.com>
>>> | Closes: https://lore.kernel.org/oe-lkp/202507100606.90787fe6-lkp@intel.com
>>>
>>>
>>> [   26.556715][    C0] WARNING: inconsistent lock state
>>> [   26.557127][    C0] 6.16.0-rc5-00002-gd06e66c6025e #1 Tainted:
>>> G                T
>>> [   26.557730][    C0] --------------------------------
>>> [   26.558133][    C0] inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-
>>> W} usage.
>>> [   26.558662][    C0] stdbuf/386 [HC0[0]:SC1[1]:HE1:SE0] takes:
>>> [ 26.559118][ C0] ffffffff870d4438 (pgd_lock){+.?.}-{3:3}, at:
>>> pgd_free (arch/x86/mm/pgtable.c:67 arch/x86/mm/pgtable.c:98
>>> arch/x86/mm/pgtable.c:379)
>>> [   26.559786][    C0] {SOFTIRQ-ON-W} state was registered at:
>>> [ 26.560232][ C0] mark_usage (kernel/locking/lockdep.c:4669)
>>> [ 26.560561][ C0] __lock_acquire (kernel/locking/lockdep.c:5194)
>>> [ 26.560929][ C0] lock_acquire (kernel/locking/lockdep.c:473
>>> kernel/locking/lockdep.c:5873)
>>> [ 26.561267][ C0] _raw_spin_lock
>>> (include/linux/spinlock_api_smp.h:134
>>> kernel/locking/spinlock.c:154)
>>> [ 26.561617][ C0] pgd_alloc (arch/x86/mm/pgtable.c:86
>>> arch/x86/mm/pgtable.c:353)
>>> [ 26.561950][ C0] mm_init+0x64f/0xbfb
>>> [ 26.562342][ C0] mm_alloc (kernel/fork.c:1109)
>>> [ 26.562655][ C0] dma_resv_lockdep (drivers/dma-buf/dma-resv.c:784)
>>> [ 26.563020][ C0] do_one_initcall (init/main.c:1274)
>>> [ 26.563389][ C0] do_initcalls (init/main.c:1335 init/main.c:1352)
>>> [ 26.563744][ C0] kernel_init_freeable (init/main.c:1588)
>>> [ 26.564144][ C0] kernel_init (init/main.c:1476)
>>> [ 26.564402][ C0] ret_from_fork (arch/x86/kernel/process.c:154)
>>> [ 26.564633][ C0] ret_from_fork_asm (arch/x86/entry/entry_64.S:258)
>>> [   26.564871][    C0] irq event stamp: 4774
>>> [ 26.565070][ C0] hardirqs last enabled at (4774):
>>> _raw_spin_unlock_irq (arch/x86/include/asm/irqflags.h:42
>>> arch/x86/include/asm/irqflags.h:119
>>> include/linux/spinlock_api_smp.h:159 kernel/locking/spinlock.c:202)
>>> [ 26.565526][ C0] hardirqs last disabled at (4773):
>>> _raw_spin_lock_irq (arch/x86/include/asm/preempt.h:80
>>> include/linux/spinlock_api_smp.h:118 kernel/locking/spinlock.c:170)
>>> [ 26.565971][ C0] softirqs last enabled at (4256): local_bh_enable
>>> (include/linux/bottom_half.h:33)
>>> [ 26.566408][ C0] softirqs last disabled at (4771): __do_softirq
>>> (kernel/softirq.c:614)
>>> [   26.566823][    C0]
>>> [   26.566823][    C0] other info that might help us debug this:
>>> [   26.567198][    C0]  Possible unsafe locking scenario:
>>> [   26.567198][    C0]
>>> [   26.567548][    C0]        CPU0
>>> [   26.567709][    C0]        ----
>>> [   26.567869][    C0]   lock(pgd_lock);
>>> [   26.568060][    C0]   <Interrupt>
>>> [   26.568255][    C0]     lock(pgd_lock);
>>> [   26.568452][    C0]
>>> [   26.568452][    C0]  *** DEADLOCK ***
>>> [   26.568452][    C0]
>>> [   26.568830][    C0] 3 locks held by stdbuf/386:
>>> [ 26.569056][ C0] #0: ffff888170d5c1a8 (&sb->s_type->i_mutex_key){++++}-{4:4}, at: lookup_slow (fs/namei.c:1834)
>>> [ 26.569535][ C0] #1: ffff888170cf5850 (&lockref->lock){+.+.}-{3:3}, at: d_alloc (include/linux/dcache.h:319 fs/dcache.c:1777)
>>> [ 26.569961][ C0] #2: ffffc90000007d40 ((&mm->cid_timer)){+.-.}-{0:0}, at: call_timer_fn (kernel/time/timer.c:1744)
>>> [   26.570421][    C0]
>>> [   26.570421][    C0] stack backtrace:
>>> [   26.570704][    C0] CPU: 0 UID: 0 PID: 386 Comm: stdbuf Tainted:
>>> G                T   6.16.0-rc5-00002-gd06e66c6025e #1
>>> PREEMPT(voluntary)  39c5cbdaf5b4eb171776daa7d42daa95c0766676
>>> [   26.570716][    C0] Tainted: [T]=RANDSTRUCT
>>> [   26.570719][    C0] Call Trace:
>>> [   26.570723][    C0]  <IRQ>
>>> [ 26.570727][ C0] dump_stack_lvl (lib/dump_stack.c:122
>>> (discriminator 4))
>>> [ 26.570735][ C0] dump_stack (lib/dump_stack.c:130)
>>> [ 26.570740][ C0] print_usage_bug (kernel/locking/lockdep.c:4047)
>>> [ 26.570748][ C0] valid_state (kernel/locking/lockdep.c:4060)
>>> [ 26.570755][ C0] mark_lock_irq (kernel/locking/lockdep.c:4270)
>>> [ 26.570762][ C0] ? save_trace (kernel/locking/lockdep.c:592)
>>> [ 26.570773][ C0] ? mark_lock (kernel/locking/lockdep.c:4728
>>> (discriminator 3))
>>> [ 26.570780][ C0] mark_lock (kernel/locking/lockdep.c:4756)
>>> [ 26.570787][ C0] mark_usage (kernel/locking/lockdep.c:4645)
>>> [ 26.570796][ C0] __lock_acquire (kernel/locking/lockdep.c:5194)
>>> [ 26.570804][ C0] lock_acquire (kernel/locking/lockdep.c:473
>>> kernel/locking/lockdep.c:5873)
>>> [ 26.570811][ C0] ? pgd_free (arch/x86/mm/pgtable.c:67
>>> arch/x86/mm/pgtable.c:98 arch/x86/mm/pgtable.c:379)
>>> [ 26.570822][ C0] ? validate_chain (kernel/locking/lockdep.c:3826
>>> kernel/locking/lockdep.c:3879)
>>> [ 26.570828][ C0] ? wake_up_new_task (kernel/sched/core.c:10597)
>>> [ 26.570839][ C0] _raw_spin_lock
>>> (include/linux/spinlock_api_smp.h:134
>>> kernel/locking/spinlock.c:154)
>>> [ 26.570845][ C0] ? pgd_free (arch/x86/mm/pgtable.c:67
>>> arch/x86/mm/pgtable.c:98 arch/x86/mm/pgtable.c:379)
>>> [ 26.570854][ C0] pgd_free (arch/x86/mm/pgtable.c:67
>>> arch/x86/mm/pgtable.c:98 arch/x86/mm/pgtable.c:379)
>>> [ 26.570863][ C0] ? wake_up_new_task (kernel/sched/core.c:10597)
>>> [ 26.570873][ C0] __mmdrop (kernel/fork.c:681)
>>> [ 26.570882][ C0] ? wake_up_new_task (kernel/sched/core.c:10597)
>>> [ 26.570891][ C0] mmdrop (include/linux/sched/mm.h:55)
>>> [ 26.570901][ C0] task_mm_cid_scan (kernel/sched/core.c:10619
>>> (discriminator 3))
>>> [ 26.570910][ C0] ? lock_is_held (include/linux/lockdep.h:249)
>>> [ 26.570918][ C0] ? wake_up_new_task (kernel/sched/core.c:10597)
>>> [ 26.570928][ C0] call_timer_fn (arch/x86/include/asm/atomic.h:23
>>> include/linux/atomic/atomic-arch-fallback.h:457
>>> include/linux/jump_label.h:262 include/trace/events/timer.h:127
>>> kernel/time/timer.c:1748)
>>> [ 26.570935][ C0] ? trace_timer_base_idle
>>> (kernel/time/timer.c:1724)
>>> [ 26.570943][ C0] ? wake_up_new_task (kernel/sched/core.c:10597)
>>> [ 26.570953][ C0] ? wake_up_new_task (kernel/sched/core.c:10597)
>>> [ 26.570962][ C0] __run_timers (kernel/time/timer.c:1799
>>> kernel/time/timer.c:2372)
>>> [ 26.570970][ C0] ? add_timer_global (kernel/time/timer.c:2343)
>>> [ 26.570977][ C0] ? __kasan_check_write (mm/kasan/shadow.c:38)
>>> [ 26.570988][ C0] ? do_raw_spin_lock
>>> (arch/x86/include/asm/atomic.h:107 include/linux/atomic/atomic-
>>> arch-fallback.h:2170 include/linux/atomic/atomic-
>>> instrumented.h:1302 include/asm-generic/qspinlock.h:111
>>> kernel/locking/spinlock_debug.c:116)
>>> [ 26.570996][ C0] ? __raw_spin_lock_init
>>> (kernel/locking/spinlock_debug.c:114)
>>> [ 26.571006][ C0] __run_timer_base (kernel/time/timer.c:2385)
>>> [ 26.571014][ C0] run_timer_base (kernel/time/timer.c:2394)
>>> [ 26.571021][ C0] run_timer_softirq
>>> (arch/x86/include/asm/atomic.h:23 include/linux/atomic/atomic-arch-
>>> fallback.h:457 include/linux/jump_label.h:262
>>> kernel/time/timer.c:342 kernel/time/timer.c:2406)
>>> [ 26.571028][ C0] handle_softirqs (arch/x86/include/asm/atomic.h:23
>>> include/linux/atomic/atomic-arch-fallback.h:457
>>> include/linux/jump_label.h:262 include/trace/events/irq.h:142
>>> kernel/softirq.c:580)
>>> [ 26.571039][ C0] __do_softirq (kernel/softirq.c:614)
>>> [ 26.571046][ C0] __irq_exit_rcu (kernel/softirq.c:453
>>> kernel/softirq.c:680)
>>> [ 26.571055][ C0] irq_exit_rcu (kernel/softirq.c:698)
>>> [ 26.571064][ C0] sysvec_apic_timer_interrupt
>>> (arch/x86/kernel/apic/apic.c:1050 arch/x86/kernel/apic/apic.c:1050)
>>> [   26.571076][    C0]  </IRQ>
>>> [   26.571078][    C0]  <TASK>
>>> [ 26.571081][ C0] asm_sysvec_apic_timer_interrupt
>>> (arch/x86/include/asm/idtentry.h:574)
>>> [ 26.571088][ C0] RIP: 0010:d_alloc (fs/dcache.c:1778)
>>> [ 26.571100][ C0] Code: 8d 7c 24 50 b8 ff ff 37 00 ff 83 f8 00 00
>>> 00 48 89 fa 48 c1 e0 2a 48 c1 ea 03 80 3c 02 00 74 05 e8 5f f3 f6
>>> ff 49 89 5c 24 50 <49> 8d bc 24 10 01 00 00 48 8d b3 20 01 00 00 e8
>>> 87 bc ff ff 4c 89
>>> All code
>>> ========
>>>      0:	8d 7c 24 50          	lea    0x50(%rsp),%edi
>>>      4:	b8 ff ff 37 00       	mov    $0x37ffff,%eax
>>>      9:	ff 83 f8 00 00 00    	incl   0xf8(%rbx)
>>>      f:	48 89 fa             	mov    %rdi,%rdx
>>>     12:	48 c1 e0 2a          	shl    $0x2a,%rax
>>>     16:	48 c1 ea 03          	shr    $0x3,%rdx
>>>     1a:	80 3c 02 00          	cmpb   $0x0,(%rdx,%rax,1)
>>>     1e:	74 05                	je     0x25
>>>     20:	e8 5f f3 f6 ff       	call   0xfffffffffff6f384
>>>     25:	49 89 5c 24 50       	mov    %rbx,0x50(%r12)
>>>     2a:*	49 8d bc 24 10 01 00 	lea    0x110(%r12),%rdi		<-- trapping instruction
>>>     31:	00
>>>     32:	48 8d b3 20 01 00 00 	lea    0x120(%rbx),%rsi
>>>     39:	e8 87 bc ff ff       	call   0xffffffffffffbcc5
>>>     3e:	4c                   	rex.WR
>>>     3f:	89                   	.byte 0x89
>>>
>>> Code starting with the faulting instruction
>>> ===========================================
>>>      0:	49 8d bc 24 10 01 00 	lea    0x110(%r12),%rdi
>>>      7:	00
>>>      8:	48 8d b3 20 01 00 00 	lea    0x120(%rbx),%rsi
>>>      f:	e8 87 bc ff ff       	call   0xffffffffffffbc9b
>>>     14:	4c                   	rex.WR
>>>     15:	89                   	.byte 0x89
>>>
>>>
>>> The kernel config and materials to reproduce are available at:
>>> https://download.01.org/0day-ci/archive/20250710/202507100606.90787fe6-lkp@intel.com
>>>
>>>
>>>
>>
> 


-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
