Message-ID: <CANDhNCrtdH-McjbMYkXHNDbB4S=ePREM=Jz6ciOkzjMerND+Vg@mail.gmail.com>
Date: Fri, 13 Jun 2025 22:20:13 -0700
From: John Stultz <jstultz@...gle.com>
To: hupu <hupu.gm@...il.com>
Cc: peterz@...radead.org, linux-kernel@...r.kernel.org, juri.lelli@...hat.com, 
	vschneid@...hat.com, mingo@...hat.com, vincent.guittot@...aro.org, 
	dietmar.eggemann@....com, rostedt@...dmis.org, bsegall@...gle.com, 
	mgorman@...e.de, hupu@...nssion.com, Suleiman Souhlal <suleiman@...gle.com>
Subject: Re: [RFC 1/1] sched: Skip redundant operations for proxy tasks
 needing return migration

On Thu, Jun 12, 2025 at 2:57 AM hupu <hupu.gm@...il.com> wrote:
>
> Hi John
>
> > Unfortunately I've also been tripping over a crash with the full
> > series (not the portion I'm currently submitting upstream) that often
> > takes 30+ hours to reproduce and extra debugging logic seems to make
> > it even harder to hit.
> >
>
> I encountered a strange crash while running the code from the
> proxy-exec-v17-WIP branch for over 150 hours. I'm currently unsure if
> this issue is the same as the one you mentioned.
>
> Here is the dmesg log at the time of the crash:
>
> [550195.352234][    C2] watchdog: BUG: soft lockup - CPU#2 stuck for
> 32s! [lock_torture_wr:139]
> [550195.374670][    C2] Modules linked in:
> [550195.375936][    C2] irq event stamp: 189580674
> [550195.376275][    C2] hardirqs last  enabled at (189580673):
> [<ffff8000813b7840>] exit_to_kernel_mode.isra.6+0xf8/0x130
> [550195.393675][    C2] hardirqs last disabled at (189580674):
> [<ffff8000813b78a4>] el1_interrupt+0x2c/0x70
> [550195.393779][    C2] softirqs last  enabled at (189580644):
> [<ffff80008005f7dc>] handle_softirqs+0x4b4/0x4cc
> [550195.396778][    C2] softirqs last disabled at (189580639):
> [<ffff800080010654>] __do_softirq+0x18/0x20
> [550195.406109][    C2] CPU: 2 UID: 0 PID: 139 Comm: lock_torture_wr
> Not tainted 6.12.0-00050-gf6c6e0ea8a1d-dirty #133
> [550195.406661][    C2] Hardware name: linux,dummy-virt (DT)
> [550195.407172][    C2] pstate: 80000005 (Nzcv daif -PAN -UAO -TCO
> -DIT -SSBS BTYPE=--)
> [550195.407396][    C2] pc : queued_spin_lock_slowpath+0x270/0x528
> [550195.408270][    C2] lr : queued_spin_lock_slowpath+0x3a8/0x528
> [550195.408358][    C2] sp : ffff800084d83d20
> [550195.408494][    C2] x29: ffff800084d83d20 x28: 28f5c28f5c28f5c3
> x27: 0000000000000000
> [550195.408985][    C2] x26: ffff800081ef3000 x25: 0000000000000000
> x24: 0000000000000002
> [550195.409158][    C2] x23: ffff800083532000 x22: ffff800082b55000
> x21: ffff00002c286740
> [550195.409322][    C2] x20: ffff800081ef3000 x19: ffff80008355b348
> x18: ffff800081ce6e48
> [550195.409522][    C2] x17: ffff7fffa954e000 x16: 000000006c0014e7
> x15: ffff8000835321f0
> [550195.409681][    C2] x14: 0001f45e95248cf0 x13: 0000000000000100
> x12: 0000000000000000
> [550195.409875][    C2] x11: 00000000002dc6c0 x10: 0000000000000000 x9
> : ffff7fffa9731000
> [550195.410102][    C2] x8 : ffff800081ef3fb8 x7 : 0000000000000000 x6
> : 00000000000c0000
> [550195.410262][    C2] x5 : ffff8000835365e8 x4 : 0000000000000000 x3
> : 0000000000000000
> [550195.410451][    C2] x2 : 0000000000000000 x1 : ffff00002c286740 x0
> : ffff00002c286748
> [550195.410849][    C2] Call trace:
> [550195.410995][    C2]  queued_spin_lock_slowpath+0x270/0x528
> [550195.411265][    C2]  do_raw_spin_lock+0xac/0x11c
> [550195.421183][    C2]  _raw_spin_lock+0x74/0x90
> [550195.421264][    C2]  torture_spin_lock_write_lock+0x1c/0x28
> [550195.421333][    C2]  lock_torture_writer+0x104/0x2f4
> [550195.421397][    C2]  kthread+0x128/0x134
> [550195.421600][    C2]  ret_from_fork+0x10/0x20
> [550195.422007][    C1] watchdog: BUG: soft lockup - CPU#1 stuck for
> 32s! [lock_torture_wr:144]
> [550195.422780][    C2] Kernel panic - not syncing: softlockup: hung tasks
> [550195.423616][    C2] CPU: 2 UID: 0 PID: 139 Comm: lock_torture_wr
> Tainted: G             L     6.12.0-00050-gf6c6e0ea8a1d-dirty #133
> [550195.423813][    C2] Tainted: [L]=SOFTLOCKUP
> [550195.423929][    C2] Hardware name: linux,dummy-virt (DT)
> [550195.424123][    C2] Call trace:
> [550195.424174][    C2]  dump_backtrace.part.8+0x8c/0xe8
> [550195.424232][    C2]  show_stack+0x2c/0x38
> [550195.424256][    C2]  dump_stack_lvl+0x6c/0xcc
> [550195.424999][    C2]  dump_stack+0x14/0x1c
> [550195.425050][    C2]  panic+0x150/0x428
> [550195.425121][    C2]  watchdog_timer_fn+0x258/0x2e8
> [550195.426192][    C2]  __hrtimer_run_queues+0x3bc/0x4a0
> [550195.426651][    C2]  hrtimer_interrupt+0xd8/0x248
> [550195.426680][    C2]  arch_timer_handler_virt+0x38/0x58
> [550195.427608][    C2]  handle_percpu_devid_irq+0xac/0x2cc
> [550195.427676][    C2]  generic_handle_domain_irq+0x40/0x58
> [550195.427743][    C2]  gic_handle_irq+0x64/0x124
> [550195.427771][    C2]  call_on_irq_stack+0x24/0x34
> [550195.427832][    C2]  do_interrupt_handler+0x90/0x94
> [550195.427865][    C2]  el1_interrupt+0x3c/0x70
> [550195.427899][    C2]  el1h_64_irq_handler+0x14/0x1c
> [550195.427960][    C2]  el1h_64_irq+0x64/0x68
> [550195.427989][    C2]  queued_spin_lock_slowpath+0x270/0x528
> [550195.428021][    C2]  do_raw_spin_lock+0xac/0x11c
> [550195.428051][    C2]  _raw_spin_lock+0x74/0x90
> [550195.428082][    C2]  torture_spin_lock_write_lock+0x1c/0x28
> [550195.428115][    C2]  lock_torture_writer+0x104/0x2f4
> [550195.428147][    C2]  kthread+0x128/0x134
> [550195.428182][    C2]  ret_from_fork+0x10/0x20
> [550195.428564][    C2] SMP: stopping secondary CPUs
> [550195.448256][    C2] Kernel Offset: disabled
> [550195.450640][    C2] CPU features: 0x00,00000019,40203020,560042af
> [550195.451534][    C2] Memory Limit: none
> [550195.452853][    C2] ---[ end Kernel panic - not syncing:
> softlockup: hung tasks ]---
>

So, while I'm not sure if this is the same issue, I have found fixes
for a few issues (but unfortunately not all) I was hitting with v17.

Not exactly what you're seeing, but I was seeing some issues where, due
to a spurious wakeup, we would end up with an owner that was woken but
was still in the BO_WAKING state and hadn't been return migrated. This
could cause us to get stuck looping in and out of find_proxy_task(),
which could cause tasks to lock up in a similar fashion. So I've
re-added the return migration for that case and it seems to be
resolved.
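
In case it helps to see the shape of it, here's a tiny userspace sketch
of that hazard. The BO_* values, the wake_cpu field, and
needs_return_migration() are just stand-ins loosely mirroring the
series, not the actual kernel code:

/* Toy userspace model of the hazard above -- NOT the kernel code.
 * Names only loosely mirror the proxy-exec series.
 */
#include <stdbool.h>
#include <stdio.h>

enum blocked_on_state { BO_RUNNABLE, BO_BLOCKED, BO_WAKING };

struct task {
	const char *comm;
	enum blocked_on_state bo_state;
	int cpu;        /* CPU the task currently sits on */
	int wake_cpu;   /* CPU it should be return-migrated to */
};

/* A spuriously woken owner can be left in BO_WAKING on the wrong CPU.
 * If the pick path just keeps re-picking, nothing ever moves it and we
 * spin in and out of the pick loop; the fix is to do the return
 * migration when this case is detected.
 */
static bool needs_return_migration(const struct task *owner)
{
	return owner->bo_state == BO_WAKING && owner->cpu != owner->wake_cpu;
}

int main(void)
{
	struct task owner = { "mutex_owner", BO_WAKING, 2, 0 };

	if (needs_return_migration(&owner)) {
		printf("%s: return-migrate CPU%d -> CPU%d\n",
		       owner.comm, owner.cpu, owner.wake_cpu);
		owner.cpu = owner.wake_cpu;
		owner.bo_state = BO_RUNNABLE;
	}
	return 0;
}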

I've not seen a case where a spinlock was unowned but everyone was
spinning on it (particularly because my patch series doesn't really
mess with spinlocks, assuming the runnable/spinning tasks get
scheduled to run). The owner being stuck and not letting the lock go
seems more likely, so that's curious.

I've got my current WIP set here:
  https://github.com/johnstultz-work/linux-dev/commits/proxy-exec-WIP/
(Still need to write proper commit messages for many of the new changes)

The problem I see right now can sometimes take 30+ hours to trip on a
64-core qemu system running cyclictest, my prio-inversion-demo test,
and the in-kernel test_ww_mutex tests in loops.

The backtrace will look something like:
[18071.849807] ------------[ cut here ]------------
[18071.849829]
[18071.849830] ======================================================
[18071.849831] WARNING: possible circular locking dependency detected
[18071.849832] 6.12.0-00060-g6b88c4fed2ba #1362 Not tainted
[18071.849834] ------------------------------------------------------
[18071.849835] kworker/u256:8/940 is trying to acquire lock:
[18071.849837] ffffffff82ffaf38 ((console_sem).lock){-...}-{2:2}
[18071.849887] BUG: unable to handle page fault for address: 0000004800153ad9
[18071.878733] #PF: supervisor read access in kernel mode
[18071.881809] #PF: error_code(0x0000) - not-present page
[18071.884840] PGD 0 P4D 0
[18071.886351] Oops: Oops: 0000 [#1] PREEMPT SMP NOPTI
[18071.889305] CPU: 11 UID: 0 PID: 940 Comm: kworker/u256:8 Not
tainted 6.12.0-00060-g6b88c4fed2ba #1362
[18071.894553] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS 1.16.3-debian-1.16.3-2 04/01/2014
[18071.900009] Workqueue: test-ww_mutex stress_one_work
[18071.902902] RIP: 0010:pick_next_task_fair+0x2c5/0x3f0
[18071.905860] Code: b5 90 00 00 00 45 39 f7 7e 5c 49 8b bc 24 a0 00
00 00 4c 89 e6 e8 4b a3 ff ff 4d 8b a4 24 98 00 00 00 4d 8b bc 24 a0
00 00 00 <49> 8b bd a0 00 00 00 49 39 ff 75 c0 4d 85 ff 74 bb 4c 89 ee
4c 89
[18071.916706] RSP: 0018:ffffc90002877c38 EFLAGS: 00010006
[18071.919729] RAX: ffff888105810000 RBX: ffff88810087a140 RCX: 0000000000000000
[18071.923810] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[18071.927894] RBP: ffff8881b9c6fa00 R08: 0000000000000000 R09: ffff888103fbc280
[18071.931983] R10: 0000000000000000 R11: 0000000000000000 R12: ffffffff810ec0c0
[18071.936051] R13: 0000004800153a39 R14: 0000000000000000 R15: 00441f0ffa1e0ff3
[18071.940143] FS:  0000000000000000(0000) GS:ffff8881b90c0000(0000)
knlGS:0000000000000000
[18071.944776] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[18071.948085] CR2: 0000004800153ad9 CR3: 0000000112512004 CR4: 0000000000370ef0
[18071.952210] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[18071.956277] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[18071.960375] Call Trace:
[18071.961826]  <TASK>
[18071.963097]  ? __die+0x56/0x92
[18071.964919]  ? page_fault_oops+0xca/0x260
[18071.967246]  ? down_trylock+0x13/0x40
[18071.969400]  ? _printk+0x6c/0x87
[18071.971246]  ? __down_trylock_console_sem+0x23/0x90
[18071.974005]  ? exc_page_fault+0x525/0x8e0
[18071.976276]  ? asm_exc_page_fault+0x26/0x30
[18071.978614]  ? put_prev_entity+0x70/0xb0
[18071.980861]  ? pick_next_task_fair+0x2c5/0x3f0
[18071.983409]  ? pick_next_task_fair+0x309/0x3f0
[18071.985977]  __schedule+0x22f/0x20c0
[18071.988041]  ? lock_release+0x16d/0x2e0
[18071.990248]  schedule+0x3d/0x150
[18071.992197]  schedule_hrtimeout_range_clock+0x110/0x160
[18071.995155]  ? __pfx_hrtimer_wakeup+0x10/0x10
[18071.997679]  ? usleep_range_state+0x59/0xc0
[18072.000063]  usleep_range_state+0x87/0xc0
[18072.002341]  stress_one_work+0x41/0x90
[18072.004465]  process_one_work+0x204/0x5a0
[18072.006793]  ? lock_is_held_type+0xca/0x130
[18072.009195]  worker_thread+0x18e/0x350
[18072.011328]  ? __pfx_worker_thread+0x10/0x10
[18072.013773]  kthread+0xee/0x120
[18072.015593]  ? __pfx_kthread+0x10/0x10
[18072.017805]  ret_from_fork+0x31/0x50
[18072.019803]  ? __pfx_kthread+0x10/0x10
[18072.021948]  ret_from_fork_asm+0x1a/0x30
[18072.024214]  </TASK>

Which is due to tripping the:
  SCHED_WARN_ON(cfs_rq->curr != prev);
check in put_prev_entity().

And then hitting some sort of memory trouble as the lockdep warning is
being printed.

I suspect the issue is that the cfs_rq->curr (which should match the
rq->donor set from the previous pick_next_task()) ends up as a
dangling reference when we migrate or otherwise re-do the pick_again
loop.
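
To make the suspected ordering concrete, here's a crude userspace model
of the hypothesis. The structs and helpers are stand-ins; only the
curr != prev check mirrors the real SCHED_WARN_ON():

/* Toy userspace model of the suspected ordering -- NOT kernel code.
 * It only illustrates how redoing the pick (or migrating the donor)
 * without putting the previous entity leaves cfs_rq->curr dangling,
 * so the next put_prev_entity()-style check sees curr != prev.
 */
#include <stdio.h>

struct sched_entity { const char *name; };
struct cfs_rq { struct sched_entity *curr; };
struct rq { struct cfs_rq cfs; struct sched_entity *donor; };

/* pick: the picked entity gets recorded as cfs_rq->curr and rq->donor */
static void pick(struct rq *rq, struct sched_entity *se)
{
	rq->cfs.curr = se;
	rq->donor = se;
}

/* put: mirrors the SCHED_WARN_ON(cfs_rq->curr != prev) check */
static void put_prev(struct rq *rq, struct sched_entity *prev)
{
	if (rq->cfs.curr != prev)
		fprintf(stderr, "WARN: cfs_rq->curr (%s) != prev (%s)\n",
			rq->cfs.curr ? rq->cfs.curr->name : "NULL",
			prev ? prev->name : "NULL");
	rq->cfs.curr = NULL;
}

int main(void)
{
	struct sched_entity a = { "donor_A" }, b = { "donor_B" };
	struct rq rq = { .cfs = { NULL }, .donor = NULL };

	pick(&rq, &a);

	/* Suspected bug: a migration / pick_again path replaces rq->donor
	 * without ever putting A, so cfs_rq->curr still points at A
	 * (possibly a dangling reference by then in the real kernel).
	 */
	rq.donor = &b;

	/* The next put of the current donor trips the warning. */
	put_prev(&rq, rq.donor);
	return 0;
}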

Unfortunately, as I add more trace_printks to debug the issue, it
becomes even rarer to trip, or the log somehow ends up incomplete
(missing final lines I'd expect given the backtrace in the splat), so
it has been slow progress. :(

Anyway, let me know if you run into anything similar or have ideas.
I'll keep trying to trap the behavior better, with runs going over this
weekend.

thanks
-john
