linux-kernel - Re: [PATCH 00/14] sched: Support shared runqueue locking

Message-ID: <aMG2HAWhgAYBdh6Q@gpd4>
Date: Wed, 10 Sep 2025 19:32:12 +0200
From: Andrea Righi <arighi@...dia.com>
To: Peter Zijlstra <peterz@...radead.org>
Cc: tj@...nel.org, linux-kernel@...r.kernel.org, mingo@...hat.com,
	juri.lelli@...hat.com, vincent.guittot@...aro.org,
	dietmar.eggemann@....com, rostedt@...dmis.org, bsegall@...gle.com,
	mgorman@...e.de, vschneid@...hat.com, longman@...hat.com,
	hannes@...xchg.org, mkoutny@...e.com, void@...ifault.com,
	changwoo@...lia.com, cgroups@...r.kernel.org,
	sched-ext@...ts.linux.dev, liuwenfang@...or.com, tglx@...utronix.de
Subject: Re: [PATCH 00/14] sched: Support shared runqueue locking

Hi Peter,

Thanks for jumping on this. Comments below.

On Wed, Sep 10, 2025 at 05:44:09PM +0200, Peter Zijlstra wrote:
> Hi,
> 
> As mentioned [1], a fair amount of sched ext weirdness (current and proposed)
> is down to the core code not quite working right for shared runqueue stuff.
> 
> Instead of endlessly hacking around that, bite the bullet and fix it all up.
> 
> With these patches, it should be possible to clean up pick_task_scx() to not
> rely on balance_scx(). Additionally it should be possible to fix that RT issue,
> and the dl_server issue without further propagating lock breaks.
> 
> As is, these patches boot and run/pass selftests/sched_ext with lockdep on.
> 
> I meant to do more sched_ext cleanups, but since this has all already taken
> longer than I would've liked (real life interrupted :/), I figured I should
> post this as is and let TJ/Andrea poke at it.
> 
> Patches are also available at:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git sched/cleanup
> 
> 
> [1] https://lkml.kernel.org/r/20250904202858.GN4068168@noisy.programming.kicks-ass.net

I've done a quick test with this patch set applied and I was able to
trigger this:

[   49.746281] ============================================
[   49.746457] WARNING: possible recursive locking detected
[   49.746559] 6.17.0-rc4-virtme #85 Not tainted
[   49.746666] --------------------------------------------
[   49.746763] stress-ng-race-/5818 is trying to acquire lock:
[   49.746856] ffff890e0adacc18 (&dsq->lock){-.-.}-{2:2}, at: dispatch_dequeue+0x125/0x1f0
[   49.747052]
[   49.747052] but task is already holding lock:
[   49.747234] ffff890e0adacc18 (&dsq->lock){-.-.}-{2:2}, at: task_rq_lock+0x6c/0x170
[   49.747416]
[   49.747416] other info that might help us debug this:
[   49.747557]  Possible unsafe locking scenario:
[   49.747557]
[   49.747689]        CPU0
[   49.747740]        ----
[   49.747793]   lock(&dsq->lock);
[   49.747867]   lock(&dsq->lock);
[   49.747950]
[   49.747950]  *** DEADLOCK ***
[   49.747950]
[   49.748086]  May be due to missing lock nesting notation
[   49.748086]
[   49.748197] 3 locks held by stress-ng-race-/5818:
[   49.748335]  #0: ffff890e0f0fce70 (&p->pi_lock){-.-.}-{2:2}, at: task_rq_lock+0x38/0x170
[   49.748474]  #1: ffff890e3b6bcc98 (&rq->__lock){-.-.}-{2:2}, at: raw_spin_rq_lock_nested+0x20/0xa0
[   49.748652]  #2: ffff890e0adacc18 (&dsq->lock){-.-.}-{2:2}, at: task_rq_lock+0x6c/0x170

Reproducer:

 $ cd tools/sched_ext
 $ make scx_simple
 $ sudo ./build/bin/scx_simple
 ... and in another shell
 $ stress-ng --race-sched 0

I added an explicit BUG_ON() to see where the double locking is happening:

[   15.160400] Call Trace:
[   15.160706]  dequeue_task_scx+0x14a/0x270
[   15.160857]  move_queued_task+0x7d/0x2d0
[   15.160952]  affine_move_task+0x6ca/0x700
[   15.161210]  __set_cpus_allowed_ptr+0x64/0xa0
[   15.161348]  __sched_setaffinity+0x72/0x100
[   15.161459]  sched_setaffinity+0x261/0x2f0
[   15.161569]  __x64_sys_sched_setaffinity+0x50/0x80
[   15.161705]  do_syscall_64+0xbb/0x370
[   15.161816]  entry_SYSCALL_64_after_hwframe+0x77/0x7f

Are we missing a DEQUEUE_LOCKED in the sched_setaffinity() path?

Thanks,
-Andrea