Message-ID: <20250925083533.GW4067720@noisy.programming.kicks-ass.net>
Date: Thu, 25 Sep 2025 10:35:33 +0200
From: Peter Zijlstra <peterz@...radead.org>
To: Tejun Heo <tj@...nel.org>
Cc: linux-kernel@...r.kernel.org, mingo@...hat.com, juri.lelli@...hat.com,
vincent.guittot@...aro.org, dietmar.eggemann@....com,
rostedt@...dmis.org, bsegall@...gle.com, mgorman@...e.de,
vschneid@...hat.com, longman@...hat.com, hannes@...xchg.org,
mkoutny@...e.com, void@...ifault.com, arighi@...dia.com,
changwoo@...lia.com, cgroups@...r.kernel.org,
sched-ext@...ts.linux.dev, liuwenfang@...or.com, tglx@...utronix.de
Subject: Re: [PATCH 12/14] sched: Add shared runqueue locking to
__task_rq_lock()

Hi! Sorry for the delay,

On Tue, Sep 16, 2025 at 12:41:54PM -1000, Tejun Heo wrote:
> On Tue, Sep 16, 2025 at 12:29:57PM -1000, Tejun Heo wrote:
> ...
> > Long term, I think maintaining flexibility is of higher importance for
> > sched_ext than e.g. small performance improvements or even design or
> > implementation aesthetics. The primary purpose is enabling trying out new,
> > sometimes wild, things after all. As such, I don't think it'd be a good idea
> > to put strict restrictions on how the BPF side operates unless it affects
> > the ability to recover the system from a malfunctioning BPF scheduler, of
> > course.
>
> Thinking a bit more about it, I wonder if the status quo is actually an
> okay balance. All in-kernel sched classes use a per-CPU rq design, which
> meshes well with the current locking scheme, for obvious reasons.
>
> sched_ext is an oddball in that it may want to hot-migrate tasks at the last
> minute because who knows what the BPF side wants to do. However, this just
> boils down to having to always call balance() before any pick_task()
> attempts (including DL server case). Yeah, it's a niggle, especially as
> there needs to be a secondary hook to handle losing the race between
> balance() and pick_task(), but it's pretty contained conceptually and not a
> lot of code.

Status quo isn't sufficient though; there is that guy who wants to fix
some RT interaction, and there is that dl_server series.

The only viable option, other than overhauling the locking, is pushing rf
into pick_task() and having that do all the lock dancing. This gets rid
of that balance() abuse (which is needed for dl_server) and also allows
fixing that RT thing.

It just makes a giant mess of pick_task_scx(), which might have to drop
locks and retry/abort -- which you weren't very keen on, but yeah, it
should work.

As to letting BPF do wild experiments: that's fine of course, but not
exposing the actual locking requirements is like denying reality. You
can't do lock-break in pick_task_scx() and then claim lockless or
advanced locking -- that's just not true.

Also, you cannot claim the bpf-sched author is clever enough to implement
advanced locking, but then somehow not clever enough to deal with a
simple interface to express locking to the core code. That feels
disingenuous.

For all the DSQ-based schedulers, this new locking really is an
improvement; but if you don't want to constrain bpf-sched authors to
reality, then perhaps only do the lock-break dance for them?

Anyway, I'll go poke at this series again -- the latest queue.git
version seemed to work reliably for me (I could run stress-ng while
having scx_simple loaded), but the robot seems to have found an issue.