Message-ID: <aNW3du48v3PvwPbq@slm.duckdns.org>
Date: Thu, 25 Sep 2025 11:43:18 -1000
From: Tejun Heo <tj@...nel.org>
To: Peter Zijlstra <peterz@...radead.org>
Cc: linux-kernel@...r.kernel.org, mingo@...hat.com, juri.lelli@...hat.com,
	vincent.guittot@...aro.org, dietmar.eggemann@....com,
	rostedt@...dmis.org, bsegall@...gle.com, mgorman@...e.de,
	vschneid@...hat.com, longman@...hat.com, hannes@...xchg.org,
	mkoutny@...e.com, void@...ifault.com, arighi@...dia.com,
	changwoo@...lia.com, cgroups@...r.kernel.org,
	sched-ext@...ts.linux.dev, liuwenfang@...or.com, tglx@...utronix.de
Subject: Re: [PATCH 12/14] sched: Add shared runqueue locking to
 __task_rq_lock()

Hello,

On Thu, Sep 25, 2025 at 10:35:33AM +0200, Peter Zijlstra wrote:
> > sched_ext is an oddball in that it may want to hot-migrate tasks at the last
> > minute because who knows what the BPF side wants to do. However, this just
> > boils down to having to always call balance() before any pick_task()
> > attempts (including DL server case). Yeah, it's a niggle, especially as
> > there needs to be a secondary hook to handle losing the race between
> > balance() and pick_task(), but it's pretty contained conceptually and not a
> > lot of code.
> 
> Status quo isn't sufficient; there is that guy that wants to fix some RT
> interaction, and there is that dl_server series.

Can you point me to the RT interaction issue?

Just for context, from the sched_ext side, the two pending issues are:

- An extra hook that runs after the next task to run has been picked,
  regardless of its sched class, to fix ops.cpu_acquire/release().

- Invoking sched_ext's balance() if its DL server is going to run. This
  would be in the same place as the balance() call made for pick_task(),
  and it shouldn't be too difficult to package the two together so they're
  a bit less crufty (a rough sketch follows below).
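
Very roughly, and purely as an illustration (all function names in the
sketch below are placeholders, not the actual core.c or ext.c code), the
shape would be something like:

/*
 * Placeholder sketch of the pick path -- the function names are made up
 * and do not match the real kernel code.
 */
static struct task_struct *pick_next_sketch(struct rq *rq,
					    struct task_struct *prev,
					    struct rq_flags *rf)
{
	struct task_struct *p;

	/*
	 * sched_ext's balance() would need to run here whenever either
	 * SCX itself or its DL server may end up supplying the next task.
	 */
	scx_balance_sketch(rq, prev, rf);

	p = pick_task_from_classes_sketch(rq);

	/*
	 * The extra hook: fires once the winning task is known, regardless
	 * of which sched class it came from, so that ops.cpu_acquire() and
	 * ops.cpu_release() see CPU ownership transitions correctly.
	 */
	scx_notify_pick_sketch(rq, p);

	return p;
}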

Both could be addressed more neatly if we could pick_task() atomically, and
that would likely make other things easier too. However, the benefits aren't
overwhelming either; it depends on how the overall tradeoff comes out.

> The only viable option other than overhauling the locking, is pushing rf
> into pick_task() and have that do all the lock dancing. This gets rid of
> that balance abuse (which is needed for dl_server) and also allows
> fixing that rt thing.
> 
> It just makes a giant mess of pick_task_scx() which might have to drop
> locks and retry/abort -- which you weren't very keen on, but yeah, it
> should work.

It does feel really fragile, though. Introducing an extra inner locking
layer makes sense to me, but I feel nervous about interlocking around a
dynamic lock pointer. It feels too easy to make subtle mistakes in the
update and visibility rules. It seems too clever to me; I'd much prefer it
to be a bit dumber.

> As to letting BPF do wild experiments; that's fine of course, but not
> exposing the actual locking requirements is like denying reality. You
> can't do lock-break in pick_task_scx() and then claim lockless or
> advanced locking -- that's just not true.
> 
> Also, you cannot claim bpf-sched author is clever enough to implement
> advanced locking, but then somehow not clever enough to deal with a
> simple interface to express locking to the core code. That feels
> disingenuous.

It's not about cleverness but more about the gap between the two execution
environments. For example, the following is a pure BPF spinlock
implementation that some schedulers have started using:

 https://github.com/sched-ext/scx/blob/main/scheds/include/scx/bpf_arena_spin_lock.h

The kernel isn't involved in any way. It's BPF code doing atomic ops,
managing waiters, and giving up if reasonable forward progress can't be
made. It all lives in BPF arena memory, which is not only readable but also
writable from userspace. All of this is completely opaque to the kernel.
It's all safe from the BPF side, but I don't see how we could interlock
with something like this from the kernel side, and we do want BPF to be
able to do things like this.
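
The gist of it, as a simplified plain-C sketch (not the actual
bpf_arena_spin_lock.h code, which also maintains a proper waiter queue), is
an atomic test-and-set with a bounded number of attempts so that the
program can always bail out:

/*
 * Simplified illustration only -- not the real bpf_arena_spin_lock.h.
 * The lock word lives in BPF arena memory, so userspace can read and
 * write it and the kernel never looks at it.
 */
struct arena_spinlock {
	int locked;		/* 0 = free, 1 = held */
};

/*
 * Bounded trylock: spin a limited number of times so the BPF program can
 * always give up and keep making forward progress.
 */
static inline int arena_spin_trylock_bounded(struct arena_spinlock *lock,
					     int max_spins)
{
	for (int i = 0; i < max_spins; i++) {
		/* atomically set to 1, returns the previous value */
		if (!__sync_lock_test_and_set(&lock->locked, 1))
			return 0;	/* acquired */
	}
	return -1;			/* gave up, caller must cope */
}

static inline void arena_spin_unlock(struct arena_spinlock *lock)
{
	__sync_lock_release(&lock->locked);	/* store 0 with release */
}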

> For all the DSQ based schedulers, this new locking really is an
> improvement, but if you don't want to constrain bpf-sched authors to
> reality, then perhaps only do the lock break dance for them?

Yes, I was on a similar train of thought. The only reasonable way I can
think of to solve this for BPF-managed tasks is giving each task its own
inner sched lock, which makes sense as all sched operations (except for
things like the watchdog) are per-task and we don't really need wider-scope
locking.

So, let's say we do that for BPF tasks. Then, why not do the same thing for
DSQ tasks too? It would provide all the necessary synchronization
guarantees. The only downside for tasks on a DSQ is that we'd grab one more
lock instead of piggy-backing on the DSQ lock. However, going back to my
queasiness about dynamic lock pointers, I'd rather go for the dumber
approach even if it's slightly less efficient, and I'm really doubtful that
we'd notice the extra lock overhead in any practical way.
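
To make the shape of that concrete, here is a rough sketch (the field and
function names are invented for illustration; this isn't a proposal for the
actual code):

/*
 * Hypothetical sketch -- names are made up. struct task_struct would grow
 * a per-task inner sched lock, e.g.:
 *
 *	raw_spinlock_t		scx_inner_lock;
 */

/*
 * Every sched_ext operation on @p would then take the task's own inner
 * lock, nested inside the rq lock, instead of piggy-backing on whichever
 * DSQ lock the task currently happens to hang off of.
 */
static void scx_task_op_sketch(struct rq *rq, struct task_struct *p)
{
	lockdep_assert_rq_held(rq);

	raw_spin_lock(&p->scx_inner_lock);
	/* ... operate on @p under a static, per-task lock ... */
	raw_spin_unlock(&p->scx_inner_lock);
}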

Thank you.

-- 
tejun
