Message-ID: <20250929100658.GC3245006@noisy.programming.kicks-ass.net>
Date: Mon, 29 Sep 2025 12:06:58 +0200
From: Peter Zijlstra <peterz@...radead.org>
To: Tejun Heo <tj@...nel.org>
Cc: linux-kernel@...r.kernel.org, mingo@...hat.com, juri.lelli@...hat.com,
vincent.guittot@...aro.org, dietmar.eggemann@....com,
rostedt@...dmis.org, bsegall@...gle.com, mgorman@...e.de,
vschneid@...hat.com, longman@...hat.com, hannes@...xchg.org,
mkoutny@...e.com, void@...ifault.com, arighi@...dia.com,
changwoo@...lia.com, cgroups@...r.kernel.org,
sched-ext@...ts.linux.dev, liuwenfang@...or.com, tglx@...utronix.de
Subject: Re: [PATCH 12/14] sched: Add shared runqueue locking to
__task_rq_lock()

On Fri, Sep 26, 2025 at 11:39:21AM -1000, Tejun Heo wrote:
> Hello,
>
> On Fri, Sep 26, 2025 at 12:36:28PM +0200, Peter Zijlstra wrote:
> > On Thu, Sep 25, 2025 at 11:43:18AM -1000, Tejun Heo wrote:
> > > Yes, I was on a similar train of thought. The only reasonable way that I can
> > > think of for solving this for BPF managed tasks is giving each task its own
> > > inner sched lock, which makes sense as all sched operations (except for
> > > things like watchdog) are per-task and we don't really need wider scope
> > > locking.
> >
> > Like I've said before; I really don't understand how that would be
> > helpful at all.
> >
> > How can you migrate a task by holding a per-task lock?
>
> Let's see whether I'm completely confused. Let's say we have p->sub_lock
> which is optionally grabbed by task_rq_lock() if requested by the current
> sched class (maybe it's a sched_class flag). Then, whoever is holding the
> sub_lock would exclude property and other changes to the task.
>
> In sched_ext, let's say p->sub_lock nests inside dsq locks. Also, right now,
> we're piggybacking on the rq lock for local DSQs. We'd need to make local DSQs
> use their own locks like user DSQs. Then,
>
> - If a task needs to be migrated either during enqueue through
> process_ddsp_deferred_locals() or during dispatch from BPF through
> finish_dispatch(): Leave rq locks alone. Grab sub_lock inside
> dispatch_to_local_dsq() after grabbing the target DSQ's lock.
>
> - scx_bpf_dsq_move_to_local() from dispatch: This is a bit tricky as we need
> to scan the tasks on the source DSQ to find the task to dispatch. However,
> there's a patch being worked on to add an RCU-protected pointer to the first
> task, which would be the task to be consumed in the vast majority of cases, so
> the fast path wouldn't be complicated - grab sub_lock, do the moving. If
> the first task isn't a good candidate, we'd have to grab DSQ lock, iterate
> looking for the right candidate, unlock DSQ and grab sub_lock (or
> trylock), and see if the task is still on the DSQ and then relock and
> remove.
>
> - scx_bpf_dsq_move() during BPF iteration: DSQ is unlocked during each
> iteration visit, so this is straightforward. Grab sub_lock and do the rest
> the same.
>
> Wouldn't something like the above provide equivalent synchronization as the
> dynamic lock approach? Whoever is holding sub_lock would be guaranteed that
> the task won't be migrating while the lock is held.
>
> However, thinking more about it. I'm unsure how e.g. the actual migration
> would work. The actual migration is done by: deactivate_task() ->
> set_task_cpu() -> switch rq locks -> activate_task(). Enqueueing/dequeueing
> steps have operations that depend on rq lock - psi updates, uclamp updates
> and so on. How would they work?

Suppose __task_rq_lock() will take rq->lock and p->sub_lock, in that
order, such that task_rq_lock() will take p->pi_lock, rq->lock and
p->sub_lock.
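
For concreteness, a rough sketch of what that would look like; p->sub_lock
is the hypothetical per-task lock from above, the retry loop mirrors the
existing helper in kernel/sched/core.c:

	struct rq *__task_rq_lock(struct task_struct *p, struct rq_flags *rf)
	{
		struct rq *rq;

		lockdep_assert_held(&p->pi_lock);

		for (;;) {
			rq = task_rq(p);
			raw_spin_rq_lock(rq);
			if (likely(rq == task_rq(p) && !task_on_rq_migrating(p))) {
				/* hypothetical: per-task lock nests inside rq->lock */
				raw_spin_lock(&p->sub_lock);
				rq_pin_lock(rq, rf);
				return rq;
			}
			raw_spin_rq_unlock(rq);

			while (unlikely(task_on_rq_migrating(p)))
				cpu_relax();
		}
	}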

Then something like:

	guard(task_rq_lock)(p);
	scoped_guard (sched_change, p, ...) {
		// change me
	}

Will end up doing something like:

	// task_rq_lock
	IRQ-DISABLE
	LOCK p->pi_lock
1:
	rq = task_rq(p);
	LOCK rq->lock;
	if (rq != task_rq(p)) {
		UNLOCK rq->lock
		goto 1;
	}
	LOCK p->sub_lock

	// sched_change
	dequeue_task() := dequeue_task_scx()
	  LOCK dsq->lock

While at the same time, above you argued p->sub_lock should be inside
dsq->lock. Because:

	__schedule()
	  rq = this_rq();
	  LOCK rq->lock
	  next = pick_next() := pick_next_scx()
	    LOCK dsq->lock
	    p = find_task(dsq);
	    LOCK p->sub_lock
	    dequeue(dsq, p);
	    UNLOCK dsq->lock

Because if you did something like:

	__schedule()
	  rq = this_rq();
	  LOCK rq->lock
	  next = pick_next() := pick_next_scx()
	    LOCK dsq->lock (or RCU, doesn't matter)
	    p = find_task(dsq);
	    UNLOCK dsq->lock

				// meanwhile, a migration on another CPU:
				migrate:
				  LOCK p->pi_lock
				  rq = task_rq(p)
				  LOCK rq->lock
				  (verify bla bla)
				  LOCK p->sub_lock
				  LOCK dsq->lock
				  dequeue(dsq, p)
				  UNLOCK dsq->lock
				  set_task_cpu(n);
				  UNLOCK rq->lock
				  rq = cpu_rq(n);
				  LOCK rq->lock		(inversion vs p->sub_lock)
				  LOCK dsq2->lock
				  enqueue(dsq2, p)
				  UNLOCK dsq2->lock

	    LOCK p->sub_lock
	    LOCK dsq->lock (whoopsie, p is on dsq2)
	    dequeue(dsq, p)
	    set_task_cpu(here);
	    UNLOCK dsq->lock

That is, either way around, dsq->lock outside with p->sub_lock inside or
the other way around, I end up with inversions and race conditions that
are not fun.

Also, if you do put p->sub_lock inside dsq->lock, this means
__task_rq_lock() cannot take it and it needs to be pushed deep into scx
(possibly into BPF?), and then I'm not sure how to do the change
pattern sanely.

Having __task_rq_lock() take p->dsq->lock solves all these problems,
except for that one weird case where BPF wants to do things its own
way. The longer I think about it, the more I dislike that. I just
don't see *ANY* upside from allowing BPF to do this while it makes
everything else quite awkward.

The easy fix is to have these BPF-managed things have a single global
lock. That works and is correct. Then if they want something better,
they can use DSQs :-)
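
Something like the below sketch would do; scx_task_lock() and
scx_bpf_global_lock are names made up here for illustration, while the
dsq pointer and dsq->lock are fields sched_ext already has:

	/* sketch only; the global lock is made up for illustration */
	static DEFINE_RAW_SPINLOCK(scx_bpf_global_lock);

	static raw_spinlock_t *scx_task_lock(struct task_struct *p)
	{
		struct scx_dispatch_q *dsq = p->scx.dsq;

		/* on a DSQ: its lock covers every CPU that can pick from it */
		if (dsq)
			return &dsq->lock;

		/* BPF-managed, not on any DSQ: fall back to the one global lock */
		return &scx_bpf_global_lock;
	}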

Fundamentally, we need the DSQ->lock to cover all CPUs that will pick
from it; there is no wiggle room there. Also note that while the change
pattern modifies the attributes of only a single task, that affects the
whole RQ, since a runqueue is an aggregate of all its tasks. This is
very much why we dequeue/enqueue around the change pattern: to keep
the runqueue aggregates updated.
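
For reference, that change pattern is roughly the usual sandwich
(flags and corner cases elided):

	bool queued = task_on_rq_queued(p);
	bool running = task_current(rq, p);

	if (queued)
		dequeue_task(rq, p, DEQUEUE_SAVE | DEQUEUE_NOCLOCK);
	if (running)
		put_prev_task(rq, p);

	/* change p's scheduling attributes here */

	if (queued)
		enqueue_task(rq, p, ENQUEUE_RESTORE | ENQUEUE_NOCLOCK);
	if (running)
		set_next_task(rq, p);
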
Use the BPF thing to play with scheduling policies, but leave the
locking to the core code.