lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <20250929100658.GC3245006@noisy.programming.kicks-ass.net>
Date: Mon, 29 Sep 2025 12:06:58 +0200
From: Peter Zijlstra <peterz@...radead.org>
To: Tejun Heo <tj@...nel.org>
Cc: linux-kernel@...r.kernel.org, mingo@...hat.com, juri.lelli@...hat.com,
	vincent.guittot@...aro.org, dietmar.eggemann@....com,
	rostedt@...dmis.org, bsegall@...gle.com, mgorman@...e.de,
	vschneid@...hat.com, longman@...hat.com, hannes@...xchg.org,
	mkoutny@...e.com, void@...ifault.com, arighi@...dia.com,
	changwoo@...lia.com, cgroups@...r.kernel.org,
	sched-ext@...ts.linux.dev, liuwenfang@...or.com, tglx@...utronix.de
Subject: Re: [PATCH 12/14] sched: Add shared runqueue locking to
 __task_rq_lock()

On Fri, Sep 26, 2025 at 11:39:21AM -1000, Tejun Heo wrote:
> Hello,
> 
> On Fri, Sep 26, 2025 at 12:36:28PM +0200, Peter Zijlstra wrote:
> > On Thu, Sep 25, 2025 at 11:43:18AM -1000, Tejun Heo wrote:
> > > Yes, I was on a similar train of thought. The only reasonable way that I can
> > > think of for solving this for BPF managed tasks is giving each task its own
> > > inner sched lock, which makes sense as all sched operations (except for
> > > things like watchdog) are per-task and we don't really need wider scope
> > > locking.
> > 
> > Like I've said before; I really don't understand how that would be
> > helpful at all.
> > 
> > How can you migrate a task by holding a per-task lock?
> 
> Let's see whether I'm completely confused. Let's say we have p->sub_lock
> which is optionally grabbed by task_rq_lock() if requested by the current
> sched class (maybe it's a sched_class flag). Then, whoever is holding the
> sub_lock would exclude property and other changes to the task.
> 
> In sched_ext, let's say p->sub_lock nests inside dsq locks. Also, right now,
> we're piggy backing on rq lock for local DSQs. We'd need to make local DSQs
> use their own locks like user DSQs. Then,
> 
> - If a task needs to be migrated either during enqueue through
>   process_ddsp_deferred_locals() or during dispatch from BPF through
>   finish_dispatch(): Leave rq locks alone. Grab sub_lock inside
>   dispatch_to_local_dsq() after grabbing the target DSQ's lock.
> 
> - scx_bpf_dsq_move_to_local() from dispatch: This is a bit tricky as we need
>   to scan the tasks on the source DSQ to find the task to dispatch. However,
>   there's a patch being worked on to add rcu protected pointer to the first
>   task which would be the task to be consumed in vast majority of cases, so
>   the fast path wouldn't be complicated - grab sub_lock, do the moving. If
>   the first task isn't a good candidate, we'd have to grab DSQ lock, iterate
>   looking for the right candidate, unlock DSQ and grab sub_lock (or
>   trylock), and see if the task is still on the DSQ and then relock and
>   remove.
> 
> - scx_bpf_dsq_move() during BPF iteration: DSQ is unlocked during each
>   iteration visit, so this is straightforward. Grab sub-lock and do the rest
>   the same.
> 
> Wouldn't something like the above provide equivalent synchronization as the
> dynamic lock approach? Whoever is holding sub_lock would be guaranteed that
> the task won't be migrating while the lock is held.
> 
> However, thinking more about it. I'm unsure how e.g. the actual migration
> would work. The actual migration is done by: deactivate_task() ->
> set_task_cpu() -> switch rq locks -> activate_task(). Enqueueing/dequeueing
> steps have operations that depend on rq lock - psi updates, uclamp updates
> and so on. How would they work?

Suppose __task_rq_lock() will take rq->lock and p->sub_lock, in that
order, such that task_rq_lock() will take p->pi_lock, rq->lock and
p->sub_lock.

Then something like:

  guard(task_rq_lock)(p);
  scoped_guard (sched_change, p, ...) {
      // change me
  }

Will end up doing something like:

  // task_rq_lock
  IRQ-DISABLE
  LOCK pi->lock
1:
  rq = task_rq(p);
  LOCK rq->lock;
  if (rq != task_rq(p)) {
    UNLOCK rq->lock
    goto 1;
  }
  LOCK p->sub_lock

  // sched_change
  dequeue_task() := dequeue_task_scx()
    LOCK dsq->lock

While at the same time, above you argued p->sub_lock should be inside
dsq->lock. Because:

__schedule()
  rq = this_rq();
  LOCK rq->lock
  next = pick_next() := pick_next_scx()
    LOCK dsq->lock
    p = find_task(dsq);
    LOCK p->sub_lock
    dequeue(dsq, p);
    UNLOCK dsq->lock

Because if you did something like:

__schedule()
  rq = this_rq();
  LOCK rq->lock
  next = pick_next() := pick_next_scx()
    LOCK dsq->lock (or RCU, doesn't matter)
    p = find_task(dsq);
    UNLOCK dsq->lock
				migrate:
				LOCK p->pi_lock
				rq = task_rq(p)
				LOCK rq->lock
				(verify bla bla)
				LOCK p->sub_lock
				LOCK dsq->lock
				dequeue(dsq, p)
				UNLOCK dsq->lock
				set_task_cpu(n);
				UNLOCK rq->lock
				rq = cpu_rq(n);
				LOCK rq->lock (inversion vs p->sub_lock)
				LOCK dsq2->lock
				enqueue(dsq2, p)
				UNLOCK dsq2->lock

    LOCK p->sub_lock
    LOCK dsq->lock   (whoopsie, p is on dsq2)
    dequeue(dsq, p)
    set_task_cpu(here);
    UNLOCK dsq->lock


That is, either way around: dsq->lock outside, p->sub_lock inside, or
the other way around, I emd up with inversions and race conditions that
are not fun.

Also, if you do put p->sub_lock inside dsq->lock, this means
__task_rq_lock() cannot take it and it needs to be pushed deep into scx
(possibly into bpf ?) and that means I'm not sure how to do the change
pattern sanely.

Having __task_rq_lock() take p->dsq->lock solves all these problems,
except for that one weird case where BPF wants to do things their own
way. The longer I'm thinking about it, the more I dislike that. I just
don't see *ANY* upside from allowing BPF to do this while it is making
everything else quite awkward.

The easy fix is to have these BPF managed things have a single global
lock. That works and is correct. Then if they want something better,
they can use DSQs :-)

Fundamentally, we need the DSQ->lock to cover all CPUs that will pick
from it, there is no wiggle room there. Also note that while we change
only the attributes of a single task with the change pattern, that
affects the whole RQ, since a runqueue is an aggregate of all tasks.
This is very much why dequeue/enqueue around the change pattern, to keep
the runqueue aggregates updated.

Use the BPF thing to play with scheduling policies, but leave the
locking to the core code.

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ