Message-ID: <20251008091151.GS4067720@noisy.programming.kicks-ass.net>
Date: Wed, 8 Oct 2025 11:11:51 +0200
From: Peter Zijlstra <peterz@...radead.org>
To: Tejun Heo <tj@...nel.org>
Cc: linux-kernel@...r.kernel.org, mingo@...nel.org, juri.lelli@...hat.com,
	vincent.guittot@...aro.org, dietmar.eggemann@....com,
	rostedt@...dmis.org, bsegall@...gle.com, mgorman@...e.de,
	vschneid@...hat.com, longman@...hat.com, hannes@...xchg.org,
	mkoutny@...e.com, void@...ifault.com, arighi@...dia.com,
	changwoo@...lia.com, cgroups@...r.kernel.org,
	sched-ext@...ts.linux.dev, liuwenfang@...or.com, tglx@...utronix.de,
	alexei.starovoitov@...il.com
Subject: Re: [RFC][PATCH 0/3] sched/ext: Cleanup pick_task_scx()

On Tue, Oct 07, 2025 at 11:48:15AM -1000, Tejun Heo wrote:
> On Mon, Oct 06, 2025 at 12:46:52PM +0200, Peter Zijlstra wrote:
> > Hi,
> > 
> > So I had a poke at 'give @rf to pick_task() and fold balance_scx() into
> > pick_task_scx()' option to see how terrible it was. Turns out, not terrible at
> > all.
> > 
> > I've run the sched_ext selftest and stress-ng --race-sched 0 thing with various
> > scx_* thingies on.
> 
> This is great. I was thinking that I needed to call pick_task() of other
> classes to detect the retry conditions but yeah enqueue() must be the
> triggering event and this is way neater. 

:-)

> Does this mean that balance() can be dropped from other classes too?

Possibly. But let's stick that on a todo list :-) I was also looking to
get rid of sched_class::pick_next_task() -- the only reason it currently
still exists is that fair cgroup nonsense.

I was looking at bringing back Rik's flatten series (easier now that the
cfs bandwidth throttle thing is fixed). That should get rid of the
hierarchical pick and thus obviate that whole mess in
pick_next_task_fair().

> For the whole series:
> 
>  Acked-by: Tejun Heo <tj@...nel.org>
> 

Thanks!

Now, I do have a few questions from staring at this ext stuff for a
while:

 - So the 'RT' problem with balance_one() is due to:

    o rq-lock-break allowing higher prio task to come in
    o placing a task on the local dsq

   which results in that local dsq getting 'starved'. The patches want
   to call switch_class(), which I suppose will work; something like:

   if (rq_modified_above(rq, &ext_sched_class)) {
      /*
       * We don't have a next task at this point, but can guess at its
       * class based on the highest set bit in the queue_mask.
       */
      scx_cpu_release(reason_from_mask(rq->queue_mask), NULL);
      return RETRY_TASK;
   }

   But I was thinking that we could also just stick that task back onto
   some global dsq, right? (presumably the one we just pulled it from is
   a good target). This would effectively 'undo' the balance_one().
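
   Purely as an illustration of that 'undo' (the helpers below,
   local_dsq_pop() and return_to_prev_dsq(), are made-up names, not
   existing kernel functions), something like:

   if (rq_modified_above(rq, &ext_sched_class)) {
      /*
       * Undo balance_one(): take the task we just put on the local
       * dsq and hand it back to the (global) dsq we pulled it from,
       * so the higher prio class can run without starving it.
       */
      struct task_struct *p = local_dsq_pop(rq);

      if (p)
         return_to_prev_dsq(rq, p);
      return RETRY_TASK;
   }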


 - finish_dispatch() -- one of the problems that I ran into with the
   shared rq lock implementation is that pick_task_scx() will in fact
   try to enqueue on a non-local dsq.

   The callchain is something like:

     pick_task_scx()
       bpf__sched_ext_ops_dispatch()
         scx_bpf_dsq_move_to_local()
           flush_dispatch_buf()
             dispatch_enqueue() // dsq->id != SCX_DSQ_LOCAL
 
   And this completely messes up the locking -- I'm not sure how to fix
   this yet. Having this flush do random other things in various code
   paths really complicates matters: per the function name, what we
   really want is a move-to-local, but we end up doing random other
   things instead :/
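
   Roughly, the shape of that flush is (hand-waving; the loop helper and
   move_to_local() below are illustrative only, not the actual ext.c
   code):

     flush_dispatch_buf()                        // runs with this rq's lock held
       for_each_buffered_dispatch_entry(p, dsq) {
         if (dsq->id == SCX_DSQ_LOCAL)
           move_to_local(rq, p);                 // what we actually asked for
         else
           dispatch_enqueue(dsq, p, flags);      // takes some other dsq's lock
       }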


 - finish_dispatch() -- while chasing the above problem I read this
   function and found that:

     "the BPF scheduler is allowed to dispatch tasks spuriously"

   and I had to go and buy a new WTF'o'meter again :/ Why would you
   allow such a thing? Detect the case because the BPF thing is
   untrusted and can do crazy things, sure. But then kill it dead; don't
   try to excuse badness.


 - scx_bpf_dsq_move_to_local() -- also found via the above problem; per
   its comment it is possible that BPF calls this with its own locks
   held. This then results in:

   CPU1

   try_to_wake_up()
     rq_lock();
     enqueue_task() := enqueue_task_scx()
       bpf__sched_ext_ops_something_or_other()
         your bpf area lock thing

   // rq->lock
   // bpf area lock

   CPU2

   bpf__sched_ext_whatever()
     bpf area lock
       scx_bpf_dsq_move_to_local()
         rq_lock()

   // bpf area lock
   // rq->lock

   and we have a deadlock -- I thought BPF was supposed to be safe? And
   while the recent rqspinlock has a timeout, and there the BPF verifier
   knows a spinlock exists and can prohibit kernel helper calls, this
   bpf area lock you have has no such protections (afaict) and BPF can
   completely mess up the kernel. How is this okay?
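
   To make the lock order inversion explicit, here is a tiny userspace
   model of the above (pthread mutexes standing in for rq->lock and the
   bpf area lock; none of this is actual kernel code):

   #include <pthread.h>
   #include <unistd.h>

   static pthread_mutex_t rq_lock       = PTHREAD_MUTEX_INITIALIZER;
   static pthread_mutex_t bpf_area_lock = PTHREAD_MUTEX_INITIALIZER;

   /* CPU1: try_to_wake_up() -> enqueue_task_scx() -> bpf callback */
   static void *cpu1(void *arg)
   {
           pthread_mutex_lock(&rq_lock);           /* rq->lock */
           usleep(10000);                          /* widen the race window */
           pthread_mutex_lock(&bpf_area_lock);     /* bpf area lock */
           pthread_mutex_unlock(&bpf_area_lock);
           pthread_mutex_unlock(&rq_lock);
           return NULL;
   }

   /* CPU2: bpf callback -> scx_bpf_dsq_move_to_local() */
   static void *cpu2(void *arg)
   {
           pthread_mutex_lock(&bpf_area_lock);     /* bpf area lock */
           usleep(10000);
           pthread_mutex_lock(&rq_lock);           /* rq->lock */
           pthread_mutex_unlock(&rq_lock);
           pthread_mutex_unlock(&bpf_area_lock);
           return NULL;
   }

   int main(void)
   {
           pthread_t t1, t2;

           pthread_create(&t1, NULL, cpu1, NULL);
           pthread_create(&t2, NULL, cpu2, NULL);
           pthread_join(t1, NULL);                 /* typically hangs: ABBA deadlock */
           pthread_join(t2, NULL);
           return 0;
   }

   Build with cc -pthread and run; both threads end up waiting on each
   other's lock, which is exactly the rq->lock vs bpf area lock
   inversion sketched above.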


