Message-ID: <aObH3RNKbRX1FXNn@slm.duckdns.org>
Date: Wed, 8 Oct 2025 10:21:49 -1000
From: Tejun Heo <tj@...nel.org>
To: Peter Zijlstra <peterz@...radead.org>
Cc: linux-kernel@...r.kernel.org, mingo@...nel.org, juri.lelli@...hat.com,
	vincent.guittot@...aro.org, dietmar.eggemann@....com,
	rostedt@...dmis.org, bsegall@...gle.com, mgorman@...e.de,
	vschneid@...hat.com, longman@...hat.com, hannes@...xchg.org,
	mkoutny@...e.com, void@...ifault.com, arighi@...dia.com,
	changwoo@...lia.com, cgroups@...r.kernel.org,
	sched-ext@...ts.linux.dev, liuwenfang@...or.com, tglx@...utronix.de,
	alexei.starovoitov@...il.com
Subject: Re: [RFC][PATCH 0/3] sched/ext: Cleanup pick_task_scx()

Hello,

On Wed, Oct 08, 2025 at 11:11:51AM +0200, Peter Zijlstra wrote:
...
> Possibly. But lets stick that on a todo list :-) I was also looking to
> get rid of sched_class::pick_next_task() -- the only reason that
> currently still exists is because of that fair cgroup nonsense.
> 
> I was looking at bringing back Rik's flatten series (easier now that the
> cfs bandwidth throttle thing is fixed). That should get rid of that
> hierarchical pick; and thus obviate that whole mess in
> pick_next_task_fair().

Yeah, that'd be great. IIRC, the blocking issue was dealing with thundering
herds. ISTR Rik discussing a way to resolve that.

> Now, I do have a few questions from staring at this ext stuff for a
> while:
> 
>  - So the 'RT' problem with balance_one() is due to:
> 
>     o rq-lock-break allowing higher prio task to come in
>     o placing a task on the local dsq
> 
>    which results in that local-dsq getting 'starved'. The patches want
>    to call switch_class(), which I suppose will work, this is something
>    like:
> 
>    if (rq_modified_above(rq, &ext_sched_class)) {
>       /*
>        * We don't have a next task at this point, but can guess at its
>        * class based on the highest set bit in the queue_mask.
>        */
>       scx_cpu_release(reason_from_mask(rq->queue_mask), NULL);
>       return RETRY_TASK;
>    }
> 
>    But I was thinking that we could also just stick that task back onto
>    some global dsq, right? (presumably the one we just pulled it from is
>    a good target). This would effectively 'undo' the balance_one().

From ops.cpu_release(), the BPF scheduler can call scx_bpf_reenqueue_local(),
which runs the enqueue path again for the queued tasks so that the scheduler
can decide what to do with them, including sticking them into a global DSQ or
putting them back on the source DSQ. Yeah, it may make sense to make some
built-in actions available through ops flags.

>  - finish_dispatch() -- one of the problems that I ran into with the
>    shared rq lock implementation is that pick_task_scx() will in fact
>    try and enqueue on a non-local dsq.
> 
>    The callchain is something like:
> 
>      pick_task_scx()
>        bpf__sched_ext_ops_dispatch()
>          scx_bpf_dsq_move_to_local()
> 	   flush_dispatch_buf()
> 	     dispatch_enqueue() // dsq->id != SCX_DSQ_LOCAL
>  
>    And this completely messes up the locking -- I'm not sure how to fix
>    this yet. But adding this flush to do random other things to various
>    code paths really complicates things. Per the function what we really
>    want to do is move-to-local, but then we end up doing random other
>    things instead :/

Hmm... this is the deferred execution of tasks being dispatched to non-local
DSQs. They're deferred to avoid entanglement with BPF-side synchronization.
BPF-side locking ended up developing differently from what we anticipated
back then, though, so we may be able to remove the deferred operations. I
need to think more about that.

However, I'm not sure that would change much w.r.t. how it interacts with
core locking. While deferred, it's deferred within ops.dispatch(), i.e.
ops.dispatch() is allowed to move any task to any DSQ. Whether that happens
eagerly or deferred within ops.dispatch() shouldn't make much difference.

>  - finish_dispatch() -- per the above problem I read this function and
>    found that:
> 
>      "the BPF scheduler is allowed to dispatch tasks spuriously"
> 
>    and I had to go and buy a new WTF'o'meter again :/ Why would you
>    allow such a thing? Detect the case because the BPF thing is
>    untrusted and can do crazy things, sure. But then kill it dead; don't
>    try and excuse badness.

This came to be because the BPF data structures we were playing with earlier
didn't easily support arbitrary element removals. The kernel side needs to be
able to reliably detect whether a dispatch attempt is allowed anyway, so, as
a workaround, instead of aborting the scheduler, I just made it ignore
spurious attempts. With the new arena-based data structures, this shouldn't
be a problem anymore, and it'd make sense to make this stricter.

>  - scx_bpf_dsq_move_to_local() found per the above problem, but per its
>    comment it is possible BPF calls this with its own locks held. This
>    then results in:
> 
>    CPU1 
> 
>    try_to_wake_up()
>      rq_lock();
>      enqueue_task() := enqueue_task_scx()
>        bpf__sched_ext_ops_something_or_other()
>          your bpf area lock thing
> 
>    // rq->lock
>    // bpf area lock
> 
>    CPU2
>      bpf__sched_ext_whatever()
>        bpf area lock
>          scx_bpf_move_to_local()
> 	   rq_lock()
> 
>    // bpf area lock
>    // rq->lock
> 
>   and we have a deadlock -- I thought BPF was supposed to be safe? And
>   while the recent rqspinlock has a timeout, and there the bpf validator
>   knows a spinlock exists and can prohibit kernel helper calls, this
>   bpf area lock you have has no such things (afaict) and BPF can
>   completely mess up the kernel. How is this okay?

BPF arena locks are not allowed to spin indefinitely: an acquisition attempt
breaks out of spinning and fails. Right now, such failures need to be
handled manually, but once BPF program abort support is merged, we should be
able to auto-abort schedulers.

Thanks.

-- 
tejun
