linux-kernel - Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics


Message-ID: <aXp9xSpYhaysBLQ2@slm.duckdns.org>
Date: Wed, 28 Jan 2026 11:21:09 -1000
From: Tejun Heo <tj@...nel.org>
To: Andrea Righi <arighi@...dia.com>
Cc: David Vernet <void@...ifault.com>, Changwoo Min <changwoo@...lia.com>,
	Kuba Piecuch <jpiecuch@...gle.com>,
	Christian Loehle <christian.loehle@....com>,
	Daniel Hodges <hodgesd@...a.com>, sched-ext@...ts.linux.dev,
	linux-kernel@...r.kernel.org,
	Emil Tsalapatis <emil@...alapatis.com>
Subject: Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics

Hello,

On Mon, Jan 26, 2026 at 09:41:49AM +0100, Andrea Righi wrote:
> @@ -1287,6 +1287,20 @@ static void direct_dispatch(struct scx_sched *sch, struct task_struct *p,
>  
>  	p->scx.ddsp_enq_flags |= enq_flags;
>  
> +	/*
> +	 * The task is about to be dispatched. If ops.enqueue() was called,
> +	 * notify the BPF scheduler by calling ops.dequeue().
> +	 *
> +	 * Keep %SCX_TASK_OPS_ENQUEUED set so that subsequent property
> +	 * changes can trigger ops.dequeue() with %SCX_DEQ_SCHED_CHANGE.
> +	 * Mark that the dispatch dequeue has been called to distinguish
> +	 * from property change dequeues.
> +	 */
> +	if (SCX_HAS_OP(sch, dequeue) && (p->scx.flags & SCX_TASK_OPS_ENQUEUED)) {
> +		SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq, p, 0);
> +		p->scx.flags |= SCX_TASK_DISPATCH_DEQUEUED;
> +	}

1. When to call ops.dequeue()?

I'm not sure that the decision to call ops.dequeue() should be based solely
on whether ops.enqueue() was called. Direct dispatch has been expanded to
include other DSQs but was originally added as a way to shortcut the
dispatch path and "dispatch directly" for execution from the
ops.select_cpu/enqueue() paths, i.e. when a task is dispatched directly to a
local DSQ, the BPF scheduler is done with that task - the task is now in the
same state as tasks that get dispatched to a local DSQ from ops.dispatch().

In other words, what effectively decides whether a task left the BPF
scheduler is whether the task reached a local DSQ or not, and direct
dispatching into a local DSQ shouldn't trigger ops.dequeue() - the task
never really "queues" on the BPF scheduler.

This creates another discrepancy. From ops.enqueue(), direct dispatching
into a non-local DSQ clearly makes the task enter the BPF scheduler and thus
its departure should trigger ops.dequeue(). What about a task which is
direct dispatched to a non-local DSQ from ops.select_cpu()? Superficially,
the right thing to do seems to be to skip ops.dequeue(). After all, the task
has never been ops.enqueue()'d. However, I think this is another case where
what's obvious doesn't agree with what's happening underneath.

ops.select_cpu() cannot actually queue anything. It's too early. Direct
dispatch from ops.select_cpu() is a shortcut to schedule direct dispatch
once the enqueue path is invoked so that the BPF scheduler can avoid
invocation of ops.enqueue() when the decision has already been made. While
this shortcut was added for convenience (so that e.g. the BPF scheduler
doesn't have to pass a note from ops.select_cpu() to ops.enqueue()), it has
real performance implications as it does save a roundtrip through
ops.enqueue() and we know that such overheads do matter for some use cases
(e.g. maximizing FPS on certain games).
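
As a rough illustration of the shortcut described above, here is a minimal
userspace sketch, not kernel code. All toy_* names, the struct layout and
the TOY_DSQ_NONE sentinel are hypothetical and exist only for this example.
It models ops.select_cpu() leaving a note, and the enqueue path later acting
on that note without calling ops.enqueue().

```c
#include <assert.h>
#include <stddef.h>

#define TOY_DSQ_NONE (-1)	/* hypothetical "no DSQ" sentinel */

/* Hypothetical per-task state for this sketch only. */
struct toy_ddsp_task {
	int pending_dsq;	/* DSQ noted by select_cpu, or TOY_DSQ_NONE */
	int queued_on;		/* DSQ the task was actually inserted into */
	int nr_ops_enqueue;	/* times the modelled ops.enqueue() ran */
};

/* Models direct dispatch from ops.select_cpu(): nothing is queued yet,
 * the decision is only recorded for the enqueue path. */
static void toy_select_cpu_direct(struct toy_ddsp_task *t, int dsq)
{
	assert(t != NULL);
	t->pending_dsq = dsq;
}

/* Models the enqueue path: if a decision was already recorded, apply it
 * and skip ops.enqueue(); otherwise fall back to ops.enqueue(), which in
 * this sketch simply picks fallback_dsq. */
static void toy_enqueue_path(struct toy_ddsp_task *t, int fallback_dsq)
{
	assert(t != NULL);
	if (t->pending_dsq != TOY_DSQ_NONE) {
		t->queued_on = t->pending_dsq;
		t->pending_dsq = TOY_DSQ_NONE;
		return;
	}
	t->nr_ops_enqueue++;
	t->queued_on = fallback_dsq;
}
```

In this model the task only lands on a DSQ inside toy_enqueue_path(), which
is the sense in which ops.select_cpu() "cannot actually queue anything".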

So, while more subtle on the surface, I think the right thing to do is to
base the decision to call ops.dequeue() on the task's actual state -
ops.dequeue() should be called if the task is "on" the BPF scheduler, i.e.
if the task ran the ops.select_cpu/enqueue() paths and ended up in a
non-local DSQ or on the BPF side.

The subtlety would need clear documentation and we probably want to allow
ops.dequeue() to distinguish different cases. If you boil it down to the
actual task state, I don't think it's that subtle - if a task is in the
custody of the BPF scheduler, ops.dequeue() will be called. Otherwise, not.
Note that, this way, whether ops.dequeue() needs to be called agrees with
whether the task needs to be dispatched to run.
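
The custody rule above can be written down as a small predicate. This is a
userspace sketch under stated assumptions, not the kernel implementation:
the toy_* names and the three-way location enum are hypothetical and only
model "local DSQ" vs. "non-local DSQ" vs. "held on the BPF side".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of where a task sits after the
 * ops.select_cpu()/ops.enqueue() paths have run. */
enum toy_task_loc {
	TOY_LOC_LOCAL_DSQ,	/* reached a local DSQ (direct or via ops.dispatch()) */
	TOY_LOC_NON_LOCAL_DSQ,	/* sitting in a non-local DSQ */
	TOY_LOC_BPF_SIDE,	/* held in the BPF scheduler's own structures */
};

struct toy_task {
	enum toy_task_loc loc;
};

/* A task is in the BPF scheduler's custody unless it reached a local DSQ. */
static bool toy_in_bpf_custody(const struct toy_task *t)
{
	assert(t != NULL);
	return t->loc != TOY_LOC_LOCAL_DSQ;
}

/* ops.dequeue() would be called exactly for tasks in custody. */
static bool toy_needs_ops_dequeue(const struct toy_task *t)
{
	return toy_in_bpf_custody(t);
}

/* The same condition says whether the task still has to be dispatched
 * before it can run. */
static bool toy_needs_dispatch(const struct toy_task *t)
{
	return toy_in_bpf_custody(t);
}
```

Both predicates deliberately reduce to the same test, which is the point:
how the task got there (ops.select_cpu() or ops.enqueue()) does not enter
into it.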

2. Why keep %SCX_TASK_OPS_ENQUEUED for %SCX_DEQ_SCHED_CHANGE?

Wouldn't that lead to calling ops.dequeue() more than once for the same
enqueue event? If the BPF scheduler is told that the task has left it
already, why does it matter whether the task gets dequeued for sched change
afterwards? e.g. from the BPF sched's POV, it shouldn't matter whether the
task is still on the local DSQ or already running, in which case the sched
class's dequeue() wouldn't be called in the first place, no?
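
To make the concern concrete, here is a minimal userspace sketch of the
alternative the question implies: clear the "in custody" state when the task
leaves, so a later sched change cannot produce a second notification. This
is only one possible reading, not proposed kernel code. The toy_* names are
hypothetical, and nr_ops_dequeue merely stands in for invoking
ops.dequeue().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-task state for this sketch only. */
struct toy_entity {
	bool in_custody;	/* true while the BPF scheduler holds the task */
	int nr_ops_dequeue;	/* times the modelled ops.dequeue() was called */
};

/* Models the task entering the BPF scheduler (one enqueue event). */
static void toy_enter_custody(struct toy_entity *e)
{
	assert(e != NULL);
	e->in_custody = true;
}

/* Models any departure: reaching a local DSQ or a sched change dequeue.
 * The notification fires at most once per enqueue event because the
 * custody state is cleared on the first departure. */
static void toy_leave_custody(struct toy_entity *e)
{
	assert(e != NULL);
	if (!e->in_custody)
		return;
	e->in_custody = false;
	e->nr_ops_dequeue++;
}
```

Calling toy_leave_custody() twice after a single toy_enter_custody() leaves
nr_ops_dequeue at 1, whereas keeping the flag set after the first departure
would let the second call notify again.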

Thanks.

-- 
tejun