Message-ID: <DG1Y8XVKWV9O.24LWXW2G4RF63@google.com>
Date: Fri, 30 Jan 2026 13:14:23 +0000
From: Kuba Piecuch <jpiecuch@...gle.com>
To: Andrea Righi <arighi@...dia.com>, Kuba Piecuch <jpiecuch@...gle.com>
Cc: Tejun Heo <tj@...nel.org>, David Vernet <void@...ifault.com>, 
	Changwoo Min <changwoo@...lia.com>, Christian Loehle <christian.loehle@....com>, 
	Daniel Hodges <hodgesd@...a.com>, <sched-ext@...ts.linux.dev>, 
	<linux-kernel@...r.kernel.org>, Emil Tsalapatis <emil@...alapatis.com>
Subject: Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics

Hi Andrea,

On Fri Jan 30, 2026 at 7:34 AM UTC, Andrea Righi wrote:
...
> Good point, the confusion is on my side, the documentation overloads the
> term "enqueued" and doesn't clearly distinguish the different contexts.
>
> In that paragraph, "enqueued" refers to the ops lifecycle (i.e., a task for
> which ops.enqueue() has been called and whose scheduler-visible state is
> being tracked), not to the task being queued on a DSQ or having
> SCX_TASK_QUEUED set.
>
> The intent is to treat ops.enqueue() and ops.dequeue() as the boundaries of
> a scheduler-visible lifecycle, regardless of whether the task is eventually
> queued on a DSQ or dispatched directly.
>
> And as noted by Tejun in his last email, skipping ops.dequeue() for direct
> dispatches also makes sense, since in that case no new ops lifecycle is
> established (direct dispatch in ops.select_cpu() or ops.enqueue() can be
> seen as a shortcut to bypass the scheduler).

Right, skipping ops.dequeue() for direct dispatches makes sense, provided
the task is being dispatched to a local/global DSQ. Or at least that's my
takeaway after reading Tejun's email.
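
For concreteness, that "shortcut" would look roughly like this from the BPF
side (just a sketch, assuming the scx_bpf_dsq_insert() kfunc name
(scx_bpf_dispatch() in older trees) and the usual scx/common.bpf.h helpers):

#include <scx/common.bpf.h>

/*
 * Sketch only: direct dispatch from ops.select_cpu() straight into the
 * local DSQ when an idle CPU is found.  The task never goes through
 * ops.enqueue(), so no ops lifecycle is established and no ops.dequeue()
 * is expected for it either.
 */
s32 BPF_STRUCT_OPS(sketch_select_cpu, struct task_struct *p,
		   s32 prev_cpu, u64 wake_flags)
{
	bool is_idle = false;
	s32 cpu;

	cpu = scx_bpf_select_cpu_dfl(p, prev_cpu, wake_flags, &is_idle);
	if (is_idle)
		scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, 0);

	return cpu;
}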

...
>> 
>> If I understand the logic correctly, ops.dequeue(SCHED_CHANGE) will be called
>> for a task at most once between it being dispatched and taken off the CPU,
>> even if its properties are changed multiple times while it's on CPU.
>> Is that intentional? I don't see it documented.
>> 
>> To illustrate, assume we have a task p that has been enqueued, dispatched, and
>> is currently running on the CPU, so we have both SCX_TASK_OPS_ENQUEUED and
>> SCX_TASK_DISPATCH_DEQUEUED set in p->scx.flags.
>> 
>> When a property of p is changed while it runs on the CPU,
>> the sequence of calls is:
>>   dequeue_task_scx(p, DEQUEUE_SAVE) => put_prev_task_scx(p) =>
>>   (change property) => enqueue_task_scx(p, ENQUEUE_RESTORE) =>
>>   set_next_task_scx(p).
>> 
>> dequeue_task_scx(p, DEQUEUE_SAVE) calls ops_dequeue() which calls
>> ops.dequeue(p, ... | SCHED_CHANGE) and clears
>> SCX_TASK_{OPS_ENQUEUED,DISPATCH_DEQUEUED} from p->scx.flags.
>> 
>> put_prev_task_scx(p) doesn't do much because SCX_TASK_QUEUED was cleared by
>> dequeue_task_scx().
>> 
>> enqueue_task_scx(p, ENQUEUE_RESTORE) sets sticky_cpu because the task is
>> currently running and ENQUEUE_RESTORE is set. This causes do_enqueue_task() to
>> jump straight to local_norefill, skipping the call to ops.enqueue(), leaving
>> SCX_TASK_OPS_ENQUEUED unset, and then enqueueing the task on the local DSQ.
>> 
>> set_next_task_scx(p) calls ops_dequeue(p, SCX_DEQ_CORE_SCHED_EXEC) even though
>> this is not a core-sched pick, but it won't do much because the ops_state is
>> SCX_OPSS_NONE and SCX_TASK_OPS_ENQUEUED is unset. It also calls
>> dispatch_dequeue(p) which then removes the task from the local DSQ it was just
>> inserted into.
>> 
>> 
>> So, we end up in a state where any subsequent property change while the task is
>> still on CPU will not result in ops.dequeue(p, ... | SCHED_CHANGE) being
>> called, because both SCX_TASK_OPS_ENQUEUED and SCX_TASK_DISPATCH_DEQUEUED are
>> unset in p->scx.flags.
>> 
>> I really hope I didn't mess anything up when tracing the code, but of course
>> I'm happy to be corrected.
>
> Correct. And the enqueue/dequeue balancing is preserved here. In the
> scenario you describe, subsequent property changes while the task remains
> running go through ENQUEUE_RESTORE, which intentionally skips
> ops.enqueue(). Since no new enqueue cycle is started, there is no
> corresponding ops.dequeue() to deliver either.
>
> In other words, SCX_DEQ_SCHED_CHANGE is associated with invalidating the
> scheduler state established by the last ops.enqueue(), not with every
> individual property change. Multiple property changes while the task stays
> on CPU are coalesced and the enqueue/dequeue pairing remains balanced.

OK, I think I see the logic behind this; here's my understanding:

The BPF scheduler is naturally going to have some internal per-task state.
That state may be expensive to compute from scratch, so we don't want to
completely discard it when the BPF scheduler loses ownership of the task.

ops.dequeue(SCHED_CHANGE) serves as a notification to the BPF scheduler:
"Hey, some scheduling properties of the task are about to change, so you
should probably invalidate whatever state you keep for that task that depends
on these properties."

That way, the BPF scheduler will know to recompute the invalidated state on
the next ops.enqueue(). If there was no call to ops.dequeue(SCHED_CHANGE), the
BPF scheduler knows that none of the task's fundamental scheduling properties
(priority, cpu, cpumask, etc.) changed, so it can potentially skip recomputing
the state. Of course, the potential for savings depends on the particular
scheduler's policy.
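
To make that concrete, here's roughly the pattern I have in mind (a sketch
only: the task_ctx layout and recompute_vtime() are made up for illustration,
and I'm assuming BPF task local storage plus the same scx helpers as in the
earlier snippet):

struct task_ctx {
	u64	cached_vtime;	/* expensive-to-recompute, policy-specific state */
	u32	weight;		/* last weight we saw (used further below) */
	bool	stale;		/* set on SCHED_CHANGE, cleared on next enqueue */
};

struct {
	__uint(type, BPF_MAP_TYPE_TASK_STORAGE);
	__uint(map_flags, BPF_F_NO_PREALLOC);
	__type(key, int);
	__type(value, struct task_ctx);
} task_ctxs SEC(".maps");

static u64 recompute_vtime(struct task_struct *p)
{
	/* Placeholder for whatever per-task math the policy needs. */
	return p->scx.dsq_vtime;
}

void BPF_STRUCT_OPS(sketch_dequeue, struct task_struct *p, u64 deq_flags)
{
	struct task_ctx *tctx = bpf_task_storage_get(&task_ctxs, p, 0, 0);

	/* Properties are about to change: anything we derived from them
	 * can no longer be trusted. */
	if (tctx && (deq_flags & SCX_DEQ_SCHED_CHANGE))
		tctx->stale = true;
}

void BPF_STRUCT_OPS(sketch_enqueue, struct task_struct *p, u64 enq_flags)
{
	struct task_ctx *tctx;

	tctx = bpf_task_storage_get(&task_ctxs, p, 0,
				    BPF_LOCAL_STORAGE_GET_F_CREATE);
	if (tctx && tctx->stale) {
		/* Recompute only when we were told to invalidate. */
		tctx->cached_vtime = recompute_vtime(p);
		tctx->stale = false;
	}

	scx_bpf_dsq_insert(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, enq_flags);
}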

This also explains why we only get one call to ops.dequeue(SCHED_CHANGE) while
a task is running: for subsequent property changes, the BPF scheduler has
already been notified to invalidate its state, so there's no point in
notifying it again.

However, I feel like there's a hidden assumption here that the BPF scheduler
doesn't recompute its state for the task before the next ops.enqueue().
What if the scheduler wanted to react immediately to a task's priority being
decreased by preempting it? You might say "hook into ops.set_weight()", but
then doesn't that obviate the need for ops.dequeue(SCHED_CHANGE)?
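
What I'm imagining there is something like the following (again just a
sketch, reusing the task_ctxs map from the snippet above together with the
scx_bpf_task_running()/scx_bpf_task_cpu()/scx_bpf_kick_cpu() kfuncs):

void BPF_STRUCT_OPS(sketch_set_weight, struct task_struct *p, u32 weight)
{
	struct task_ctx *tctx = bpf_task_storage_get(&task_ctxs, p, 0, 0);

	if (!tctx)
		return;

	/* Compare against the weight we last cached ourselves, and kick
	 * the task's CPU if it's running and just got de-prioritized. */
	if (weight < tctx->weight && scx_bpf_task_running(p))
		scx_bpf_kick_cpu(scx_bpf_task_cpu(p), SCX_KICK_PREEMPT);

	tctx->weight = weight;
}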

I guess it could be argued that ops.dequeue(SCHED_CHANGE) covers property
changes that happen under ``scoped_guard (sched_change, ...)`` which don't have
a dedicated ops callback, but I wasn't able to find any such properties which
would be relevant to SCX.

Another thought on the design: currently, the exact meaning of
ops.dequeue(SCHED_CHANGE) depends on whether the task is owned by the BPF
scheduler:

* When it's owned, it combines two notifications: the BPF scheduler is losing
  ownership AND it should invalidate its task state.
* When it's not owned, it only serves as an "invalidate" notification; the
  ownership status doesn't change.

Wouldn't it be more elegant to have another callback, say
ops.property_change(), which would only serve as the "invalidate" notification,
and leave ops.dequeue() only for tracking ownership?
That would mean calling ops.dequeue() followed by ops.property_change() when
changing properties of a task owned by the BPF scheduler, as opposed to a
single call to ops.dequeue(SCHED_CHANGE).
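
Spelled out with the dequeue sketch from above (purely hypothetical, of
course, since no ops.property_change() callback exists), the split would look
something like:

/* Hypothetical: ops.dequeue() does only ownership bookkeeping, while a
 * separate ops.property_change() carries the "invalidate" notification.
 * Neither the callback nor its calling convention exists today. */
void BPF_STRUCT_OPS(sketch_dequeue_ownership_only, struct task_struct *p,
		    u64 deq_flags)
{
	/* e.g. drop p from the scheduler's own queues/trees; no state
	 * invalidation here. */
}

void BPF_STRUCT_OPS(sketch_property_change, struct task_struct *p)
{
	struct task_ctx *tctx = bpf_task_storage_get(&task_ctxs, p, 0, 0);

	if (tctx)
		tctx->stale = true;	/* invalidate, nothing else */
}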

But honestly, when I put it like this, it gets harder to justify having this
callback over just using ops.set_weight() etc.

>
> I agree this distinction isn't obvious from the current documentation, I'll
> clarify that SCX_DEQ_SCHED_CHANGE is edge-triggered per enqueue/run cycle,
> not per property change.
>
> Do you see any practical use case where it'd be beneficial to tie
> individual ops.dequeue() calls to every property change, as opposed to the
> current coalesced behavior?

I don't know how practical it is, but in my comment above I mention a BPF
scheduler wanting to immediately preempt a running task when its priority
decreases; in that case, though, we need to hook into ops.set_weight() anyway
to find out whether the priority was decreased.

Thanks,
Kuba

