linux-kernel - Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics

Message-ID: <aX2nFSL5QilKsGsm@gpd4>
Date: Sat, 31 Jan 2026 07:54:13 +0100
From: Andrea Righi <arighi@...dia.com>
To: Kuba Piecuch <jpiecuch@...gle.com>
Cc: Tejun Heo <tj@...nel.org>, David Vernet <void@...ifault.com>,
	Changwoo Min <changwoo@...lia.com>,
	Christian Loehle <christian.loehle@....com>,
	Daniel Hodges <hodgesd@...a.com>, sched-ext@...ts.linux.dev,
	linux-kernel@...r.kernel.org,
	Emil Tsalapatis <emil@...alapatis.com>
Subject: Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics

Hi Kuba,

On Fri, Jan 30, 2026 at 01:14:23PM +0000, Kuba Piecuch wrote:
...
> >> If I understand the logic correctly, ops.dequeue(SCHED_CHANGE) will be called
> >> for a task at most once between it being dispatched and taken off the CPU,
> >> even if its properties are changed multiple times while it's on CPU.
> >> Is that intentional? I don't see it documented.
> >> 
> >> To illustrate, assume we have a task p that has been enqueued, dispatched, and
> >> is currently running on the CPU, so we have both SCX_TASK_OPS_ENQUEUED and
> >> SCX_TASK_DISPATCH_DEQUEUED set in p->scx.flags.
> >> 
> >> When a property of p is changed while it runs on the CPU,
> >> the sequence of calls is:
> >>   dequeue_task_scx(p, DEQUEUE_SAVE) => put_prev_task_scx(p) =>
> >>   (change property) => enqueue_task_scx(p, ENQUEUE_RESTORE) =>
> >>   set_next_task_scx(p).
> >> 
> >> dequeue_task_scx(p, DEQUEUE_SAVE) calls ops_dequeue() which calls
> >> ops.dequeue(p, ... | SCHED_CHANGE) and clears
> >> SCX_TASK_{OPS_ENQUEUED,DISPATCH_DEQUEUED} from p->scx.flags.
> >> 
> >> put_prev_task_scx(p) doesn't do much because SCX_TASK_QUEUED was cleared by
> >> dequeue_task_scx().
> >> 
> >> enqueue_task_scx(p, ENQUEUE_RESTORE) sets sticky_cpu because the task is
> >> currently running and ENQUEUE_RESTORE is set. This causes do_enqueue_task() to
> >> jump straight to local_norefill, skipping the call to ops.enqueue(), leaving
> >> SCX_TASK_OPS_ENQUEUED unset, and then enqueueing the task on the local DSQ.
> >> 
> >> set_next_task_scx(p) calls ops_dequeue(p, SCX_DEQ_CORE_SCHED_EXEC) even though
> >> this is not a core-sched pick, but it won't do much because the ops_state is
> >> SCX_OPSS_NONE and SCX_TASK_OPS_ENQUEUED is unset. It also calls
> >> dispatch_dequeue(p) which then removes the task from the local DSQ it was just
> >> inserted into.
> >> 
> >> 
> >> So, we end up in a state where any subsequent property change while the task is
> >> still on CPU will not result in ops.dequeue(p, ... | SCHED_CHANGE) being
> >> called, because both SCX_TASK_OPS_ENQUEUED and SCX_TASK_DISPATCH_DEQUEUED are
> >> unset in p->scx.flags.
> >> 
> >> I really hope I didn't mess anything up when tracing the code, but of course
> >> I'm happy to be corrected.
> >
> > Correct. And the enqueue/dequeue balancing is preserved here. In the
> > scenario you describe, subsequent property changes while the task remains
> > running go through ENQUEUE_RESTORE, which intentionally skips
> > ops.enqueue(). Since no new enqueue cycle is started, there is no
> > corresponding ops.dequeue() to deliver either.
> >
> > In other words, SCX_DEQ_SCHED_CHANGE is associated with invalidating the
> > scheduler state established by the last ops.enqueue(), not with every
> > individual property change. Multiple property changes while the task stays
> > on CPU are coalesced and the enqueue/dequeue pairing remains balanced.
> 
> Ok, I think I understand the logic behind this. Here's how I understand it:
> 
> The BPF scheduler is naturally going to have some internal per-task state.
> That state may be expensive to compute from scratch, so we don't want to
> completely discard it when the BPF scheduler loses ownership of the task.
> 
> ops.dequeue(SCHED_CHANGE) serves as a notification to the BPF scheduler:
> "Hey, some scheduling properties of the task are about to change, so you
> probably should invalidate whatever state you have for that task which depends
> on these properties."

Correct. And it's also a way to notify that the task has left the BPF
scheduler, so if the task is stored in any internal queue it can/should be
removed.

> 
> That way, the BPF scheduler will know to recompute the invalidated state on
> the next ops.enqueue(). If there was no call to ops.dequeue(SCHED_CHANGE), the
> BPF scheduler knows that none of the task's fundamental scheduling properties
> (priority, cpu, cpumask, etc.) changed, so it can potentially skip recomputing
> the state. Of course, the potential for savings depends on the particular
> scheduler's policy.
> 
> This also explains why we only get one call to ops.dequeue(SCHED_CHANGE) while
> a task is running: for subsequent calls, the BPF scheduler had already been
> notified to invalidate its state, so there's no use in notifying it again.

Actually I think the proper behavior would be to trigger
ops.dequeue(SCHED_CHANGE) only when the task is "owned" by the BPF
scheduler. While running, tasks are outside the BPF scheduler ownership, so
ops.dequeue() shouldn't be triggered at all.
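
To make the ownership rule concrete, here is a rough sketch as a toy
userspace model. It is not kernel code: every toy_* name is made up for
illustration, and it assumes that a property change on an owned task is
followed by a fresh ops.enqueue(). It only counts the SCHED_CHANGE
notifications the BPF scheduler would see.

```c
#include <stdbool.h>

/*
 * Toy userspace model of the proposed semantics, NOT kernel code.
 * All toy_* names are hypothetical.
 */
struct toy_task {
	bool owned;			/* true while the BPF scheduler owns the task */
	bool running;			/* true while the task is on a CPU */
	int nr_enqueue;			/* ops.enqueue() calls seen */
	int nr_sched_change_dequeue;	/* ops.dequeue(SCHED_CHANGE) calls seen */
};

/* ops.enqueue(): the BPF scheduler takes ownership of the task. */
static void toy_enqueue(struct toy_task *p)
{
	p->owned = true;
	p->nr_enqueue++;
}

/* Dispatch + pick: the task leaves the BPF scheduler and starts running. */
static void toy_dispatch_and_run(struct toy_task *p)
{
	p->owned = false;
	p->running = true;
}

/* A property change (priority, cpumask, ...) wrapped in dequeue/enqueue. */
static void toy_property_change(struct toy_task *p)
{
	if (p->owned) {
		/* Owned: notify with SCHED_CHANGE, then start a new enqueue cycle. */
		p->owned = false;
		p->nr_sched_change_dequeue++;
		toy_enqueue(p);
	}
	/* Not owned (e.g. running): no ops.dequeue() and no ops.enqueue(). */
}

/* One change while queued, then two changes while running. */
static struct toy_task toy_run_scenario(void)
{
	struct toy_task p = { 0 };

	toy_enqueue(&p);
	toy_property_change(&p);	/* owned: one SCHED_CHANGE dequeue */
	toy_dispatch_and_run(&p);
	toy_property_change(&p);	/* running: nothing delivered */
	toy_property_change(&p);	/* running: nothing delivered */
	return p;
}
```

In this model the scenario ends with two enqueues and a single
SCHED_CHANGE dequeue, and neither change made while the task is running
reaches the BPF scheduler.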

> 
> However, I feel like there's a hidden assumption here that the BPF scheduler
> doesn't recompute its state for the task before the next ops.enqueue().

And that should be the proper behavior. The BPF scheduler should recompute
a task's state only when the task is re-enqueued after a property change.

> What if the scheduler wanted to immediately react to the priority of a task
> being decreased by preempting it? You might say "hook into
> ops.set_weight()", but then doesn't that obviate the need for
> ops.dequeue(SCHED_CHANGE)?

If a scheduler wants to implement preemption on property change, it can do
so in ops.enqueue(): after a property change, the task is re-enqueued,
triggering ops.enqueue(), at which point the BPF scheduler can decide
whether and how to preempt currently running tasks.

If a property change does not result in an ops.enqueue() call, it means the
task is not runnable yet (or does not intend to run), so attempting to
trigger a preemption at that point would be pointless.
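
As a minimal sketch of that decision (a toy model with hypothetical
names, not the sched_ext API), the check made from the enqueue path could
look like the function below. In an actual BPF scheduler a positive
answer would presumably translate into something like
scx_bpf_kick_cpu() with SCX_KICK_PREEMPT, but that part is not shown
here.

```c
#include <stdbool.h>

/* Toy model, not the sched_ext API; all names are hypothetical. */
struct toy_prio_task {
	int weight;	/* higher means more important */
};

/*
 * Decision a scheduler could take from its enqueue path: preempt the task
 * currently running on the target CPU only if the newly enqueued task
 * outweighs it.
 */
static bool toy_enqueue_should_preempt(const struct toy_prio_task *enq,
				       const struct toy_prio_task *curr)
{
	if (!curr)	/* idle CPU: nothing to preempt */
		return false;
	return enq->weight > curr->weight;
}
```

The comparison on weight is just one possible policy; the point is that
the decision is taken when the task is re-enqueued, using its updated
properties.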

> 
> I guess it could be argued that ops.dequeue(SCHED_CHANGE) covers property
> changes that happen under ``scoped_guard (sched_change, ...)`` which don't have
> a dedicated ops callback, but I wasn't able to find any such properties which
> would be relevant to SCX.
> 
> Another thought on the design: currently, the exact meaning of
> ops.dequeue(SCHED_CHANGE) depends on whether the task is owned by the BPF
> scheduler:
> 
> * When it's owned, it combines two notifications: BPF scheduler losing
>   ownership AND that it should invalidate task state.
> * When it's not owned, it only serves as an "invalidate" notification,
>   the ownership status doesn't change.

When it's not owned I think ops.dequeue() shouldn't be triggered at all.

> 
> Wouldn't it be more elegant to have another callback, say
> ops.property_change(), which would only serve as the "invalidate" notification,
> and leave ops.dequeue() only for tracking ownership?
> That would mean calling ops.dequeue() followed by ops.property_change() when
> changing properties of a task owned by the BPF scheduler, as opposed to a
> single call to ops.dequeue(SCHED_CHANGE).

We could provide an ops.property_change(), but honestly I don't see any
practical usage of this callback.

Thanks,
-Andrea