linux-kernel - Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics

Message-ID: <DG2XDI382JPD.6GDH6BO96EXY@google.com>
Date: Sat, 31 Jan 2026 16:45:59 +0000
From: Kuba Piecuch <jpiecuch@...gle.com>
To: Andrea Righi <arighi@...dia.com>, Kuba Piecuch <jpiecuch@...gle.com>
Cc: Tejun Heo <tj@...nel.org>, David Vernet <void@...ifault.com>, 
	Changwoo Min <changwoo@...lia.com>, Christian Loehle <christian.loehle@....com>, 
	Daniel Hodges <hodgesd@...a.com>, <sched-ext@...ts.linux.dev>, 
	<linux-kernel@...r.kernel.org>, Emil Tsalapatis <emil@...alapatis.com>
Subject: Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics

Hi Andrea,

On Sat Jan 31, 2026 at 6:54 AM UTC, Andrea Righi wrote:
>> >> If I understand the logic correctly, ops.dequeue(SCHED_CHANGE) will be called
>> >> for a task at most once between it being dispatched and taken off the CPU,
>> >> even if its properties are changed multiple times while it's on CPU.
>> >> Is that intentional? I don't see it documented.
>> >> 
>> >> To illustrate, assume we have a task p that has been enqueued, dispatched, and
>> >> is currently running on the CPU, so we have both SCX_TASK_OPS_ENQUEUED and
>> >> SCX_TASK_DISPATCH_DEQUEUED set in p->scx.flags.
>> >> 
>> >> When a property of p is changed while it runs on the CPU,
>> >> the sequence of calls is:
>> >>   dequeue_task_scx(p, DEQUEUE_SAVE) => put_prev_task_scx(p) =>
>> >>   (change property) => enqueue_task_scx(p, ENQUEUE_RESTORE) =>
>> >>   set_next_task_scx(p).
>> >> 
>> >> dequeue_task_scx(p, DEQUEUE_SAVE) calls ops_dequeue() which calls
>> >> ops.dequeue(p, ... | SCHED_CHANGE) and clears
>> >> SCX_TASK_{OPS_ENQUEUED,DISPATCH_DEQUEUED} from p->scx.flags.
>> >> 
>> >> put_prev_task_scx(p) doesn't do much because SCX_TASK_QUEUED was cleared by
>> >> dequeue_task_scx().
>> >> 
>> >> enqueue_task_scx(p, ENQUEUE_RESTORE) sets sticky_cpu because the task is
>> >> currently running and ENQUEUE_RESTORE is set. This causes do_enqueue_task() to
>> >> jump straight to local_norefill, skipping the call to ops.enqueue(), leaving
>> >> SCX_TASK_OPS_ENQUEUED unset, and then enqueueing the task on the local DSQ.
>> >> 
>> >> set_next_task_scx(p) calls ops_dequeue(p, SCX_DEQ_CORE_SCHED_EXEC) even though
>> >> this is not a core-sched pick, but it won't do much because the ops_state is
>> >> SCX_OPSS_NONE and SCX_TASK_OPS_ENQUEUED is unset. It also calls
>> >> dispatch_dequeue(p) which then removes the task from the local DSQ it was just
>> >> inserted into.
>> >> 
>> >> 
>> >> So, we end up in a state where any subsequent property change while the task is
>> >> still on CPU will not result in ops.dequeue(p, ... | SCHED_CHANGE) being
>> >> called, because both SCX_TASK_OPS_ENQUEUED and SCX_TASK_DISPATCH_DEQUEUED are
>> >> unset in p->scx.flags.
>> >> 
>> >> I really hope I didn't mess anything up when tracing the code, but of course
>> >> I'm happy to be corrected.
>> >
>> > Correct. And the enqueue/dequeue balancing is preserved here. In the
>> > scenario you describe, subsequent property changes while the task remains
>> > running go through ENQUEUE_RESTORE, which intentionally skips
>> > ops.enqueue(). Since no new enqueue cycle is started, there is no
>> > corresponding ops.dequeue() to deliver either.
>> >
>> > In other words, SCX_DEQ_SCHED_CHANGE is associated with invalidating the
>> > scheduler state established by the last ops.enqueue(), not with every
>> > individual property change. Multiple property changes while the task stays
>> > on CPU are coalesced and the enqueue/dequeue pairing remains balanced.
>> 
>> OK, I think I understand the logic behind this. Here's my reading of it:
>> 
>> The BPF scheduler is naturally going to have some internal per-task state.
>> That state may be expensive to compute from scratch, so we don't want to
>> completely discard it when the BPF scheduler loses ownership of the task.
>> 
>> ops.dequeue(SCHED_CHANGE) serves as a notification to the BPF scheduler:
>> "Hey, some scheduling properties of the task are about to change, so you
>> probably should invalidate whatever state you have for that task which depends
>> on these properties."
>
> Correct. And it's also a way to notify that the task has left the BPF
> scheduler, so if the task is stored in any internal queue it can/should be
> removed.

Right, unless the task has already been dispatched, in which case it's just
an invalidation notification.
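
To make the two cases concrete, here is a minimal userspace sketch of how a
scheduler's dequeue handler might tell them apart. It is not sched_ext code:
struct task_ctx, model_on_dequeue() and the DEQ_SCHED_CHANGE value are all
hypothetical names made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the SCHED_CHANGE dequeue flag; value is arbitrary. */
#define DEQ_SCHED_CHANGE (1ULL << 0)

/* Hypothetical per-task state kept by the scheduler. */
struct task_ctx {
	bool in_internal_queue;   /* scheduler still owns the task */
	bool cached_state_valid;  /* derived state that depends on task properties */
	unsigned int removals;    /* how many times we unlinked the task */
};

/* Model of what a dequeue handler could do with the notification. */
static void model_on_dequeue(struct task_ctx *ctx, uint64_t deq_flags)
{
	assert(ctx != NULL);

	/* Ownership part: only applies if the task was not dispatched yet. */
	if (ctx->in_internal_queue) {
		ctx->in_internal_queue = false;
		ctx->removals++;
	}

	/* Invalidation part: applies whether or not the task was owned. */
	if (deq_flags & DEQ_SCHED_CHANGE)
		ctx->cached_state_valid = false;
}
```

For a task that has already been dispatched, in_internal_queue is false on
entry, so only the invalidation branch has any effect.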

>
>> 
>> That way, the BPF scheduler will know to recompute the invalidated state on
>> the next ops.enqueue(). If there was no call to ops.dequeue(SCHED_CHANGE), the
>> BPF scheduler knows that none of the task's fundamental scheduling properties
>> (priority, cpu, cpumask, etc.) changed, so it can potentially skip recomputing
>> the state. Of course, the potential for savings depends on the particular
>> scheduler's policy.
>> 
>> This also explains why we only get one call to ops.dequeue(SCHED_CHANGE) while
>> a task is running: for subsequent calls, the BPF scheduler had already been
>> notified to invalidate its state, so there's no use in notifying it again.
>
> Actually I think the proper behavior would be to trigger
> ops.dequeue(SCHED_CHANGE) only when the task is "owned" by the BPF
> scheduler. While running, tasks are outside the BPF scheduler ownership, so
> ops.dequeue() shouldn't be triggered at all.
>

I don't think this is what the current implementation does, right?

>> 
>> However, I feel like there's a hidden assumption here that the BPF scheduler
>> doesn't recompute its state for the task before the next ops.enqueue().
>
> And that should be the proper behavior. BPF scheduler should recompute a
> task state only when the task is re-enqueued after a property change.
>

That would make sense if ops.enqueue() were called immediately after a property
change on a running task, but I believe that's currently not the case; see my
attempt at tracing the enqueue-dequeue cycle on property change in my first
reply.

>> What if the scheduler wanted to immediately react to the priority of a task
>> being decreased by preempting it? You might say "hook into
>> ops.set_weight()", but then doesn't that obviate the need for
>> ops.dequeue(SCHED_CHANGE)?
>
> If a scheduler wants to implement preemption on property change, it can do
> so in ops.enqueue(): after a property change, the task is re-enqueued,
> triggering ops.enqueue(), at which point the BPF scheduler can decide
> whether and how to preempt currently running tasks.
>
> If a property change does not result in an ops.enqueue() call, it means the
> task is not runnable yet (or does not intend to run), so attempting to
> trigger a preemption at that point would be pointless.
>

IIUC a dequeue-enqueue cycle on a running task during property change doesn't
result in a call to ops.enqueue(), so if the BPF scheduler recomputed its state
only in ops.enqueue(), then it wouldn't be able to react immediately.
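
As a sanity check of that reading, the cycle can be modelled in plain userspace
C. This is only a sketch of the trace from my first reply, under the assumption
that the trace is right: the MODEL_* flags and model_*() functions are
hypothetical stand-ins, not the kernel's definitions, and the callbacks are
reduced to counters.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the p->scx.flags bits discussed in the thread. */
enum {
	MODEL_QUEUED            = 1u << 0,
	MODEL_OPS_ENQUEUED      = 1u << 1,
	MODEL_DISPATCH_DEQUEUED = 1u << 2,
};

struct model_task {
	unsigned int flags;
	bool on_local_dsq;
	unsigned int ops_enqueue_calls;  /* calls to ops.enqueue() */
	unsigned int ops_dequeue_calls;  /* calls to ops.dequeue(SCHED_CHANGE) */
};

/* dequeue_task_scx(p, DEQUEUE_SAVE): notify at most once, then clear flags. */
static void model_dequeue_save(struct model_task *p)
{
	assert(p != NULL);
	if (p->flags & (MODEL_OPS_ENQUEUED | MODEL_DISPATCH_DEQUEUED)) {
		p->ops_dequeue_calls++;
		p->flags &= ~(MODEL_OPS_ENQUEUED | MODEL_DISPATCH_DEQUEUED);
	}
	p->flags &= ~MODEL_QUEUED;
}

/* enqueue_task_scx(p, ENQUEUE_RESTORE) on a running task: the sticky_cpu
 * path goes straight to the local DSQ and never calls ops.enqueue(). */
static void model_enqueue_restore_running(struct model_task *p)
{
	p->flags |= MODEL_QUEUED;
	p->on_local_dsq = true;
}

/* set_next_task_scx(p): takes the task back off the local DSQ. */
static void model_set_next_task(struct model_task *p)
{
	p->on_local_dsq = false;
}

/* One property change while p stays on CPU. */
static void model_property_change_on_cpu(struct model_task *p)
{
	model_dequeue_save(p);
	/* put_prev_task_scx(p): little to do, MODEL_QUEUED is already clear. */
	/* (the property itself changes here) */
	model_enqueue_restore_running(p);
	model_set_next_task(p);
}
```

Starting from a running task with all three flags set and calling
model_property_change_on_cpu() twice, the model counts a single dequeue
notification and no ops.enqueue() call, which is the behaviour I'm describing.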

>> 
>> I guess it could be argued that ops.dequeue(SCHED_CHANGE) covers property
>> changes that happen under ``scoped_guard (sched_change, ...)`` which don't have
>> a dedicated ops callback, but I wasn't able to find any such properties which
>> would be relevant to SCX.
>> 
>> Another thought on the design: currently, the exact meaning of
>> ops.dequeue(SCHED_CHANGE) depends on whether the task is owned by the BPF
>> scheduler:
>> 
>> * When it's owned, it combines two notifications: BPF scheduler losing
>>   ownership AND that it should invalidate task state.
>> * When it's not owned, it only serves as an "invalidate" notification,
>>   the ownership status doesn't change.
>
> When it's not owned I think ops.dequeue() shouldn't be triggered at all.
>
>> 
>> Wouldn't it be more elegant to have another callback, say
>> ops.property_change(), which would only serve as the "invalidate" notification,
>> and leave ops.dequeue() only for tracking ownership?
>> That would mean calling ops.dequeue() followed by ops.property_change() when
>> changing properties of a task owned by the BPF scheduler, as opposed to a
>> single call to ops.dequeue(SCHED_CHANGE).
>
> We could provide an ops.property_change(), but honestly I don't see any
> practical usage of this callback.
>

Neither do I, I just made it up for the sake of argument :-)

Thanks,
Kuba