linux-kernel - Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics

Message-ID: <aX3FG00RNMv8VnQQ@gpd4>
Date: Sat, 31 Jan 2026 10:02:19 +0100
From: Andrea Righi <arighi@...dia.com>
To: Kuba Piecuch <jpiecuch@...gle.com>
Cc: Tejun Heo <tj@...nel.org>, David Vernet <void@...ifault.com>,
	Changwoo Min <changwoo@...lia.com>,
	Christian Loehle <christian.loehle@....com>,
	Daniel Hodges <hodgesd@...a.com>, sched-ext@...ts.linux.dev,
	linux-kernel@...r.kernel.org,
	Emil Tsalapatis <emil@...alapatis.com>
Subject: Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics

On Fri, Jan 30, 2026 at 11:54:00AM +0000, Kuba Piecuch wrote:
> Hi Tejun,
> 
> On Wed Jan 28, 2026 at 9:21 PM UTC, Tejun Heo wrote:
> ...
> > 1. When to call ops.dequeue()?
> >
> > I'm not sure about deciding whether to call ops.dequeue() solely on whether
> > ops.enqueue() was called. Direct dispatch has been expanded to include other
> > DSQs but was originally added as a way to shortcut the dispatch path and
> > "dispatch directly" for execution from ops.select_cpu/enqueue() paths. ie.
> > When a task is dispatched directly to a local DSQ, the BPF scheduler is done
> > with that task - the task is now in the same state as tasks that get
> > dispatched to a local DSQ from ops.dispatch().
> >
> > ie. What effectively decides whether a task left the BPF scheduler is
> > whether the task reached a local DSQ or not, and direct dispatching into a
> > local DSQ shouldn't trigger ops.dequeue() - the task never really "queues"
> > on the BPF scheduler.
> 
> Is "local" short for "local or global", i.e. not user-created?
> Direct dispatching into the global DSQ also shouldn't trigger ops.dequeue(),
> since dispatch isn't necessary for the task to run. This follows from the last
> paragraph:
> 
>   Note that, this way, whether ops.dequeue() needs to be called agrees with
>   whether the task needs to be dispatched to run.
> 
> I agree with your points, just wanted to clarify this one thing.

I think this should be interpreted as local DSQs only
(SCX_DSQ_LOCAL / SCX_DSQ_LOCAL_ON), not any built-in DSQ. SCX_DSQ_GLOBAL is
essentially a built-in user DSQ provided for convenience; it's not really
a "direct dispatch" DSQ.
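
To make the distinction concrete, here is a minimal sketch of that
interpretation. It is a toy model, not kernel code: the toy_* names are
made up for illustration and only stand in for the real DSQ IDs.

```c
#include <stdbool.h>

/*
 * Toy model, not kernel code: these enumerators are made up and only
 * stand in for SCX_DSQ_LOCAL, SCX_DSQ_LOCAL_ON, SCX_DSQ_GLOBAL and a
 * user-created DSQ.
 */
enum toy_dsq {
	TOY_DSQ_LOCAL,
	TOY_DSQ_LOCAL_ON,
	TOY_DSQ_GLOBAL,
	TOY_DSQ_USER,
};

/* Only the local DSQs count as "direct dispatch" targets here. */
static bool toy_dsq_is_local(enum toy_dsq dsq)
{
	return dsq == TOY_DSQ_LOCAL || dsq == TOY_DSQ_LOCAL_ON;
}

/*
 * Does a task that ops.enqueue() direct-dispatched to @dsq later need an
 * ops.dequeue() notification under this interpretation?
 */
static bool toy_needs_dequeue(enum toy_dsq dsq)
{
	return !toy_dsq_is_local(dsq);
}
```

With this reading, the global DSQ falls on the same side as user-created
DSQs.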

> 
> >
> > This creates another discrepancy - From ops.enqueue(), direct dispatching
> > into a non-local DSQ clearly makes the task enter the BPF scheduler and thus
> > its departure should trigger ops.dequeue(). What about a task which is
> > direct dispatched to a non-local DSQ from ops.select_cpu()? Superficially,
> > the right thing to do seems to skip ops.dequeue(). After all, the task has
> > never been ops.enqueue()'d. However, I think this is another case where
> > what's obvious doesn't agree with what's happening underneath.
> >
> > ops.select_cpu() cannot actually queue anything. It's too early. Direct
> > dispatch from ops.select_cpu() is a shortcut to schedule direct dispatch
> > once the enqueue path is invoked so that the BPF scheduler can avoid
> > invocation of ops.enqueue() when the decision has already been made. While
> > this shortcut was added for convenience (so that e.g. the BPF scheduler
> > doesn't have to pass a note from ops.select_cpu() to ops.enqueue()), it has
> > real performance implications as it does save a roundtrip through
> > ops.enqueue() and we know that such overheads do matter for some use cases
> > (e.g. maximizing FPS on certain games).
> >
> > So, while more subtle on the surface, I think the right thing to do is
> > basing the decision to call ops.dequeue() on the task's actual state -
> > ops.dequeue() should be called if the task is "on" the BPF scheduler - ie.
> > if the task ran ops.select_cpu/enqueue() paths and ended up in a non-local
> > DSQ or on the BPF side.
> >
> > The subtlety would need clear documentation and we probably want to allow
> > ops.dequeue() to distinguish different cases. If you boil it down to the
> > actual task state, I don't think it's that subtle - if a task is in the
> > custody of the BPF scheduler, ops.dequeue() will be called. Otherwise, not.
> > Note that, this way, whether ops.dequeue() needs to be called agrees with
> > whether the task needs to be dispatched to run.
> 
> Here's my attempt at documenting this behavior:
> 
> After ops.enqueue() is called on a task, the task is owned by the BPF
> scheduler, provided the task wasn't direct-dispatched to a local/global DSQ.
> When a task is owned by the BPF scheduler, the scheduler needs to dispatch the
> task to a local/global DSQ in order for it to run.
> When the BPF scheduler loses ownership of the task, either due to dispatching it
> to a local/global DSQ or due to external events (core-sched pick, CPU
> migration, scheduling property changes), the BPF scheduler is notified through
> ops.dequeue() with appropriate flags (TBD).

This looks good overall, except for the global DSQ part. Also, it might be
better to avoid the term "owned": internally the kernel already uses the
concept of "task ownership" with a different meaning (see
https://lore.kernel.org/all/aVHAZNbIJLLBHEXY@slm.duckdns.org), and reusing
it here could be misleading.

With that in mind, I'd probably rephrase your documentation along these
lines:

After ops.enqueue() is called, the task is considered *enqueued* by the BPF
scheduler, unless it is directly dispatched to a local DSQ (via
SCX_DSQ_LOCAL or SCX_DSQ_LOCAL_ON).

While a task is enqueued, the BPF scheduler must explicitly dispatch it to
a DSQ in order for it to run.

When a task leaves the enqueued state (either because it is dispatched to a
non-local DSQ, or due to external events such as a core-sched pick, CPU
migration, or scheduling property changes), ops.dequeue() is invoked to
notify the BPF scheduler, with flags indicating the reason for the dequeue:
regular dispatch dequeues have no flags set, whereas dequeues triggered by
scheduling property changes are reported with SCX_DEQ_SCHED_CHANGE.
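
As a rough illustration of the wording above, the enqueued state and the
dequeue flags could be modeled like this. It is only a sketch in plain C,
not the actual implementation: every toy_* / TOY_* name is hypothetical,
the flag value is an arbitrary placeholder for SCX_DEQ_SCHED_CHANGE, and
only the two dequeue reasons whose flags are spelled out above are
modeled.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy model, not kernel code. All toy_* / TOY_* names are made up. */
enum toy_dsq {
	TOY_DSQ_NONE,		/* task kept on the BPF side, no direct dispatch */
	TOY_DSQ_LOCAL,		/* stands in for SCX_DSQ_LOCAL */
	TOY_DSQ_LOCAL_ON,	/* stands in for SCX_DSQ_LOCAL_ON */
	TOY_DSQ_OTHER,		/* any non-local DSQ */
};

enum toy_leave_reason {
	TOY_LEAVE_DISPATCH,	/* regular dispatch */
	TOY_LEAVE_SCHED_CHANGE,	/* scheduling property change */
};

/* Placeholder bit for SCX_DEQ_SCHED_CHANGE; the value is arbitrary. */
#define TOY_DEQ_SCHED_CHANGE	(1ULL << 0)

struct toy_task {
	bool enqueued;			/* in the "enqueued" state? */
	unsigned int nr_dequeue;	/* modeled ops.dequeue() calls */
	uint64_t last_deq_flags;	/* flags of the last modeled call */
};

/* State after ops.enqueue() returns, given where it put the task. */
static void toy_enqueue(struct toy_task *p, enum toy_dsq target)
{
	p->enqueued = target != TOY_DSQ_LOCAL && target != TOY_DSQ_LOCAL_ON;
}

/* The task leaves the enqueued state; notify only if it was in it. */
static void toy_leave(struct toy_task *p, enum toy_leave_reason reason)
{
	if (!p->enqueued)
		return;

	p->enqueued = false;
	p->nr_dequeue++;
	p->last_deq_flags = reason == TOY_LEAVE_SCHED_CHANGE ?
			    TOY_DEQ_SCHED_CHANGE : 0;
}
```

In this model a task that was direct-dispatched to a local DSQ never
enters the enqueued state, so toy_leave() is a no-op for it.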

What do you think?

Thanks,
-Andrea