linux-kernel - Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics

Message-ID: <aXN4binznj01rNWO@gpd4>
Date: Fri, 23 Jan 2026 14:32:30 +0100
From: Andrea Righi <arighi@...dia.com>
To: Kuba Piecuch <jpiecuch@...gle.com>
Cc: Tejun Heo <tj@...nel.org>, David Vernet <void@...ifault.com>,
	Changwoo Min <changwoo@...lia.com>,
	Emil Tsalapatis <emil@...alapatis.com>,
	Daniel Hodges <hodgesd@...a.com>, sched-ext@...ts.linux.dev,
	linux-kernel@...r.kernel.org
Subject: Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics

On Thu, Jan 22, 2026 at 09:28:39AM +0000, Kuba Piecuch wrote:
> [Resending with reply-all, messed up the first time, apologies.]

Re-sending my reply as well, just for the record. :)

> 
> Hi Andrea,
> 
> On Wed Jan 21, 2026 at 12:25 PM UTC, Andrea Righi wrote:
> > Currently, ops.dequeue() is only invoked when the sched_ext core knows
> > that a task resides in BPF-managed data structures, which causes it to
> > miss scheduling property change scenarios. As a result, BPF schedulers
> > cannot reliably track task state.
> >
> > In addition, some ops.dequeue() callbacks can be skipped (e.g., during
> > direct dispatch), so ops.enqueue() calls are not always paired with a
> > corresponding ops.dequeue(), potentially breaking accounting logic.
> >
> > Fix this by guaranteeing that every ops.enqueue() is matched with a
> > corresponding ops.dequeue(), and introduce the SCX_DEQ_ASYNC flag to
> > distinguish dequeues triggered by scheduling property changes from those
> > occurring in the normal dispatch workflow.
> >
> > New semantics:
> > 1. ops.enqueue() is called when a task enters the BPF scheduler
> > 2. ops.dequeue() is called when the task leaves the BPF scheduler,
> >    because it is dispatched to a DSQ (regular workflow)
> > 3. ops.dequeue(SCX_DEQ_ASYNC) is called when the task leaves the BPF
> >    scheduler, because a task property is changed (sched_change)
> 
> What about the case where ops.dequeue() is called due to core-sched picking the
> task through sched_core_find()? If I understand core-sched correctly, it can
> happen without prior dispatch, so it doesn't fit case 2, and we're not changing
> task properties, so it doesn't fit case 3 either.

You're absolutely right: core-sched picks are handled inconsistently.
They're treated as property-change dequeues in the SCX_OPSS_QUEUED case and
as dispatch dequeues in the SCX_OPSS_NONE case.

Core-sched picks should be treated consistently as regular dequeues, since
they're not property changes. I'll fix this in the next version (adding a
SCX_DEQ_CORE_SCHED_EXEC check in the SCX_OPSS_QUEUED case should make the
core-sched handling consistent).
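
As a rough illustration of the intended classification, here is a minimal
user-space sketch. It is not the kernel code and not the next version of
the patch: the MODEL_* names, the bit positions and the helper
model_dequeue_flags() are assumptions made up for this example, loosely
mirroring SCX_OPSS_NONE / SCX_OPSS_QUEUED, SCX_DEQ_CORE_SCHED_EXEC and the
proposed SCX_DEQ_ASYNC.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for SCX_OPSS_NONE and SCX_OPSS_QUEUED. */
enum model_opss {
	MODEL_OPSS_NONE,	/* task is not held by the BPF scheduler */
	MODEL_OPSS_QUEUED,	/* task sits in BPF-managed data structures */
};

/* Illustrative bit positions only, not taken from the kernel headers. */
#define MODEL_DEQ_CORE_SCHED_EXEC	(1ULL << 0)
#define MODEL_DEQ_ASYNC			(1ULL << 1)

/*
 * Compute the flags that would be passed to ops.dequeue() in this model.
 *
 * A dequeue of a task that is still queued in the BPF scheduler is treated
 * as a property change (ASYNC), unless it was triggered by a core-sched
 * pick, which is always reported as a regular dequeue.
 */
static uint64_t model_dequeue_flags(enum model_opss state, uint64_t deq_flags)
{
	/* In this model, ASYNC is only ever added here, never by the caller. */
	assert(!(deq_flags & MODEL_DEQ_ASYNC));

	/* Core-sched picks are regular dequeues in both states. */
	if (deq_flags & MODEL_DEQ_CORE_SCHED_EXEC)
		return deq_flags;

	/* Still queued and not a core-sched pick: property change. */
	if (state == MODEL_OPSS_QUEUED)
		return deq_flags | MODEL_DEQ_ASYNC;

	/* Already dispatched: regular dequeue. */
	return deq_flags;
}
```

With this shape, a core-sched pick yields the same flags regardless of
whether the task is in the QUEUED or NONE state, which is the consistency
being asked for.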

> 
> > +     /*
> > +      * Set when ops.dequeue() is called after successful dispatch; used to
> > +      * distinguish dispatch dequeues from async dequeues (property changes)
> > +      * and to prevent duplicate dequeue calls.
> > +      */
> > +     SCX_TASK_DISPATCH_DEQUEUED = 1 << 4,
> 
> I see this flag being set and cleared in several places, but I don't see it
> actually being read, is that intentional?

And you're right here as well. At some point this flag was used to
distinguish dispatch dequeues from async dequeues, but it isn't actually
used anymore. I'll clean this up in the next version.

Thanks,
-Andrea