linux-kernel - Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics


Message-ID: <aX46tvnTjLZy0pCW@gpd4>
Date: Sat, 31 Jan 2026 18:24:06 +0100
From: Andrea Righi <arighi@...dia.com>
To: Kuba Piecuch <jpiecuch@...gle.com>
Cc: Tejun Heo <tj@...nel.org>, David Vernet <void@...ifault.com>,
	Changwoo Min <changwoo@...lia.com>,
	Christian Loehle <christian.loehle@....com>,
	Daniel Hodges <hodgesd@...a.com>, sched-ext@...ts.linux.dev,
	linux-kernel@...r.kernel.org,
	Emil Tsalapatis <emil@...alapatis.com>
Subject: Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics

Hi Kuba,

On Sat, Jan 31, 2026 at 04:45:59PM +0000, Kuba Piecuch wrote:
...
> >> The BPF scheduler is naturally going to have some internal per-task state.
> >> That state may be expensive to compute from scratch, so we don't want to
> >> completely discard it when the BPF scheduler loses ownership of the task.
> >> 
> >> ops.dequeue(SCHED_CHANGE) serves as a notification to the BPF scheduler:
> >> "Hey, some scheduling properties of the task are about to change, so you
> >> probably should invalidate whatever state you have for that task which depends
> >> on these properties."
> >
> > Correct. And it's also a way to notify that the task has left the BPF
> > scheduler, so if the task is stored in any internal queue it can/should be
> > removed.
> 
> Right, unless the task has already been dispatched, in which case it's just
> an invalidation notification.

Right, but if the task has already been dispatched I don't think we should
trigger ops.dequeue(SCHED_CHANGE), because it's no longer under the BPF
scheduler's custody (that's not how it's implemented right now, I'm just
trying to define the proper semantics based on the latest discussions).

> >> That way, the BPF scheduler will know to recompute the invalidated state on
> >> the next ops.enqueue(). If there was no call to ops.dequeue(SCHED_CHANGE), the
> >> BPF scheduler knows that none of the task's fundamental scheduling properties
> >> (priority, cpu, cpumask, etc.) changed, so it can potentially skip recomputing
> >> the state. Of course, the potential for savings depends on the particular
> >> scheduler's policy.
> >> 
> >> This also explains why we only get one call to ops.dequeue(SCHED_CHANGE) while
> >> a task is running: for subsequent calls, the BPF scheduler had already been
> >> notified to invalidate its state, so there's no use in notifying it again.
> >
> > Actually I think the proper behavior would be to trigger
> > ops.dequeue(SCHED_CHANGE) only when the task is "owned" by the BPF
> > scheduler. While running, tasks are outside the BPF scheduler ownership, so
> > ops.dequeue() shouldn't be triggered at all.
> >
> 
> I don't think this is what the current implementation does, right?

Right, sorry, I wasn't clear. I'm just trying to define the behavior that
makes more sense (see below).

> >> However, I feel like there's a hidden assumption here that the BPF scheduler
> >> doesn't recompute its state for the task before the next ops.enqueue().
> >
> > And that should be the proper behavior. BPF scheduler should recompute a
> > task state only when the task is re-enqueued after a property change.
> >
> 
> That would make sense if ops.enqueue() was called immediately after a property
> change when a task is running, but I believe that's currently not the case,
> see my attempt at tracing the enqueue-dequeue cycle on property change in my
> first reply.

Yeah, that's right.

I have a new patch set where I've implemented the following semantics
(which should also match Tejun's requirements).

With the new semantics:
 - for running tasks: property changes do NOT trigger ops.dequeue(SCHED_CHANGE)
 - once a task leaves BPF custody (dispatched to local DSQ), the BPF
   scheduler no longer manages it
 - property changes on running tasks don't affect the BPF scheduler

Key principle: ops.dequeue() is only called when a task leaves the BPF
scheduler's custody. A running task has already left BPF custody, so
property changes on it don't trigger ops.dequeue().

Therefore, ops.dequeue(SCHED_CHANGE) gets called only when:
 - task is in BPF data structures (QUEUED state), or
 - task is on a non-local DSQ (still in BPF custody)

In this case (BPF scheduler custody), if a property change happens,
ops.dequeue(SCHED_CHANGE) is called to notify the BPF scheduler.
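
To make the rule above concrete, here is a minimal userspace sketch of the
custody check. It is only a model: the enum and the function names are
hypothetical and don't exist in the kernel, they just encode the two cases
listed above.

```c
#include <stdbool.h>

/* Hypothetical task states, for illustration only (not the kernel's enum). */
enum model_task_state {
	MODEL_QUEUED_IN_BPF,	/* held in the BPF scheduler's data structures */
	MODEL_ON_NONLOCAL_DSQ,	/* on a non-local DSQ, still in BPF custody */
	MODEL_ON_LOCAL_DSQ,	/* dispatched to a local DSQ, custody has ended */
	MODEL_RUNNING,		/* running on a CPU, custody has ended */
};

/* A task is in BPF custody only in the first two states. */
static bool model_in_bpf_custody(enum model_task_state s)
{
	return s == MODEL_QUEUED_IN_BPF || s == MODEL_ON_NONLOCAL_DSQ;
}

/*
 * Proposed semantics: a property change triggers
 * ops.dequeue(SCHED_CHANGE) only while the task is in BPF custody.
 */
static bool model_change_triggers_dequeue(enum model_task_state s)
{
	return model_in_bpf_custody(s);
}
```

With this model, a property change on a task in MODEL_RUNNING or
MODEL_ON_LOCAL_DSQ produces no ops.dequeue() call at all.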

Then, if you want to react immediately to property changes on running
tasks, we have:
 - ops.set_cpumask(): CPU affinity changes
 - ops.set_weight(): priority/nice changes
 - ops.cgroup_*(): cgroup changes

In conclusion, we don't need ops.dequeue(SCHED_CHANGE) for running tasks:
the dedicated callbacks (ops.set_cpumask(), ops.set_weight(), ...) already
cover property changes on all tasks, regardless of whether they're running
or in BPF custody. And with the new semantics ops.dequeue(SCHED_CHANGE)
only notifies about property changes while a task is actively managed by
the BPF scheduler (in QUEUED state or on a non-local DSQ).
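
As a rough illustration of how the two notification paths would combine,
here is a small userspace model of a weight change. Everything in it is an
assumption made for the example: struct model_task_ctx, its fields and both
helpers are hypothetical, and the ordering (dequeue notification first,
then the set_weight-style callback) is simply how this sketch chooses to
model it.

```c
#include <stdbool.h>

/* Hypothetical per-task state kept by a BPF scheduler (illustration only). */
struct model_task_ctx {
	bool in_custody;	/* queued in BPF structures or on a non-local DSQ */
	bool cache_valid;	/* expensive derived state is still usable */
	unsigned int weight;	/* last weight seen by the scheduler */
};

/* Models an ops.set_weight()-style callback: invoked for every task. */
static void model_set_weight(struct model_task_ctx *t, unsigned int weight)
{
	t->weight = weight;
	t->cache_valid = false;	/* state derived from the weight is stale */
}

/*
 * Models a weight change under the proposed semantics. Returns true if
 * ops.dequeue(SCHED_CHANGE) would have been called, i.e. only when the
 * task was in BPF custody at the time of the change.
 */
static bool model_weight_change(struct model_task_ctx *t, unsigned int weight)
{
	bool dequeued = t->in_custody;

	if (dequeued)
		t->in_custody = false;	/* the task leaves BPF custody */

	model_set_weight(t, weight);	/* delivered in both cases */
	return dequeued;
}
```

In this model a running task (in_custody == false) still gets its cached
state invalidated through the set_weight path, without any dequeue
notification.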

Do you think it's reasonable enough / do you see any flaws?

Thanks,
-Andrea