Message-ID: <aYm00oHl3BRWNFFA@gpd4>
Date: Mon, 9 Feb 2026 11:20:02 +0100
From: Andrea Righi <arighi@...dia.com>
To: Emil Tsalapatis <emil@...alapatis.com>
Cc: Tejun Heo <tj@...nel.org>, David Vernet <void@...ifault.com>,
	Changwoo Min <changwoo@...lia.com>,
	Kuba Piecuch <jpiecuch@...gle.com>,
	Christian Loehle <christian.loehle@....com>,
	Daniel Hodges <hodgesd@...a.com>, sched-ext@...ts.linux.dev,
	linux-kernel@...r.kernel.org
Subject: Re: [PATCH 2/2] selftests/sched_ext: Add test to validate
 ops.dequeue() semantics

On Sun, Feb 08, 2026 at 09:08:38PM +0100, Andrea Righi wrote:
> On Sun, Feb 08, 2026 at 12:59:36PM -0500, Emil Tsalapatis wrote:
> > On Sun Feb 8, 2026 at 8:55 AM EST, Andrea Righi wrote:
> > > On Sun, Feb 08, 2026 at 11:26:13AM +0100, Andrea Righi wrote:
> > >> On Sun, Feb 08, 2026 at 10:02:41AM +0100, Andrea Righi wrote:
> > >> ...
> > >> > > >> >  - From ops.select_cpu():
> > >> > > >> >      - scenario 0 (local DSQ): tasks dispatched to the local DSQ bypass
> > >> > > >> >        the BPF scheduler entirely; they never enter BPF custody, so
> > >> > > >> >        ops.dequeue() is not called,
> > >> > > >> >      - scenario 1 (global DSQ): tasks dispatched to SCX_DSQ_GLOBAL also
> > >> > > >> >        bypass the BPF scheduler, like the local DSQ; ops.dequeue() is
> > >> > > >> >        not called,
> > >> > > >> >      - scenario 2 (user DSQ): tasks enter BPF scheduler custody with full
> > >> > > >> >        enqueue/dequeue lifecycle tracking and state machine validation
> > >> > > >> >        (expects 1:1 enqueue/dequeue pairing).
> > >> > > >> 
> > >> > > >> Could you add a note here about why there's no equivalent to scenario 6?
> > >> > > >> The differentiating factor between that and scenario 2 (nonterminal queue) is 
> > >> > > >> that scx_dsq_insert_commit() is called regardless of whether the queue is terminal.
> > >> > > >> And this makes sense since for non-DSQ queues the BPF scheduler can do its
> > >> > > >> own tracking of enqueue/dequeue (plus it does not make too much sense to
> > >> > > >> do BPF-internal enqueueing in select_cpu).
> > >> > > >> 
> > >> > > >> What do you think? If the above makes sense, maybe we should spell it out 
> > >> > > >> in the documentation too. Maybe also add that it makes no sense to
> > >> > > >> enqueue in an internal BPF structure from select_cpu - the task is not
> > >> > > >> yet enqueued, and would have to go through enqueue anyway.
> > >> > > >
> > >> > > > Oh, I just didn't think about it, we can definitely add to ops.select_cpu()
> > >> > > > a scenario equivalent to scenario 6 (push task to the BPF queue).
> > >> > > >
> > >> > > > From a practical standpoint the benefits are questionable, but in the scope
> > >> > > > of the kselftest I think it makes sense to better validate the entire state
> > >> > > > machine in all cases. I'll add this scenario as well.
> > >> > > >
> > >> > > 
> > >> > > That makes sense! Let's add it for completeness. Even if it doesn't make
> > >> > > sense right now that may change in the future. For example, if we end
> > >> > > up finding a good reason to add the task into an internal structure from
> > >> > > .select_cpu(), we may allow the task to be explicitly marked as being in
> > >> > > the BPF scheduler's custody from a kfunc. Right now we can't do that
> > >> > > from select_cpu() unless we direct dispatch IIUC.
> > >> > 
> > >> > Ok, I'll send a new patch later with the new scenario included. It should
> > >> > work already (if done properly in the test case), I think we don't need to
> > >> > change anything in the kernel.
> > >> 
> > >> Actually I take that back. The internal BPF queue from ops.select_cpu()
> > >> scenario is a bit tricky, because when we return from ops.select_cpu()
> > >> without p->scx.ddsp_dsq_id being set, we don't know if the scheduler added
> > >> the task to an internal BPF queue or simply did nothing.
> > >> 
> > >> We need to add some special logic here, preferably without introducing
> > >> overhead just to handle this particular (really uncommon) case. I'll take a
> > >> look.
> > >
> > > The more I think about this, the more it feels wrong to consider a task as
> > > being "in BPF scheduler custody" if it is stored in a BPF internal data
> > > structure from ops.select_cpu().
> > >
> > > At the point where ops.select_cpu() runs, the task has not yet entered the
> > > BPF scheduler's queues. While it is technically possible to stash the task
> > > in some BPF-managed structure from there, doing so should not imply full
> > > scheduler custody.
> > >
> > > In particular, we should not trigger ops.dequeue(), because the task has
> > > not reached the "enqueue" stage of its lifecycle. ops.select_cpu() is
> > > effectively a pre-enqueue hook, primarily intended as a fast path to bypass
> > > the scheduler altogether. As such, triggering ops.dequeue() in this case
> > > would not make sense IMHO.
> > >
> > > I think it would make more sense to document this behavior explicitly and
> > > leave the kselftest as is.
> > >
> > > Thoughts?
> > 
> > I am going back and forth on this but I think the problem is that the enqueue() 
> > and dequeue() BPF callbacks we have are not actually symmetrical? 
> > 
> > 1) ops.enqueue() is "sched-ext specific work for the scheduler core's enqueue
> > method". This is independent of whether the task ends up in BPF custody or not.
> > It could be in a terminal DSQ, a non-terminal DSQ, or a BPF data structure.
> > 
> > 2) ops.dequeue() is "remove task from BPF custody". E.g., it is used by the
> > BPF scheduler to signal whether it should keep a task within its
> > internal tracking structures.
> > 
> > So the edge case of ops.select_cpu() placing the task in BPF custody is
> > currently valid. The way I see it, we have two choices in terms of
> > semantics:
> > 
> > 1) ops.dequeue() must be the equivalent of ops.enqueue(). If the BPF
> > scheduler writer decides to place a task into BPF custody during the
> > ops.select_cpu() that's on them. ops.select_cpu() is supposed to be a
> > pure function providing a hint, anyway. Using it to place a task into
> > BPF custody is a bit of an abuse even if allowed.
> > 
> > 2) We interpret ops.dequeue() to mean "dequeue from the BPF scheduler".
> > In that case we allow the edge case and interpret ops.dequeue() as "the
> > function that must be called to clear the NEEDS_DEQ/IN_BPF flag", not as
> > the complement of ops.enqueue(). In most cases both will be true, and in
> > the cases where not then it's up to the scheduler writer to understand
> > the nuance.
> > 
> > I think while 2) is cleaner, it is more involved and honestly kinda
> > speculative. However, I think it's fair game since once we settle on
> > the semantics it will be more difficult to change them. Which one do you 
> > think makes more sense?
> 
> Yeah, I'm also going back and forth on this.
> 
> Honestly from a pure theoretical perspective, option (1) feels cleaner to
> me: when ops.select_cpu() runs, the task has not entered the BPF scheduler
> yet. If we trigger ops.dequeue() in this case, we end up with tasks that
> are "leaving" the scheduler without ever having entered it, which feels
> like a violation of the lifecycle model.
> 
> However, from a practical perspective, it's probably more convenient to
> trigger ops.dequeue() also for tasks that are stored in BPF data
> structures or user DSQs from ops.select_cpu(). If we don't allow that,
> we can't just silently ignore the behavior, and it's also pretty hard
> to reliably detect this kind of "abuse" and raise an error at runtime.
> That means it could easily turn into a source of subtle bugs in the future,
> and I don't think documentation alone would be sufficient to prevent that
> (the "don't do that" rules are always fragile).
> 
> Therefore, at the moment I'm more inclined to go with option (2), as it
> provides better robustness and gives schedulers more flexibility.

I'm running into a number of headaches and corner cases if we go with
option (2)... One of them is the following.

Assume we push tasks into a BPF queue from ops.select_cpu() and pop them
from ops.dispatch(). The following scenario can happen:

  CPU0                                         CPU1
  ----                                         ----
  ops.select_cpu()
    bpf_map_push_elem(&queue, &pid, 0)
                                               ops.dispatch()
                                                 bpf_map_pop_elem(&queue, &pid)
                                                 scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL_ON | dst_cpu)
                                                   ==> ops.dequeue() is not triggered!
    p->scx.flags |= SCX_TASK_IN_BPF
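
In BPF scheduler terms, the pattern that triggers this interleaving would
look roughly like the sketch below. This is a non-compilable illustration
only: it assumes the sched_ext kfunc API (scx_bpf_dsq_insert(),
bpf_task_from_pid(), bpf_task_release()) and a hypothetical
BPF_MAP_TYPE_QUEUE map named "queue":

```c
struct {
	__uint(type, BPF_MAP_TYPE_QUEUE);
	__uint(max_entries, 4096);
	__type(value, s32);
} queue SEC(".maps");

s32 BPF_STRUCT_OPS(sketch_select_cpu, struct task_struct *p,
		   s32 prev_cpu, u64 wake_flags)
{
	s32 pid = p->pid;

	/* Stash the task in a BPF-internal queue instead of dispatching. */
	bpf_map_push_elem(&queue, &pid, 0);

	/* No direct dispatch: p->scx.ddsp_dsq_id is left unset. */
	return prev_cpu;
}

void BPF_STRUCT_OPS(sketch_dispatch, s32 cpu, struct task_struct *prev)
{
	struct task_struct *p;
	s32 pid;

	if (bpf_map_pop_elem(&queue, &pid))
		return;

	p = bpf_task_from_pid(pid);
	if (!p)
		return;

	/*
	 * This insert can run before the core has finished processing
	 * ops.select_cpu() for @p on another CPU, so the custody flag
	 * may not be set yet and ops.dequeue() would not be triggered.
	 */
	scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL_ON | cpu, SCX_SLICE_DFL, 0);
	bpf_task_release(p);
}
```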

To fix this, we would need to always set SCX_TASK_IN_BPF before calling
ops.select_cpu(), and then clear it again if the task is directly
dispatched to a terminal DSQ from ops.select_cpu().
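
For illustration, that ordering might look something like the following.
This is a purely hypothetical core-side sketch; the function name and the
SCX_TASK_IN_BPF flag are illustrative, not the actual kernel
implementation:

```c
static s32 select_task_rq_scx_sketch(struct task_struct *p, s32 prev_cpu,
				     u64 wake_flags)
{
	s32 cpu;

	/* Assume BPF custody before the scheduler gets a chance to stash @p. */
	p->scx.flags |= SCX_TASK_IN_BPF;

	cpu = scx_ops.select_cpu(p, prev_cpu, wake_flags);

	/*
	 * Direct dispatch to a terminal DSQ: the task bypasses the BPF
	 * scheduler, so clear the custody flag again. Clearing it here
	 * is where the spurious ops.dequeue() problem comes from.
	 */
	if (p->scx.ddsp_dsq_id != SCX_DSQ_INVALID)
		p->scx.flags &= ~SCX_TASK_IN_BPF;

	return cpu;
}
```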

However, doing so introduces further problems. In particular, we may end up
triggering spurious ops.dequeue() callbacks, which means we would then need
to distinguish whether a task entered BPF custody via ops.select_cpu() or
via ops.enqueue(), and handle the two cases differently. That in turn is
racy and leads to additional locking and complexity.

At that point, it starts to feel like we're over-complicating the design to
support a scenario that is both uncommon and of questionable practical
value.

Given that, I'd suggest proceeding incrementally: for now, we go with
option (1), which looks doable without major changes and probably fixes
the ops.dequeue() semantics for the majority of use cases (which is already
a significant improvement over the current state). Once that is in place,
we can revisit the "store tasks in internal BPF data structures from
ops.select_cpu()" scenario and see if it's worth supporting in a cleaner
way. WDYT?

Thanks,
-Andrea
