Message-ID: <aYoAmES8tKNZJkdx@gpd4>
Date: Mon, 9 Feb 2026 16:43:20 +0100
From: Andrea Righi <arighi@...dia.com>
To: Emil Tsalapatis <emil@...alapatis.com>
Cc: Tejun Heo <tj@...nel.org>, David Vernet <void@...ifault.com>,
Changwoo Min <changwoo@...lia.com>,
Kuba Piecuch <jpiecuch@...gle.com>,
Christian Loehle <christian.loehle@....com>,
Daniel Hodges <hodgesd@...a.com>, sched-ext@...ts.linux.dev,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH 2/2] selftests/sched_ext: Add test to validate
ops.dequeue() semantics

On Mon, Feb 09, 2026 at 10:00:40AM -0500, Emil Tsalapatis wrote:
> On Mon Feb 9, 2026 at 5:20 AM EST, Andrea Righi wrote:
> > On Sun, Feb 08, 2026 at 09:08:38PM +0100, Andrea Righi wrote:
> >> On Sun, Feb 08, 2026 at 12:59:36PM -0500, Emil Tsalapatis wrote:
> >> > On Sun Feb 8, 2026 at 8:55 AM EST, Andrea Righi wrote:
> >> > > On Sun, Feb 08, 2026 at 11:26:13AM +0100, Andrea Righi wrote:
> >> > >> On Sun, Feb 08, 2026 at 10:02:41AM +0100, Andrea Righi wrote:
> >> > >> ...
> >> > >> > > >> > - From ops.select_cpu():
> >> > >> > > >> > - scenario 0 (local DSQ): tasks dispatched to the local DSQ bypass
> >> > >> > > >> > the BPF scheduler entirely; they never enter BPF custody, so
> >> > >> > > >> > ops.dequeue() is not called,
> >> > >> > > >> > - scenario 1 (global DSQ): tasks dispatched to SCX_DSQ_GLOBAL also
> >> > >> > > >> > bypass the BPF scheduler, like the local DSQ; ops.dequeue() is
> >> > >> > > >> > not called,
> >> > >> > > >> > - scenario 2 (user DSQ): tasks enter BPF scheduler custody with full
> >> > >> > > >> > enqueue/dequeue lifecycle tracking and state machine validation
> >> > >> > > >> > (expects 1:1 enqueue/dequeue pairing).
> >> > >> > > >>
> >> > >> > > >> Could you add a note here about why there's no equivalent to scenario 6?
> >> > >> > > >> The differentiating factor between that and scenario 2 (nonterminal queue) is
> >> > >> > > >> that scx_dsq_insert_commit() is called regardless of whether the queue is terminal.
> >> > >> > > >> And this makes sense since for non-DSQ queues the BPF scheduler can do its
> >> > >> > > >> own tracking of enqueue/dequeue (plus it does not make too much sense to
> >> > >> > > >> do BPF-internal enqueueing in select_cpu).
> >> > >> > > >>
> >> > >> > > >> What do you think? If the above makes sense, maybe we should spell it out
> >> > >> > > >> in the documentation too. Maybe also add that it makes no sense to enqueue
> >> > >> > > >> in an internal BPF structure from select_cpu - the task is not yet
> >> > >> > > >> enqueued, and would have to go through enqueue anyway.
> >> > >> > > >
> >> > >> > > > Oh, I just didn't think about it, we can definitely add to ops.select_cpu()
> >> > >> > > > a scenario equivalent to scenario 6 (push task to the BPF queue).
> >> > >> > > >
> >> > >> > > > From a practical standpoint the benefits are questionable, but in the scope
> >> > >> > > > of the kselftest I think it makes sense to better validate the entire state
> >> > >> > > > machine in all cases. I'll add this scenario as well.
> >> > >> > > >
> >> > >> > >
> >> > >> > > That makes sense! Let's add it for completeness. Even if it doesn't make
> >> > >> > > sense right now that may change in the future. For example, if we end
> >> > >> > > up finding a good reason to add the task into an internal structure from
> >> > >> > > .select_cpu(), we may allow the task to be explicitly marked as being in
> >> > >> > > the BPF scheduler's custody from a kfunc. Right now we can't do that
> >> > >> > > from select_cpu() unless we direct dispatch IIUC.
> >> > >> >
> >> > >> > Ok, I'll send a new patch later with the new scenario included. It should
> >> > >> > work already (if done properly in the test case), I think we don't need to
> >> > >> > change anything in the kernel.
> >> > >>
> >> > >> Actually I take that back. The internal BPF queue from ops.select_cpu()
> >> > >> scenario is a bit tricky, because when we return from ops.select_cpu()
> >> > >> without p->scx.ddsp_dsq_id being set, we don't know if the scheduler added
> >> > >> the task to an internal BPF queue or simply did nothing.
> >> > >>
> >> > >> We need to add some special logic here, preferably without introducing
> >> > >> overhead just to handle this particular (really uncommon) case. I'll take a
> >> > >> look.
> >> > >
> >> > > The more I think about this, the more it feels wrong to consider a task as
> >> > > being "in BPF scheduler custody" if it is stored in a BPF internal data
> >> > > structure from ops.select_cpu().
> >> > >
> >> > > At the point where ops.select_cpu() runs, the task has not yet entered the
> >> > > BPF scheduler's queues. While it is technically possible to stash the task
> >> > > in some BPF-managed structure from there, doing so should not imply full
> >> > > scheduler custody.
> >> > >
> >> > > In particular, we should not trigger ops.dequeue(), because the task has
> >> > > not reached the "enqueue" stage of its lifecycle. ops.select_cpu() is
> >> > > effectively a pre-enqueue hook, primarily intended as a fast path to bypass
> >> > > the scheduler altogether. As such, triggering ops.dequeue() in this case
> >> > > would not make sense IMHO.
> >> > >
> >> > > I think it would make more sense to document this behavior explicitly and
> >> > > leave the kselftest as is.
> >> > >
> >> > > Thoughts?
> >> >
> >> > I am going back and forth on this but I think the problem is that the enqueue()
> >> > and dequeue() BPF callbacks we have are not actually symmetrical?
> >> >
> >> > 1) ops.enqueue() is "sched-ext specific work for the scheduler core's enqueue
> >> > method". This is independent on whether the task ends up in BPF custody or not.
> >> > It could be in a terminal DSQ, a non-terminal DSQ, or a BPF data structure.
> >> >
> >> > 2) ops.dequeue() is "remove task from BPF custody". E.g., it is used by the
> >> > BPF scheduler to signal whether it should keep a task within its
> >> > internal tracking structures.
> >> >
> >> > So the edge case of ops.select_cpu() placing the task in BPF custody is
> >> > currently valid. The way I see it, we have two choices in terms of
> >> > semantics:
> >> >
> >> > 1) ops.dequeue() must be the equivalent of ops.enqueue(). If the BPF
> >> > scheduler writer decides to place a task into BPF custody during the
> >> > ops.select_cpu() that's on them. ops.select_cpu() is supposed to be a
> >> > pure function providing a hint, anyway. Using it to place a task into
> >> > BPF is a bit of an abuse even if allowed.
> >> >
> >> > 2) We interpret ops.dequeue() to mean "dequeue from the BPF scheduler".
> >> > In that case we allow the edge case and interpret ops.dequeue() as "the
> >> > function that must be called to clear the NEEDS_DEQ/IN_BPF flag", not as
> >> > the complement of ops.enqueue(). In most cases both will be true, and in
> >> > the cases where not then it's up to the scheduler writer to understand
> >> > the nuance.
> >> >
> >> > I think while 2) is cleaner, it is more involved and honestly kinda
> >> > speculative. However, I think it's fair game since once we settle on
> >> > the semantics it will be more difficult to change them. Which one do you
> >> > think makes more sense?
> >>
> >> Yeah, I'm also going back and forth on this.
> >>
> >> Honestly from a pure theoretical perspective, option (1) feels cleaner to
> >> me: when ops.select_cpu() runs, the task has not entered the BPF scheduler
> >> yet. If we trigger ops.dequeue() in this case, we end up with tasks that
> >> are "leaving" the scheduler without ever having entered it, which feels
> >> like a violation of the lifecycle model.
> >>
> >> However, from a practical perspective, it's probably more convenient to
> >> trigger ops.dequeue() also for tasks that are stored in BPF data structures
> >> or user DSQs from ops.select_cpu(). If we don't allow that, we
> >> can't just silently ignore the behavior and it's also pretty hard to
> >> reliably detect and trigger an error for this kind of "abuse" at runtime.
> >> That means it could easily turn into a source of subtle bugs in the future,
> >> and I don't think documentation alone would be sufficient to prevent that
> >> (the "don't do that" rules are always fragile).
> >>
> >> Therefore, at the moment I'm more inclined to go with option (2), as it
> >> provides better robustness and gives schedulers more flexibility.
> >
> > I'm running into a number of headaches and corner cases if we go with
> > option (2)... One of them is the following.
> >
> > Assume we push tasks into a BPF queue from ops.select_cpu() and pop them
> > from ops.dispatch(). The following scenario can happen:
> >
> >    CPU0                                     CPU1
> >    ----                                     ----
> >    ops.select_cpu()
> >      bpf_map_push_elem(&queue, &pid, 0)
> >                                             ops.dispatch()
> >                                               bpf_map_pop_elem(&queue, &pid)
> >                                               scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL_ON | dst_cpu)
> >                                               ==> ops.dequeue() is not triggered!
> >    p->scx.flags |= SCX_TASK_IN_BPF
> >
> > To fix this, we would need to always set SCX_TASK_IN_BPF before calling
> > ops.select_cpu(), and then clear it again if the task is directly
> > dispatched to a terminal DSQ from ops.select_cpu().
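
(Roughly, the ordering that fix would imply on the core side; this is just a
pseudocode sketch, where dsq_is_terminal() is a made-up helper for the
terminal (local/global) DSQs and SCX_TASK_IN_BPF is only the flag name we
have been using in this thread:)

  p->scx.flags |= SCX_TASK_IN_BPF;               /* assume BPF custody up front */
  cpu = ops.select_cpu(p, prev_cpu, wake_flags);
  if (dsq_is_terminal(p->scx.ddsp_dsq_id))       /* directly dispatched to a terminal DSQ */
          p->scx.flags &= ~SCX_TASK_IN_BPF;      /* task never entered BPF custody */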
> >
> > However, doing so introduces further problems. In particular, we may end up
> > triggering spurious ops.dequeue() callbacks, which means we would then need
> > to distinguish whether a task entered BPF custody via ops.select_cpu() or
> > via ops.enqueue(), and handle the two cases differently. Which is also racy
> > and leads to additional locking and complexity.
> >
> > At that point, it starts to feel like we're over-complicating the design to
> > support a scenario that is both uncommon and of questionable practical
> > value.
> >
> > Given that, I'd suggest proceeding incrementally: for now, we go with
> > option (1), which looks doable without major changes and it probably fixes
> > the ops.dequeue() semantics for the majority of use cases (which is already
> > a significant improvement over the current state). Once that is in place,
> > we can revisit the "store tasks in internal BPF data structures from
> > ops.select_cpu()" scenario and see if it's worth supporting it in a cleaner
> > way. WDYT?
> >
>
> I agree with going with option 1.
>
> For the select_cpu() edge case, how about introducing an explicit
> kfunc scx_place_in_bpf_custody() later? Placing a task in BPF custody
> during select_cpu() is already pretty niche, so we can assume the
> scheduler writer knows what they're doing. In that case, let's let
> _them_ decide when in select_cpu() the task is considered "in BPF".
> They can also do their own locking to avoid races with locking on
> the task context. This keeps the state machine clean for the average
> scheduler while still handling the edge case. DYT that would work?

Yeah, I was also considering introducing dedicated kfuncs so that the BPF
scheduler can explicitly manage the "in BPF custody" state, decoupling the
notion of BPF custody from ops.enqueue(). With such an interface, a
scheduler could do something like:

  ops.select_cpu()
  {
          s32 pid = p->pid;

          scx_bpf_enter_custody(p);
          if (!bpf_map_push_elem(&bpf_queue, &pid, 0)) {
                  set_task_state(TASK_ENQUEUED);
          } else {
                  scx_bpf_exit_custody(p);
                  set_task_state(TASK_NONE);
          }

          return prev_cpu;
  }

On the implementation side, entering / leaving BPF custody is essentially
setting / clearing SCX_TASK_IN_BPF, with the scheduler taking full
responsibility for keeping the flag consistent: if you set the flag,
ops.dequeue() is called when the task leaves custody; if you clear it, the
task falls back to the default custody behavior.
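
To make it concrete, the kernel side of such kfuncs could boil down to
something like the sketch below (the kfunc names come from the example
above and SCX_TASK_IN_BPF is only the flag name we have been using in this
thread; locking/ordering is deliberately ignored here):

  __bpf_kfunc void scx_bpf_enter_custody(struct task_struct *p)
  {
          /*
           * The task is now owned by the BPF scheduler: ops.dequeue() is
           * expected to be called when it leaves custody.
           */
          p->scx.flags |= SCX_TASK_IN_BPF;
  }

  __bpf_kfunc void scx_bpf_exit_custody(struct task_struct *p)
  {
          /* Back to the default custody behavior, no ops.dequeue(). */
          p->scx.flags &= ~SCX_TASK_IN_BPF;
  }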
But I think this is something to explore in the future; for now I'd go with
the easier approach first. :)

Thanks,
-Andrea