[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <aYRiM5leoHcWiXw3@gpd4>
Date: Thu, 5 Feb 2026 10:26:11 +0100
From: Andrea Righi <arighi@...dia.com>
To: Tejun Heo <tj@...nel.org>
Cc: David Vernet <void@...ifault.com>, Changwoo Min <changwoo@...lia.com>,
Kuba Piecuch <jpiecuch@...gle.com>,
Emil Tsalapatis <emil@...alapatis.com>,
Christian Loehle <christian.loehle@....com>,
Daniel Hodges <hodgesd@...a.com>, sched-ext@...ts.linux.dev,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics
Hi Tejun,
On Wed, Feb 04, 2026 at 12:14:40PM -1000, Tejun Heo wrote:
> Hello,
>
> On Wed, Feb 04, 2026 at 05:05:58PM +0100, Andrea Righi wrote:
> > Currently, ops.dequeue() is only invoked when the sched_ext core knows
> > that a task resides in BPF-managed data structures, which causes it to
> > miss scheduling property change events. In addition, ops.dequeue()
> > callbacks are completely skipped when tasks are dispatched to non-local
> > DSQs from ops.select_cpu(). As a result, BPF schedulers cannot reliably
> > track task state.
> >
> > Fix this by guaranteeing that each task entering the BPF scheduler's
> > custody triggers exactly one ops.dequeue() call when it leaves that
> > custody, whether the exit is due to a dispatch (regular or via a core
> > scheduling pick) or to a scheduling property change (e.g.
> > sched_setaffinity(), sched_setscheduler(), set_user_nice(), NUMA
> > balancing, etc.).
> >
> > BPF scheduler custody concept: a task is considered to be in "BPF
> > scheduler's custody" when it has been queued in user-created DSQs and
> > the BPF scheduler is responsible for its lifecycle. Custody ends when
> > the task is dispatched to a terminal DSQ (local DSQ or SCX_DSQ_GLOBAL),
> > selected by core scheduling, or removed due to a property change.
> >
> > Tasks directly dispatched to terminal DSQs bypass the BPF scheduler
> > entirely and are not in its custody. Terminal DSQs include:
> > - Local DSQs (%SCX_DSQ_LOCAL or %SCX_DSQ_LOCAL_ON): per-CPU queues
> > where tasks go directly to execution.
> > - Global DSQ (%SCX_DSQ_GLOBAL): the built-in fallback queue where the
> > BPF scheduler is considered "done" with the task.
> >
> > As a result, ops.dequeue() is not invoked for tasks dispatched to
> > terminal DSQs, as the BPF scheduler no longer retains custody of them.
> >
> > To identify dequeues triggered by scheduling property changes, introduce
> > the new ops.dequeue() flag %SCX_DEQ_SCHED_CHANGE: when this flag is set,
> > the dequeue was caused by a scheduling property change.
> >
> ...
> > + **Property Change Notifications for Running Tasks**:
> > +
> > + For tasks that have left BPF custody (running or on terminal DSQs),
> > + property changes can be intercepted through the dedicated callbacks:
>
> I'm not sure this section is necessary. The way it's phrased makes it sound
> like schedulers would use DEQ_SCHED_CHANGE to process property changes but
> that's not the case. Relevant property changes will be notified in whatever
> ways they're notified and a task being dequeued for SCHED_CHANGE doesn't
> necessarily mean there will be an associated property change event either.
> e.g. We don't do anything re. on sched_setnuma().
Agreed, this section is a bit misleading, DEQ_SCHED_CHANGE is an
informational flag indicating the ops.dequeue() wasn't due to dispatch,
schedulers shouldn't use it to process property changes. I'll remove it.
>
> > @@ -1102,6 +1122,18 @@ static void dispatch_enqueue(struct scx_sched *sch, struct scx_dispatch_q *dsq,
> > dsq_mod_nr(dsq, 1);
> > p->scx.dsq = dsq;
> >
> > + /*
> > + * Mark task as in BPF scheduler's custody if being queued to a
> > + * non-builtin (user) DSQ. Builtin DSQs (local, global, bypass) are
> > + * terminal: tasks on them have left BPF custody.
> > + *
> > + * Don't touch the flag if already set (e.g., by
> > + * mark_direct_dispatch() or direct_dispatch()/finish_dispatch()
> > + * for user DSQs).
> > + */
> > + if (SCX_HAS_OP(sch, dequeue) && !(dsq->id & SCX_DSQ_FLAG_BUILTIN))
> > + p->scx.flags |= SCX_TASK_OPS_ENQUEUED;
>
> given that this is tied to dequeue, maybe a more direct name would be less
> confusing? e.g. something like SCX_TASK_NEED_DEQ?
Ack.
>
> > @@ -1274,6 +1306,24 @@ static void mark_direct_dispatch(struct scx_sched *sch,
> >
> > p->scx.ddsp_dsq_id = dsq_id;
> > p->scx.ddsp_enq_flags = enq_flags;
> > +
> > + /*
> > + * Mark the task as entering BPF scheduler's custody if it's being
> > + * dispatched to a non-terminal DSQ (i.e., custom user DSQs). This
> > + * handles the case where ops.select_cpu() directly dispatches - even
> > + * though ops.enqueue() won't be called, the task enters BPF custody
> > + * if dispatched to a user DSQ and should get ops.dequeue() when it
> > + * leaves.
> > + *
> > + * For terminal DSQs (local DSQs and SCX_DSQ_GLOBAL), ensure the flag
> > + * is clear since the BPF scheduler is done with the task.
> > + */
> > + if (SCX_HAS_OP(sch, dequeue)) {
> > + if (!is_terminal_dsq(dsq_id))
> > + p->scx.flags |= SCX_TASK_OPS_ENQUEUED;
> > + else
> > + p->scx.flags &= ~SCX_TASK_OPS_ENQUEUED;
> > + }
>
> Hmm... I'm a bit confused on why this needs to be in mark_direct_dispatch()
> AND dispatch_enqueue(). The flag should be clear when off SCX. The only
> places where it could be set is from the enqueue path - when a task is
> direct dispatched to a non-terminal DSQ or BPF. Both cases can be reliably
> captured in do_enqueue_task(), no?
You're right. I was incorrectly assuming we needed this in
mark_direct_dispatch() to catch direct dispatches to user DSQs from
ops.select_cpu(), but that's not true. All paths go through
do_enqueue_task() which funnels to dispatch_enqueue(), so we can handle it
all in one place.
>
> > static void direct_dispatch(struct scx_sched *sch, struct task_struct *p,
> > @@ -1287,6 +1337,41 @@ static void direct_dispatch(struct scx_sched *sch, struct task_struct *p,
> ...
> > + if (SCX_HAS_OP(sch, dequeue)) {
> > + if (!is_terminal_dsq(dsq->id)) {
> > + p->scx.flags |= SCX_TASK_OPS_ENQUEUED;
> > + } else {
> > + if (p->scx.flags & SCX_TASK_OPS_ENQUEUED)
> > + SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq, p, 0);
> > + p->scx.flags &= ~SCX_TASK_OPS_ENQUEUED;
> > + }
> > + }
>
> And when would direct_dispatch() need to call ops.dequeue()?
> direct_dispatch() is only used from do_enqueue_task() and there can only be
> one direct dispatch attempt on any given enqueue event. A task being
> enqueued shouldn't have the OPS_ENQUEUED set and would get dispatched once
> to either a terminal or non-terminal DSQ. If terminal, there's nothing to
> do. If non-terminal, the flag would need to be set. Am I missing something?
Nah, you're right, direct_dispatch() doesn't need to call ops.dequeue() or
manage the flag. I'll remove all the flag management from direct_dispatch()
and centralize it in dispatch_enqueue().
>
> > @@ -1523,6 +1608,31 @@ static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags)
> ...
> > + if (SCX_HAS_OP(sch, dequeue) &&
> > + p->scx.flags & SCX_TASK_OPS_ENQUEUED) {
>
> nit: () around & expression.
>
> > + u64 flags = deq_flags;
> > +
> > + if (!(deq_flags & (DEQUEUE_SLEEP | SCX_DEQ_CORE_SCHED_EXEC)))
> > + flags |= SCX_DEQ_SCHED_CHANGE;
> > +
> > + SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq, p, flags);
> > + p->scx.flags &= ~SCX_TASK_OPS_ENQUEUED;
> > + }
> > break;
> > case SCX_OPSS_QUEUEING:
> > /*
> > @@ -1531,9 +1641,24 @@ static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags)
> > */
> > BUG();
> > case SCX_OPSS_QUEUED:
> > - if (SCX_HAS_OP(sch, dequeue))
> > - SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq,
> > - p, deq_flags);
> > + /*
> > + * Task is still on the BPF scheduler (not dispatched yet).
> > + * Call ops.dequeue() to notify. Add %SCX_DEQ_SCHED_CHANGE
> > + * only for property changes, not for core-sched picks or
> > + * sleep.
> > + *
> > + * Clear the flag after calling ops.dequeue(): the task is
> > + * leaving BPF scheduler's custody.
> > + */
> > + if (SCX_HAS_OP(sch, dequeue)) {
> > + u64 flags = deq_flags;
> > +
> > + if (!(deq_flags & (DEQUEUE_SLEEP | SCX_DEQ_CORE_SCHED_EXEC)))
> > + flags |= SCX_DEQ_SCHED_CHANGE;
> > +
> > + SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq, p, flags);
> > + p->scx.flags &= ~SCX_TASK_OPS_ENQUEUED;
>
> I wonder whether this and the above block can be factored somehow.
Ack, we can add a helper for this.
>
> > @@ -1630,6 +1755,7 @@ static void move_local_task_to_local_dsq(struct task_struct *p, u64 enq_flags,
> > struct scx_dispatch_q *src_dsq,
> > struct rq *dst_rq)
> > {
> > + struct scx_sched *sch = scx_root;
> > struct scx_dispatch_q *dst_dsq = &dst_rq->scx.local_dsq;
> >
> > /* @dsq is locked and @p is on @dst_rq */
> > @@ -1638,6 +1764,15 @@ static void move_local_task_to_local_dsq(struct task_struct *p, u64 enq_flags,
> >
> > WARN_ON_ONCE(p->scx.holding_cpu >= 0);
> >
> > + /*
> > + * Task is moving from a non-local DSQ to a local DSQ. Call
> > + * ops.dequeue() if the task was in BPF custody.
> > + */
> > + if (SCX_HAS_OP(sch, dequeue) && (p->scx.flags & SCX_TASK_OPS_ENQUEUED)) {
> > + SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, dst_rq, p, 0);
> > + p->scx.flags &= ~SCX_TASK_OPS_ENQUEUED;
> > + }
> > +
> > if (enq_flags & (SCX_ENQ_HEAD | SCX_ENQ_PREEMPT))
> > list_add(&p->scx.dsq_list.node, &dst_dsq->list);
> > else
> > @@ -2107,6 +2242,36 @@ static void finish_dispatch(struct scx_sched *sch, struct rq *rq,
> >
> > BUG_ON(!(p->scx.flags & SCX_TASK_QUEUED));
> >
> > + /*
> > + * Handle ops.dequeue() based on destination DSQ.
> > + *
> > + * Dispatch to terminal DSQs (local DSQs and SCX_DSQ_GLOBAL): the BPF
> > + * scheduler is done with the task. Call ops.dequeue() if it was in
> > + * BPF custody, then clear the %SCX_TASK_OPS_ENQUEUED flag.
> > + *
> > + * Dispatch to user DSQs: task is in BPF scheduler's custody.
> > + * Mark it so ops.dequeue() will be called when it leaves.
> > + */
> > + if (SCX_HAS_OP(sch, dequeue)) {
> > + if (!is_terminal_dsq(dsq_id)) {
> > + p->scx.flags |= SCX_TASK_OPS_ENQUEUED;
> > + } else {
>
> Let's do "if (COND) { A } else { B }" instead of "if (!COND) { B } else { A
> }". Continuing from earlier, I don't understand why we'd need to set
> OPS_ENQUEUED here. Given that a transition to a terminal DSQ is terminal, I
> can't think of conditions where we'd need to set OPS_ENQUEUED from
> ops.dispatch().
Right, a task that reaches ops.dispatch() is already in QUEUED state, if
it's in a user DSQ the flag is already set from when it was enqueued, so
there's no need to set the flag in finish_dispatch().
>
> > + /*
> > + * Locking: we're holding the @rq lock (the
> > + * dispatch CPU's rq), but not necessarily
> > + * task_rq(p), since @p may be from a remote CPU.
> > + *
> > + * This is safe because SCX_OPSS_DISPATCHING state
> > + * prevents racing dequeues, any concurrent
> > + * ops_dequeue() will wait for this state to clear.
> > + */
> > + if (p->scx.flags & SCX_TASK_OPS_ENQUEUED)
> > + SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq, p, 0);
> > +
> > + p->scx.flags &= ~SCX_TASK_OPS_ENQUEUED;
> > + }
> > + }
>
> I'm not sure finish_dispatch() is the right place to do this. e.g.
> scx_bpf_dsq_move() can also move tasks from a user DSQ to a terminal DSQ and
> the above wouldn't cover it. Wouldn't it make more sense to do this in
> dispatch_enqueue()?
Agreed.
>
> > @@ -2894,6 +3059,14 @@ static void scx_enable_task(struct task_struct *p)
> >
> > lockdep_assert_rq_held(rq);
> >
> > + /*
> > + * Clear enqueue/dequeue tracking flags when enabling the task.
> > + * This ensures a clean state when the task enters SCX. Only needed
> > + * if ops.dequeue() is implemented.
> > + */
> > + if (SCX_HAS_OP(sch, dequeue))
> > + p->scx.flags &= ~SCX_TASK_OPS_ENQUEUED;
> > +
> > /*
> > * Set the weight before calling ops.enable() so that the scheduler
> > * doesn't see a stale value if they inspect the task struct.
> > @@ -2925,6 +3098,13 @@ static void scx_disable_task(struct task_struct *p)
> > if (SCX_HAS_OP(sch, disable))
> > SCX_CALL_OP_TASK(sch, SCX_KF_REST, disable, rq, p);
> > scx_set_task_state(p, SCX_TASK_READY);
> > +
> > + /*
> > + * Clear enqueue/dequeue tracking flags when disabling the task.
> > + * Only needed if ops.dequeue() is implemented.
> > + */
> > + if (SCX_HAS_OP(sch, dequeue))
> > + p->scx.flags &= ~SCX_TASK_OPS_ENQUEUED;
>
> If we make the flag transitions consistent, we shouldn't need these, right?
> We can add WARN_ON_ONCE() at the head of enqueue maybe.
Correct.
Thanks for the review! I'll post a new version.
-Andrea
Powered by blists - more mailing lists