Message-ID: <aYcFOVlJhUU5huNd@gpd4>
Date: Sat, 7 Feb 2026 10:26:17 +0100
From: Andrea Righi <arighi@...dia.com>
To: Emil Tsalapatis <emil@...alapatis.com>
Cc: Tejun Heo <tj@...nel.org>, David Vernet <void@...ifault.com>,
Changwoo Min <changwoo@...lia.com>,
Kuba Piecuch <jpiecuch@...gle.com>,
Christian Loehle <christian.loehle@....com>,
Daniel Hodges <hodgesd@...a.com>, sched-ext@...ts.linux.dev,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics
Hi Emil,
On Fri, Feb 06, 2026 at 03:35:34PM -0500, Emil Tsalapatis wrote:
> On Fri Feb 6, 2026 at 8:54 AM EST, Andrea Righi wrote:
...
> > diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h
> > index bcb962d5ee7d8..c48f818eee9b8 100644
> > --- a/include/linux/sched/ext.h
> > +++ b/include/linux/sched/ext.h
> > @@ -84,6 +84,7 @@ struct scx_dispatch_q {
> > /* scx_entity.flags */
> > enum scx_ent_flags {
> > SCX_TASK_QUEUED = 1 << 0, /* on ext runqueue */
> > + SCX_TASK_NEED_DEQ = 1 << 1, /* in BPF custody, needs ops.dequeue() when leaving */
>
> Can we make this "SCX_TASK_IN_BPF"? Since we've now defined what it means to be
> in BPF custody vs the core scx scheduler (terminal DSQs) this is a more
> general property that can be useful to check in the future. An example:
> We can now assert that a task's BPF state is consistent with its actual
> kernel state when using BPF-based data structures to manage tasks.
Ack. I like SCX_TASK_IN_BPF, and I also like the idea of reusing the flag
for other purposes. It can be helpful for debugging as well.
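As a userspace sketch (not kernel code) of the kind of consistency check
you mention, something like the following could work; the struct fields
and helper names here are made up for illustration, only the flag names
mirror the patch:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum scx_ent_flags {
	SCX_TASK_QUEUED = 1 << 0,	/* on ext runqueue */
	SCX_TASK_IN_BPF = 1 << 1,	/* in BPF custody */
};

struct task_model {
	uint32_t flags;		/* mirrors p->scx.flags */
	bool bpf_tracked;	/* hypothetical: task present in a BPF-side map */
};

/*
 * The invariant: a task the BPF scheduler believes it holds must carry
 * SCX_TASK_IN_BPF, and a task it has released must not.
 */
static bool custody_consistent(const struct task_model *t)
{
	return !!(t->flags & SCX_TASK_IN_BPF) == t->bpf_tracked;
}
```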
>
> > SCX_TASK_RESET_RUNNABLE_AT = 1 << 2, /* runnable_at should be reset */
> > SCX_TASK_DEQD_FOR_SLEEP = 1 << 3, /* last dequeue was for SLEEP */
> >
> > diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> > index 0bb8fa927e9e9..d17fd9141adf4 100644
> > --- a/kernel/sched/ext.c
> > +++ b/kernel/sched/ext.c
> > @@ -925,6 +925,27 @@ static void touch_core_sched(struct rq *rq, struct task_struct *p)
> > #endif
> > }
> >
> > +/**
> > + * is_terminal_dsq - Check if a DSQ is terminal for ops.dequeue() purposes
> > + * @dsq_id: DSQ ID to check
> > + *
> > + * Returns true if @dsq_id is a terminal/builtin DSQ where the BPF
> > + * scheduler is considered "done" with the task.
> > + *
> > + * Builtin DSQs include:
> > + * - Local DSQs (%SCX_DSQ_LOCAL or %SCX_DSQ_LOCAL_ON): per-CPU queues
> > + * where tasks go directly to execution,
> > + * - Global DSQ (%SCX_DSQ_GLOBAL): built-in fallback queue,
> > + * - Bypass DSQ: used during bypass mode.
> > + *
> > + * Tasks dispatched to builtin DSQs exit BPF scheduler custody and do not
> > + * trigger ops.dequeue() when they are later consumed.
> > + */
> > +static inline bool is_terminal_dsq(u64 dsq_id)
> > +{
> > + return dsq_id & SCX_DSQ_FLAG_BUILTIN;
> > +}
> > +
> > /**
> > * touch_core_sched_dispatch - Update core-sched timestamp on dispatch
> > * @rq: rq to read clock from, must be locked
> > @@ -1008,7 +1029,8 @@ static void local_dsq_post_enq(struct scx_dispatch_q *dsq, struct task_struct *p
> > resched_curr(rq);
> > }
> >
> > -static void dispatch_enqueue(struct scx_sched *sch, struct scx_dispatch_q *dsq,
> > +static void dispatch_enqueue(struct scx_sched *sch, struct rq *rq,
> > + struct scx_dispatch_q *dsq,
> > struct task_struct *p, u64 enq_flags)
> > {
> > bool is_local = dsq->id == SCX_DSQ_LOCAL;
> > @@ -1103,6 +1125,27 @@ static void dispatch_enqueue(struct scx_sched *sch, struct scx_dispatch_q *dsq,
> > dsq_mod_nr(dsq, 1);
> > p->scx.dsq = dsq;
> >
> > + /*
> > + * Handle ops.dequeue() and custody tracking.
> > + *
> > + * Builtin DSQs (local, global, bypass) are terminal: the BPF
> > + * scheduler is done with the task. If it was in BPF custody, call
> > + * ops.dequeue() and clear the flag.
> > + *
> > + * User DSQs: Task is in BPF scheduler's custody. Set the flag so
> > + * ops.dequeue() will be called when it leaves.
> > + */
> > + if (SCX_HAS_OP(sch, dequeue)) {
> > + if (is_terminal_dsq(dsq->id)) {
> > + if (p->scx.flags & SCX_TASK_NEED_DEQ)
> > + SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue,
> > + rq, p, 0);
> > + p->scx.flags &= ~SCX_TASK_NEED_DEQ;
> > + } else {
> > + p->scx.flags |= SCX_TASK_NEED_DEQ;
> > + }
> > + }
> > +
> > /*
> > * scx.ddsp_dsq_id and scx.ddsp_enq_flags are only relevant on the
> > * direct dispatch path, but we clear them here because the direct
> > @@ -1323,7 +1366,7 @@ static void direct_dispatch(struct scx_sched *sch, struct task_struct *p,
> > return;
> > }
> >
> > - dispatch_enqueue(sch, dsq, p,
> > + dispatch_enqueue(sch, rq, dsq, p,
> > p->scx.ddsp_enq_flags | SCX_ENQ_CLEAR_OPSS);
> > }
> >
> > @@ -1407,13 +1450,22 @@ static void do_enqueue_task(struct rq *rq, struct task_struct *p, u64 enq_flags,
> > * dequeue may be waiting. The store_release matches their load_acquire.
> > */
> > atomic_long_set_release(&p->scx.ops_state, SCX_OPSS_QUEUED | qseq);
> > +
> > + /*
> > + * Task is now in BPF scheduler's custody (queued on BPF internal
> > + * structures). Set %SCX_TASK_NEED_DEQ so ops.dequeue() is called
> > + * when it leaves custody (e.g. dispatched to a terminal DSQ or on
> > + * property change).
> > + */
> > + if (SCX_HAS_OP(sch, dequeue))
>
> Related to the rename: Can we remove the guards and track the flag
> regardless of whether ops.dequeue() is present?
>
> There is no reason not to track whether a task is in BPF or the core,
> and it is a property that's independent of whether we implement ops.dequeue().
> This also simplifies the code since we now just guard the actual ops.dequeue()
> call.
I was concerned about introducing overhead: with the guard we can save a
few memory writes to p->scx.flags. But I don't have numbers, and the
overhead is probably negligible.
Also, now that ops.dequeue() actually works, more schedulers will likely
start implementing an ops.dequeue() callback, so the guard itself may
become the extra overhead.
So, I guess we can remove the guard and just set/clear the flag even
without an ops.dequeue() callback...
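To make sure we're on the same page, here's a userspace sketch of the
simplified logic (all types and helpers are stand-ins for the kernel
ones; has_dequeue_op stands in for SCX_HAS_OP(sch, dequeue), and the
counter increment stands in for the SCX_CALL_OP_TASK() call):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SCX_DSQ_FLAG_BUILTIN	(1ULL << 63)
#define SCX_TASK_IN_BPF		(1U << 1)

struct task_model {
	uint32_t flags;		/* mirrors p->scx.flags */
	int dequeue_calls;	/* counts ops.dequeue() invocations */
};

static bool is_terminal_dsq(uint64_t dsq_id)
{
	return dsq_id & SCX_DSQ_FLAG_BUILTIN;
}

/*
 * Custody is tracked unconditionally; only the actual callback is
 * guarded on ops.dequeue() being implemented.
 */
static void track_custody(struct task_model *p, uint64_t dsq_id,
			  bool has_dequeue_op)
{
	if (is_terminal_dsq(dsq_id)) {
		if ((p->flags & SCX_TASK_IN_BPF) && has_dequeue_op)
			p->dequeue_calls++;	/* SCX_CALL_OP_TASK(...) */
		p->flags &= ~SCX_TASK_IN_BPF;
	} else {
		p->flags |= SCX_TASK_IN_BPF;
	}
}
```

So dispatching to a user DSQ always marks the task as in BPF custody,
and moving it to a builtin DSQ always clears the flag, with the callback
fired only when both the flag and the op are present.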
Thanks,
-Andrea