Message-ID: <aVKt-cmkOxPe3MI6@gpd4>
Date: Mon, 29 Dec 2025 17:36:09 +0100
From: Andrea Righi <arighi@...dia.com>
To: Emil Tsalapatis <emil@...alapatis.com>
Cc: Tejun Heo <tj@...nel.org>, David Vernet <void@...ifault.com>,
Changwoo Min <changwoo@...lia.com>,
Daniel Hodges <hodgesd@...a.com>, sched-ext@...ts.linux.dev,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics
Hi Emil,
On Sat, Dec 27, 2025 at 10:20:06PM -0500, Emil Tsalapatis wrote:
> On Fri Dec 19, 2025 at 5:43 PM EST, Andrea Righi wrote:
> > Properly implement ops.dequeue() to ensure every ops.enqueue() is
> > balanced by a corresponding ops.dequeue() call, regardless of whether
> > the task is on a BPF data structure or already dispatched to a DSQ.
> >
> > A task is considered enqueued when it is owned by the BPF scheduler.
> > This ownership persists until the task is either dispatched (moved to a
> > local DSQ for execution) or removed from the BPF scheduler, such as when
> > it blocks waiting for an event or when its properties (for example, CPU
> > affinity or priority) are updated.
> >
> > When the task enters the BPF scheduler, ops.enqueue() is invoked; when
> > it leaves BPF scheduler ownership, ops.dequeue() is invoked.
> >
> > This allows BPF schedulers to reliably track task ownership and maintain
> > accurate accounting.
> >
> > Cc: Emil Tsalapatis <emil@...alapatis.com>
> > Signed-off-by: Andrea Righi <arighi@...dia.com>
> > ---
>
>
> Hi Andrea,
>
> This change looks reasonable to me. Some comments inline:
>
> > Documentation/scheduler/sched-ext.rst | 22 ++++++++++++++++++++++
> > include/linux/sched/ext.h | 1 +
> > kernel/sched/ext.c | 27 ++++++++++++++++++++++++++-
> > 3 files changed, 49 insertions(+), 1 deletion(-)
> >
> > diff --git a/Documentation/scheduler/sched-ext.rst b/Documentation/scheduler/sched-ext.rst
> > index 404fe6126a769..3ed4be53f97da 100644
> > --- a/Documentation/scheduler/sched-ext.rst
> > +++ b/Documentation/scheduler/sched-ext.rst
> > @@ -252,6 +252,26 @@ The following briefly shows how a waking task is scheduled and executed.
> >
> > * Queue the task on the BPF side.
> >
> > + Once ``ops.enqueue()`` is called, the task is considered "enqueued" and
> > + is owned by the BPF scheduler. Ownership is retained until the task is
> > + either dispatched (moved to a local DSQ for execution) or dequeued
> > + (removed from the scheduler due to a blocking event, or to modify a
> > + property, like CPU affinity, priority, etc.). When the task leaves the
> > + BPF scheduler, ``ops.dequeue()`` is invoked.
> > +
>
> Can we say "leaves the scx class" instead? On direct dispatch we
> technically never insert the task into the BPF scheduler.
Hmm, I agree that'd be more accurate, but it might also be slightly
misleading, as it could be interpreted as the task being moved to a
different scheduling class. How about saying "leaves the enqueued state"
instead, where "enqueued" means that ops.enqueue() has been called? I can't
find a better name for this state; something like "ops_enqueued" would be
even more confusing. :)
>
> > + **Important**: ``ops.dequeue()`` is called for *any* enqueued task,
> > + regardless of whether the task is still on a BPF data structure or has
> > + already been dispatched to a DSQ (global, local, or user DSQ).
> > +
> > + This guarantees that every ``ops.enqueue()`` will eventually be followed
> > + by an ``ops.dequeue()``. This makes it reliable for BPF schedulers to
> > + track task ownership and maintain accurate accounting, such as per-DSQ
> > + queued runtime sums.
> > +
> > + BPF schedulers can choose not to implement ``ops.dequeue()`` if they
> > + don't need to track these transitions. The sched_ext core will safely
> > + handle all dequeue operations regardless.
> > +
> > 3. When a CPU is ready to schedule, it first looks at its local DSQ. If
> > empty, it then looks at the global DSQ. If there still isn't a task to
> > run, ``ops.dispatch()`` is invoked which can use the following two
> > @@ -319,6 +339,8 @@ by a sched_ext scheduler:
> > /* Any usable CPU becomes available */
> >
> > ops.dispatch(); /* Task is moved to a local DSQ */
> > +
> > + ops.dequeue(); /* Exiting BPF scheduler */
> > }
> > ops.running(); /* Task starts running on its assigned CPU */
> > while (task->scx.slice > 0 && task is runnable)
> > diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h
> > index bcb962d5ee7d8..334c3692a9c62 100644
> > --- a/include/linux/sched/ext.h
> > +++ b/include/linux/sched/ext.h
> > @@ -84,6 +84,7 @@ struct scx_dispatch_q {
> > /* scx_entity.flags */
> > enum scx_ent_flags {
> > SCX_TASK_QUEUED = 1 << 0, /* on ext runqueue */
> > + SCX_TASK_OPS_ENQUEUED = 1 << 1, /* ops.enqueue() was called */
>
> Can we rename this flag? For direct dispatch we never got enqueued.
> Something like "DEQ_ON_DISPATCH" would show the purpose of the
> flag more clearly.
Good point. However, ops.dequeue() isn't only called on dispatch; it can
also be triggered when a task property is changed.
So the flag should represent the "enqueued state" in the sense that
ops.enqueue() has been called and a corresponding ops.dequeue() is
expected. This is a lifecycle state, not an indication that the task is in
any queue.
Would a more descriptive comment clarify this? Something like:
SCX_TASK_OPS_ENQUEUED = 1 << 1, /* Task in enqueued state: ops.enqueue()
called, ops.dequeue() will be called
when task leaves this state. */
>
> > SCX_TASK_RESET_RUNNABLE_AT = 1 << 2, /* runnable_at should be reset */
> > SCX_TASK_DEQD_FOR_SLEEP = 1 << 3, /* last dequeue was for SLEEP */
> >
> > diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> > index 94164f2dec6dc..985d75d374385 100644
> > --- a/kernel/sched/ext.c
> > +++ b/kernel/sched/ext.c
> > @@ -1390,6 +1390,9 @@ static void do_enqueue_task(struct rq *rq, struct task_struct *p, u64 enq_flags,
> > WARN_ON_ONCE(atomic_long_read(&p->scx.ops_state) != SCX_OPSS_NONE);
> > atomic_long_set(&p->scx.ops_state, SCX_OPSS_QUEUEING | qseq);
> >
> > + /* Mark that ops.enqueue() is being called for this task */
> > + p->scx.flags |= SCX_TASK_OPS_ENQUEUED;
> > +
>
> Can we avoid setting this flag when we have no .dequeue() method?
> Otherwise it stays set forever AFAICT, even after the task has been
> sent to the runqueues.
Good catch! We definitely don't need to set this for schedulers that don't
implement ops.dequeue().
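Something along these lines (untested):

```c
	/*
	 * Only track the enqueued state if the scheduler implements
	 * ops.dequeue(), otherwise the flag would never be cleared.
	 */
	if (SCX_HAS_OP(sch, dequeue))
		p->scx.flags |= SCX_TASK_OPS_ENQUEUED;
```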
>
> > ddsp_taskp = this_cpu_ptr(&direct_dispatch_task);
> > WARN_ON_ONCE(*ddsp_taskp);
> > *ddsp_taskp = p;
> > @@ -1522,6 +1525,21 @@ static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags)
> >
> > switch (opss & SCX_OPSS_STATE_MASK) {
> > case SCX_OPSS_NONE:
> > + /*
> > + * Task is not currently being enqueued or queued on the BPF
> > + * scheduler. Check if ops.enqueue() was called for this task.
> > + */
> > + if ((p->scx.flags & SCX_TASK_OPS_ENQUEUED) &&
> > + SCX_HAS_OP(sch, dequeue)) {
> > + /*
> > + * ops.enqueue() was called and the task was dispatched.
> > + * Call ops.dequeue() to notify the BPF scheduler that
> > + * the task is leaving.
> > + */
> > + SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq,
> > + p, deq_flags);
> > + p->scx.flags &= ~SCX_TASK_OPS_ENQUEUED;
> > + }
> > break;
> > case SCX_OPSS_QUEUEING:
> > /*
> > @@ -1530,9 +1548,16 @@ static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags)
> > */
> > BUG();
> > case SCX_OPSS_QUEUED:
> > - if (SCX_HAS_OP(sch, dequeue))
> > + /*
> > + * Task is owned by the BPF scheduler. Call ops.dequeue()
> > + * to notify the BPF scheduler that the task is being
> > + * removed.
> > + */
> > + if (SCX_HAS_OP(sch, dequeue)) {
>
> Edge case, but if we have a .dequeue() method but not an .enqueue() we
> still make this call. Can we add flags & SCX_TASK_OPS_ENQUEUED as an
> extra condition to be consistent with the SCX_OPSS_NONE case above?
Also good catch. Will add that.
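i.e., something like (untested):

```c
	case SCX_OPSS_QUEUED:
		/*
		 * Only notify the BPF scheduler if ops.enqueue() was
		 * actually called for this task.
		 */
		if ((p->scx.flags & SCX_TASK_OPS_ENQUEUED) &&
		    SCX_HAS_OP(sch, dequeue)) {
			SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq,
					 p, deq_flags);
			p->scx.flags &= ~SCX_TASK_OPS_ENQUEUED;
		}
```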
>
> > SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq,
> > p, deq_flags);
> > + p->scx.flags &= ~SCX_TASK_OPS_ENQUEUED;
> > + }
> >
> > if (atomic_long_try_cmpxchg(&p->scx.ops_state, &opss,
> > SCX_OPSS_NONE))
>
Thanks,
-Andrea