Message-ID: <20260126084258.3798129-2-arighi@nvidia.com>
Date: Mon, 26 Jan 2026 09:41:49 +0100
From: Andrea Righi <arighi@...dia.com>
To: Tejun Heo <tj@...nel.org>,
David Vernet <void@...ifault.com>,
Changwoo Min <changwoo@...lia.com>
Cc: Kuba Piecuch <jpiecuch@...gle.com>,
Christian Loehle <christian.loehle@....com>,
Daniel Hodges <hodgesd@...a.com>,
sched-ext@...ts.linux.dev,
linux-kernel@...r.kernel.org,
Emil Tsalapatis <emil@...alapatis.com>
Subject: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics
Currently, ops.dequeue() is only invoked when the sched_ext core knows
that a task resides in BPF-managed data structures, so dequeues
triggered by scheduling property changes are missed. As a result, BPF
schedulers cannot reliably track task state.
In addition, some ops.dequeue() callbacks can be skipped (e.g., during
direct dispatch), so ops.enqueue() calls are not always paired with a
corresponding ops.dequeue(), potentially breaking accounting logic.
Fix this by guaranteeing that every ops.enqueue() is matched with a
corresponding ops.dequeue(), and introduce the %SCX_DEQ_SCHED_CHANGE
flag to distinguish dequeues triggered by scheduling property changes
from those occurring in the normal dispatch/execution workflow.
New semantics:
1. ops.enqueue() is called when a task enters the BPF scheduler
2. ops.dequeue() is called when the task leaves the BPF scheduler in
the following cases:
a) regular dispatch workflow: task dispatched to a DSQ,
b) core scheduling pick: core-sched picks task before dispatch,
c) property change: task properties modified.
A new %SCX_DEQ_SCHED_CHANGE flag is also introduced, allowing BPF
schedulers to distinguish between:
- normal dispatch/execution workflow (dispatch, core-sched pick),
- property changes that require state updates (e.g.,
sched_setaffinity(), sched_setscheduler(), set_user_nice(),
NUMA balancing, CPU migrations, etc.).
With this, BPF schedulers can:
- reliably track task ownership and lifecycle,
- maintain accurate accounting of enqueue/dequeue pairs,
- distinguish between execution events and property changes,
- update internal state appropriately for each dequeue type.
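
As an example, the following minimal sketch shows how a BPF scheduler
could consume the new semantics to keep an owned-task counter
consistent. The task_ctx layout, the nr_owned counter, and the
example_* names are illustrative assumptions, not part of this patch:

  #include <scx/common.bpf.h>

  char _license[] SEC("license") = "GPL";

  /* Illustrative counter of tasks currently owned by this scheduler. */
  u64 nr_owned;

  /* Hypothetical per-task state kept in task local storage. */
  struct task_ctx {
          bool enqueued;
  };

  struct {
          __uint(type, BPF_MAP_TYPE_TASK_STORAGE);
          __uint(map_flags, BPF_F_NO_PREALLOC);
          __type(key, int);
          __type(value, struct task_ctx);
  } task_ctxs SEC(".maps");

  void BPF_STRUCT_OPS(example_enqueue, struct task_struct *p, u64 enq_flags)
  {
          struct task_ctx *tctx;

          tctx = bpf_task_storage_get(&task_ctxs, p, NULL,
                                      BPF_LOCAL_STORAGE_GET_F_CREATE);
          if (tctx && !tctx->enqueued) {
                  tctx->enqueued = true;
                  __sync_fetch_and_add(&nr_owned, 1);
          }

          /* Direct dispatch: ops.dequeue() is now guaranteed to follow. */
          scx_bpf_dsq_insert(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, enq_flags);
  }

  void BPF_STRUCT_OPS(example_dequeue, struct task_struct *p, u64 deq_flags)
  {
          struct task_ctx *tctx;

          tctx = bpf_task_storage_get(&task_ctxs, p, NULL, 0);
          if (!tctx)
                  return;

          if (deq_flags & SCX_DEQ_SCHED_CHANGE) {
                  /* Property change: invalidate cached per-task state here. */
          }

          /*
           * A dispatch dequeue and a later property-change dequeue can
           * both arrive for the same enqueue, so only account the first.
           */
          if (tctx->enqueued) {
                  tctx->enqueued = false;
                  __sync_fetch_and_sub(&nr_owned, 1);
          }
  }

  SCX_OPS_DEFINE(example_ops,
                 .enqueue         = (void *)example_enqueue,
                 .dequeue         = (void *)example_dequeue,
                 .name            = "example");

Because an enqueue can be followed by both a dispatch dequeue and a
%SCX_DEQ_SCHED_CHANGE dequeue, the sketch gates the counter update on
tctx->enqueued rather than decrementing unconditionally.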
Cc: Tejun Heo <tj@...nel.org>
Cc: Emil Tsalapatis <emil@...alapatis.com>
Cc: Kuba Piecuch <jpiecuch@...gle.com>
Signed-off-by: Andrea Righi <arighi@...dia.com>
---
Documentation/scheduler/sched-ext.rst | 33 +++++++
include/linux/sched/ext.h | 11 +++
kernel/sched/ext.c | 89 ++++++++++++++++++-
kernel/sched/ext_internal.h | 7 ++
.../sched_ext/include/scx/enum_defs.autogen.h | 2 +
.../sched_ext/include/scx/enums.autogen.bpf.h | 2 +
tools/sched_ext/include/scx/enums.autogen.h | 1 +
7 files changed, 142 insertions(+), 3 deletions(-)
diff --git a/Documentation/scheduler/sched-ext.rst b/Documentation/scheduler/sched-ext.rst
index 404fe6126a769..ed6bf7d9e6e8c 100644
--- a/Documentation/scheduler/sched-ext.rst
+++ b/Documentation/scheduler/sched-ext.rst
@@ -252,6 +252,37 @@ The following briefly shows how a waking task is scheduled and executed.
* Queue the task on the BPF side.
+ Once ``ops.enqueue()`` is called, the task enters the "enqueued state".
+ The task remains in this state until ``ops.dequeue()`` is called, which
+ happens in the following cases:
+
+ 1. **Regular dispatch workflow**: when the task is successfully
+ dispatched to a DSQ (local, global, or user DSQ), ``ops.dequeue()``
+ is triggered immediately to notify the BPF scheduler.
+
+ 2. **Core scheduling pick**: when ``CONFIG_SCHED_CORE`` is enabled and
+ core scheduling picks a task for execution before it has been
+ dispatched, ``ops.dequeue()`` is called with the
+ ``SCX_DEQ_CORE_SCHED_EXEC`` flag.
+
+ 3. **Scheduling property change**: when a task property changes (via
+ operations like ``sched_setaffinity()``, ``sched_setscheduler()``,
+ priority changes, CPU migrations, etc.), ``ops.dequeue()`` is called
+ with the ``SCX_DEQ_SCHED_CHANGE`` flag set in ``deq_flags``.
+
+ **Important**: ``ops.dequeue()`` is called for *any* enqueued task,
+ regardless of whether the task is still in a BPF data structure or has
+ already been dispatched to a DSQ. This guarantees that every
+ ``ops.enqueue()`` will eventually be followed by a corresponding
+ ``ops.dequeue()``.
+
+ This allows BPF schedulers to reliably track the enqueued state and
+ maintain accurate accounting.
+
+ BPF schedulers can choose not to implement ``ops.dequeue()`` if they
+ don't need to track these transitions. The sched_ext core will safely
+ handle all dequeue operations regardless.
+
3. When a CPU is ready to schedule, it first looks at its local DSQ. If
empty, it then looks at the global DSQ. If there still isn't a task to
run, ``ops.dispatch()`` is invoked which can use the following two
@@ -319,6 +350,8 @@ by a sched_ext scheduler:
/* Any usable CPU becomes available */
ops.dispatch(); /* Task is moved to a local DSQ */
+
+ ops.dequeue(); /* Exiting BPF scheduler */
}
ops.running(); /* Task starts running on its assigned CPU */
while (task->scx.slice > 0 && task is runnable)
diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h
index bcb962d5ee7d8..59446cd0373fa 100644
--- a/include/linux/sched/ext.h
+++ b/include/linux/sched/ext.h
@@ -84,8 +84,19 @@ struct scx_dispatch_q {
/* scx_entity.flags */
enum scx_ent_flags {
SCX_TASK_QUEUED = 1 << 0, /* on ext runqueue */
+ /*
+ * Set when ops.enqueue() is called; used to determine if ops.dequeue()
+ * should be invoked when transitioning out of SCX_OPSS_NONE state.
+ */
+ SCX_TASK_OPS_ENQUEUED = 1 << 1,
SCX_TASK_RESET_RUNNABLE_AT = 1 << 2, /* runnable_at should be reset */
SCX_TASK_DEQD_FOR_SLEEP = 1 << 3, /* last dequeue was for SLEEP */
+ /*
+ * Set when ops.dequeue() is called after successful dispatch; used to
+ * distinguish dispatch dequeues from property change dequeues and
+ * prevent duplicate dequeue calls.
+ */
+ SCX_TASK_DISPATCH_DEQUEUED = 1 << 4,
SCX_TASK_STATE_SHIFT = 8, /* bit 8 and 9 are used to carry scx_task_state */
SCX_TASK_STATE_BITS = 2,
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index afe28c04d5aa7..18bca2b83f5c5 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -1287,6 +1287,20 @@ static void direct_dispatch(struct scx_sched *sch, struct task_struct *p,
p->scx.ddsp_enq_flags |= enq_flags;
+ /*
+ * The task is about to be dispatched. If ops.enqueue() was called,
+ * notify the BPF scheduler by calling ops.dequeue().
+ *
+ * Keep %SCX_TASK_OPS_ENQUEUED set so that subsequent property
+ * changes can trigger ops.dequeue() with %SCX_DEQ_SCHED_CHANGE.
+ * Mark that the dispatch dequeue has been called to distinguish
+ * from property change dequeues.
+ */
+ if (SCX_HAS_OP(sch, dequeue) && (p->scx.flags & SCX_TASK_OPS_ENQUEUED)) {
+ SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq, p, 0);
+ p->scx.flags |= SCX_TASK_DISPATCH_DEQUEUED;
+ }
+
/*
* We are in the enqueue path with @rq locked and pinned, and thus can't
* double lock a remote rq and enqueue to its local DSQ. For
@@ -1391,6 +1405,16 @@ static void do_enqueue_task(struct rq *rq, struct task_struct *p, u64 enq_flags,
WARN_ON_ONCE(atomic_long_read(&p->scx.ops_state) != SCX_OPSS_NONE);
atomic_long_set(&p->scx.ops_state, SCX_OPSS_QUEUEING | qseq);
+ /*
+ * Mark that ops.enqueue() is being called for this task.
+ * Clear the dispatch dequeue flag for the new enqueue cycle.
+ * Only track these flags if ops.dequeue() is implemented.
+ */
+ if (SCX_HAS_OP(sch, dequeue)) {
+ p->scx.flags |= SCX_TASK_OPS_ENQUEUED;
+ p->scx.flags &= ~SCX_TASK_DISPATCH_DEQUEUED;
+ }
+
ddsp_taskp = this_cpu_ptr(&direct_dispatch_task);
WARN_ON_ONCE(*ddsp_taskp);
*ddsp_taskp = p;
@@ -1523,6 +1547,34 @@ static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags)
switch (opss & SCX_OPSS_STATE_MASK) {
case SCX_OPSS_NONE:
+ if (SCX_HAS_OP(sch, dequeue) &&
+ p->scx.flags & SCX_TASK_OPS_ENQUEUED) {
+ /*
+ * Task was already dispatched. Only call ops.dequeue()
+ * if it hasn't been called yet (check DISPATCH_DEQUEUED).
+ * This can happen when:
+ * 1. Core-sched picks a task that was dispatched
+ * 2. Property changes occur after dispatch
+ */
+ if (!(p->scx.flags & SCX_TASK_DISPATCH_DEQUEUED)) {
+ /*
+ * ops.dequeue() wasn't called during dispatch.
+ * This shouldn't normally happen, but call it now.
+ */
+ SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq,
+ p, deq_flags);
+ } else if (!(deq_flags & (DEQUEUE_SLEEP | SCX_DEQ_CORE_SCHED_EXEC))) {
+ /*
+ * This is a property change after
+ * dispatch. Call ops.dequeue() again with
+ * %SCX_DEQ_SCHED_CHANGE.
+ */
+ SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq,
+ p, deq_flags | SCX_DEQ_SCHED_CHANGE);
+ }
+ p->scx.flags &= ~(SCX_TASK_OPS_ENQUEUED |
+ SCX_TASK_DISPATCH_DEQUEUED);
+ }
break;
case SCX_OPSS_QUEUEING:
/*
@@ -1531,9 +1583,24 @@ static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags)
*/
BUG();
case SCX_OPSS_QUEUED:
- if (SCX_HAS_OP(sch, dequeue))
- SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq,
- p, deq_flags);
+ /*
+ * Task is still on the BPF scheduler (not dispatched yet).
+ * Call ops.dequeue() to notify. Add %SCX_DEQ_SCHED_CHANGE
+ * only for property changes, not for core-sched picks or sleep.
+ */
+ if (SCX_HAS_OP(sch, dequeue)) {
+ u64 flags = deq_flags;
+ /*
+ * Add %SCX_DEQ_SCHED_CHANGE for property changes,
+ * but not for core-sched picks or sleep.
+ */
+ if (!(deq_flags & (DEQUEUE_SLEEP | SCX_DEQ_CORE_SCHED_EXEC)))
+ flags |= SCX_DEQ_SCHED_CHANGE;
+
+ SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq, p, flags);
+ p->scx.flags &= ~(SCX_TASK_OPS_ENQUEUED |
+ SCX_TASK_DISPATCH_DEQUEUED);
+ }
if (atomic_long_try_cmpxchg(&p->scx.ops_state, &opss,
SCX_OPSS_NONE))
@@ -2107,6 +2174,22 @@ static void finish_dispatch(struct scx_sched *sch, struct rq *rq,
BUG_ON(!(p->scx.flags & SCX_TASK_QUEUED));
+ /*
+ * The task is about to be dispatched. If ops.enqueue() was called,
+ * notify the BPF scheduler by calling ops.dequeue().
+ *
+ * Keep %SCX_TASK_OPS_ENQUEUED set so that subsequent property
+ * changes can trigger ops.dequeue() with %SCX_DEQ_SCHED_CHANGE.
+ * Mark that the dispatch dequeue has been called to distinguish
+ * from property change dequeues.
+ */
+ if (SCX_HAS_OP(sch, dequeue) && (p->scx.flags & SCX_TASK_OPS_ENQUEUED)) {
+ struct rq *task_rq = task_rq(p);
+
+ SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, task_rq, p, 0);
+ p->scx.flags |= SCX_TASK_DISPATCH_DEQUEUED;
+ }
+
dsq = find_dsq_for_dispatch(sch, this_rq(), dsq_id, p);
if (dsq->id == SCX_DSQ_LOCAL)
diff --git a/kernel/sched/ext_internal.h b/kernel/sched/ext_internal.h
index 386c677e4c9a0..befa9a5d6e53f 100644
--- a/kernel/sched/ext_internal.h
+++ b/kernel/sched/ext_internal.h
@@ -982,6 +982,13 @@ enum scx_deq_flags {
* it hasn't been dispatched yet. Dequeue from the BPF side.
*/
SCX_DEQ_CORE_SCHED_EXEC = 1LLU << 32,
+
+ /*
+ * The task is being dequeued due to a property change (e.g.,
+ * sched_setaffinity(), sched_setscheduler(), set_user_nice(),
+ * etc.).
+ */
+ SCX_DEQ_SCHED_CHANGE = 1LLU << 33,
};
enum scx_pick_idle_cpu_flags {
diff --git a/tools/sched_ext/include/scx/enum_defs.autogen.h b/tools/sched_ext/include/scx/enum_defs.autogen.h
index c2c33df9292c2..8284f717ff05e 100644
--- a/tools/sched_ext/include/scx/enum_defs.autogen.h
+++ b/tools/sched_ext/include/scx/enum_defs.autogen.h
@@ -21,6 +21,7 @@
#define HAVE_SCX_CPU_PREEMPT_UNKNOWN
#define HAVE_SCX_DEQ_SLEEP
#define HAVE_SCX_DEQ_CORE_SCHED_EXEC
+#define HAVE_SCX_DEQ_SCHED_CHANGE
#define HAVE_SCX_DSQ_FLAG_BUILTIN
#define HAVE_SCX_DSQ_FLAG_LOCAL_ON
#define HAVE_SCX_DSQ_INVALID
@@ -48,6 +49,7 @@
#define HAVE_SCX_TASK_QUEUED
#define HAVE_SCX_TASK_RESET_RUNNABLE_AT
#define HAVE_SCX_TASK_DEQD_FOR_SLEEP
+#define HAVE_SCX_TASK_DISPATCH_DEQUEUED
#define HAVE_SCX_TASK_STATE_SHIFT
#define HAVE_SCX_TASK_STATE_BITS
#define HAVE_SCX_TASK_STATE_MASK
diff --git a/tools/sched_ext/include/scx/enums.autogen.bpf.h b/tools/sched_ext/include/scx/enums.autogen.bpf.h
index 2f8002bcc19ad..5da50f9376844 100644
--- a/tools/sched_ext/include/scx/enums.autogen.bpf.h
+++ b/tools/sched_ext/include/scx/enums.autogen.bpf.h
@@ -127,3 +127,5 @@ const volatile u64 __SCX_ENQ_CLEAR_OPSS __weak;
const volatile u64 __SCX_ENQ_DSQ_PRIQ __weak;
#define SCX_ENQ_DSQ_PRIQ __SCX_ENQ_DSQ_PRIQ
+const volatile u64 __SCX_DEQ_SCHED_CHANGE __weak;
+#define SCX_DEQ_SCHED_CHANGE __SCX_DEQ_SCHED_CHANGE
diff --git a/tools/sched_ext/include/scx/enums.autogen.h b/tools/sched_ext/include/scx/enums.autogen.h
index fedec938584be..fc9a7a4d9dea5 100644
--- a/tools/sched_ext/include/scx/enums.autogen.h
+++ b/tools/sched_ext/include/scx/enums.autogen.h
@@ -46,4 +46,5 @@
SCX_ENUM_SET(skel, scx_enq_flags, SCX_ENQ_LAST); \
SCX_ENUM_SET(skel, scx_enq_flags, SCX_ENQ_CLEAR_OPSS); \
SCX_ENUM_SET(skel, scx_enq_flags, SCX_ENQ_DSQ_PRIQ); \
+ SCX_ENUM_SET(skel, scx_deq_flags, SCX_DEQ_SCHED_CHANGE); \
} while (0)
--
2.52.0