Message-ID: <20260121123118.964704-2-arighi@nvidia.com>
Date: Wed, 21 Jan 2026 13:25:30 +0100
From: Andrea Righi <arighi@...dia.com>
To: Tejun Heo <tj@...nel.org>,
David Vernet <void@...ifault.com>,
Changwoo Min <changwoo@...lia.com>
Cc: Emil Tsalapatis <emil@...alapatis.com>,
Daniel Hodges <hodgesd@...a.com>,
sched-ext@...ts.linux.dev,
linux-kernel@...r.kernel.org
Subject: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics
Currently, ops.dequeue() is only invoked when the sched_ext core knows
that a task resides in BPF-managed data structures, so dequeues triggered
by scheduling property changes can be missed. As a result, BPF schedulers
cannot reliably track task state.
In addition, some ops.dequeue() callbacks can be skipped (e.g., during
direct dispatch), so ops.enqueue() calls are not always paired with a
corresponding ops.dequeue(), potentially breaking accounting logic.
Fix this by guaranteeing that every ops.enqueue() is matched with a
corresponding ops.dequeue(), and introduce the SCX_DEQ_ASYNC flag to
distinguish dequeues triggered by scheduling property changes from those
occurring in the normal dispatch workflow.
New semantics:
1. ops.enqueue() is called when a task enters the BPF scheduler
2. ops.dequeue() is called when the task leaves the BPF scheduler,
because it is dispatched to a DSQ (regular workflow)
3. ops.dequeue(SCX_DEQ_ASYNC) is called when the task leaves the BPF
scheduler, because a task property changes (sched_change)
The SCX_DEQ_ASYNC flag allows BPF schedulers to distinguish the regular
dispatch workflow from task property changes (e.g., sched_setaffinity(),
sched_setscheduler(), set_user_nice(), NUMA balancing, CPU migrations,
etc.).
This allows BPF schedulers to:
- reliably track task ownership and lifecycle,
- maintain accurate accounting of enqueue/dequeue pairs,
- update internal state when tasks change properties.
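
For illustration, below is a minimal BPF-side sketch of how a scheduler
might consume the new semantics: it marks a task as enqueued in
ops.enqueue(), clears the mark in ops.dequeue(), and checks SCX_DEQ_ASYNC
to spot property changes. This is an untested example and not part of the
patch: the "sketch" scheduler, its callbacks and the task_ctx bookkeeping
are made up, and it assumes the usual tools/sched_ext BPF helpers
(BPF_STRUCT_OPS(), SCX_OPS_DEFINE(), scx_bpf_dsq_insert(),
bpf_task_storage_get()).

#include <scx/common.bpf.h>

char _license[] SEC("license") = "GPL";

/* Per-task bookkeeping: is the task currently enqueued on the BPF side? */
struct task_ctx {
	bool enqueued;
};

struct {
	__uint(type, BPF_MAP_TYPE_TASK_STORAGE);
	__uint(map_flags, BPF_F_NO_PREALLOC);
	__type(key, int);
	__type(value, struct task_ctx);
} task_ctx_map SEC(".maps");

s32 BPF_STRUCT_OPS(sketch_init_task, struct task_struct *p,
		   struct scx_init_task_args *args)
{
	if (!bpf_task_storage_get(&task_ctx_map, p, 0,
				  BPF_LOCAL_STORAGE_GET_F_CREATE))
		return -ENOMEM;
	return 0;
}

void BPF_STRUCT_OPS(sketch_enqueue, struct task_struct *p, u64 enq_flags)
{
	struct task_ctx *tctx = bpf_task_storage_get(&task_ctx_map, p, 0, 0);

	if (tctx)
		tctx->enqueued = true;

	scx_bpf_dsq_insert(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, enq_flags);
}

void BPF_STRUCT_OPS(sketch_dequeue, struct task_struct *p, u64 deq_flags)
{
	struct task_ctx *tctx = bpf_task_storage_get(&task_ctx_map, p, 0, 0);

	/* Clearing the mark is idempotent, see the note below. */
	if (tctx)
		tctx->enqueued = false;

	if (deq_flags & SCX_DEQ_ASYNC) {
		/*
		 * Property change (affinity, priority, migration, ...):
		 * drop any extra per-task state tied to this enqueue.
		 */
	}
}

SCX_OPS_DEFINE(sketch_ops,
	       .init_task	= (void *)sketch_init_task,
	       .enqueue		= (void *)sketch_enqueue,
	       .dequeue		= (void *)sketch_dequeue,
	       .name		= "sketch");

The per-task mark is deliberately idempotent rather than a bare counter,
so the scheduler stays consistent even if it observes both a dispatch
dequeue and a later SCX_DEQ_ASYNC dequeue for the same task.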
Cc: Tejun Heo <tj@...nel.org>
Cc: Emil Tsalapatis <emil@...alapatis.com>
Signed-off-by: Andrea Righi <arighi@...dia.com>
---
Documentation/scheduler/sched-ext.rst | 33 ++++++++++
include/linux/sched/ext.h | 11 ++++
kernel/sched/ext.c | 63 ++++++++++++++++++-
kernel/sched/ext_internal.h | 6 ++
.../sched_ext/include/scx/enum_defs.autogen.h | 2 +
.../sched_ext/include/scx/enums.autogen.bpf.h | 2 +
tools/sched_ext/include/scx/enums.autogen.h | 1 +
7 files changed, 116 insertions(+), 2 deletions(-)
diff --git a/Documentation/scheduler/sched-ext.rst b/Documentation/scheduler/sched-ext.rst
index 404fe6126a769..960125c1439ab 100644
--- a/Documentation/scheduler/sched-ext.rst
+++ b/Documentation/scheduler/sched-ext.rst
@@ -252,6 +252,37 @@ The following briefly shows how a waking task is scheduled and executed.
* Queue the task on the BPF side.
+ Once ``ops.enqueue()`` is called, the task enters the "enqueued state".
+ The task remains in this state until ``ops.dequeue()`` is called, which
+ happens in two cases:
+
+ 1. **Regular dispatch workflow**: when the task is successfully
+ dispatched to a DSQ (local, global, or user DSQ), ``ops.dequeue()``
+ is triggered immediately to notify the BPF scheduler.
+
+ 2. **Scheduling property change**: when a task property changes (via
+ operations like ``sched_setaffinity()``, ``sched_setscheduler()``,
+ priority changes, CPU migrations, etc.), ``ops.dequeue()`` is called
+ with the ``SCX_DEQ_ASYNC`` flag set in ``deq_flags``.
+
+ **Important**: ``ops.dequeue()`` is called for *any* enqueued task,
+ regardless of whether the task is still in a BPF data structure or
+ has already been dispatched to a DSQ. This guarantees that every
+ ``ops.enqueue()`` will eventually be followed by a corresponding
+ ``ops.dequeue()``.
+
+ The ``SCX_DEQ_ASYNC`` flag allows BPF schedulers to distinguish between:
+ - regular dispatch dequeues (flag not set): the task was dispatched to a DSQ,
+ - asynchronous dequeues (``SCX_DEQ_ASYNC`` set): a task property changed,
+ requiring the scheduler to update its internal state.
+
+ This makes it reliable for BPF schedulers to track the enqueued state
+ and maintain accurate accounting.
+
+ BPF schedulers can choose not to implement ``ops.dequeue()`` if they
+ don't need to track these transitions. The sched_ext core will safely
+ handle all dequeue operations regardless.
+
3. When a CPU is ready to schedule, it first looks at its local DSQ. If
empty, it then looks at the global DSQ. If there still isn't a task to
run, ``ops.dispatch()`` is invoked which can use the following two
@@ -319,6 +350,8 @@ by a sched_ext scheduler:
/* Any usable CPU becomes available */
ops.dispatch(); /* Task is moved to a local DSQ */
+
+ ops.dequeue(); /* Exiting BPF scheduler */
}
ops.running(); /* Task starts running on its assigned CPU */
while (task->scx.slice > 0 && task is runnable)
diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h
index bcb962d5ee7d8..f3094b4a72a56 100644
--- a/include/linux/sched/ext.h
+++ b/include/linux/sched/ext.h
@@ -84,8 +84,19 @@ struct scx_dispatch_q {
/* scx_entity.flags */
enum scx_ent_flags {
SCX_TASK_QUEUED = 1 << 0, /* on ext runqueue */
+ /*
+ * Set when ops.enqueue() is called; used to determine if ops.dequeue()
+ * should be invoked when transitioning out of SCX_OPSS_NONE state.
+ */
+ SCX_TASK_OPS_ENQUEUED = 1 << 1,
SCX_TASK_RESET_RUNNABLE_AT = 1 << 2, /* runnable_at should be reset */
SCX_TASK_DEQD_FOR_SLEEP = 1 << 3, /* last dequeue was for SLEEP */
+ /*
+ * Set when ops.dequeue() is called after successful dispatch; used to
+ * distinguish dispatch dequeues from async dequeues (property changes)
+ * and to prevent duplicate dequeue calls.
+ */
+ SCX_TASK_DISPATCH_DEQUEUED = 1 << 4,
SCX_TASK_STATE_SHIFT = 8, /* bit 8 and 9 are used to carry scx_task_state */
SCX_TASK_STATE_BITS = 2,
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 809f774183202..ac13115c463d2 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -1289,6 +1289,20 @@ static void direct_dispatch(struct scx_sched *sch, struct task_struct *p,
p->scx.ddsp_enq_flags |= enq_flags;
+ /*
+ * The task is about to be dispatched. If ops.enqueue() was called,
+ * notify the BPF scheduler by calling ops.dequeue().
+ *
+ * Keep %SCX_TASK_OPS_ENQUEUED set so that subsequent property
+ * changes can trigger ops.dequeue() with %SCX_DEQ_ASYNC. Mark that
+ * the dispatch dequeue has been called to distinguish from
+ * property change dequeues.
+ */
+ if (SCX_HAS_OP(sch, dequeue) && (p->scx.flags & SCX_TASK_OPS_ENQUEUED)) {
+ SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq, p, 0);
+ p->scx.flags |= SCX_TASK_DISPATCH_DEQUEUED;
+ }
+
/*
* We are in the enqueue path with @rq locked and pinned, and thus can't
* double lock a remote rq and enqueue to its local DSQ. For
@@ -1393,6 +1407,16 @@ static void do_enqueue_task(struct rq *rq, struct task_struct *p, u64 enq_flags,
WARN_ON_ONCE(atomic_long_read(&p->scx.ops_state) != SCX_OPSS_NONE);
atomic_long_set(&p->scx.ops_state, SCX_OPSS_QUEUEING | qseq);
+ /*
+ * Mark that ops.enqueue() is being called for this task.
+ * Clear the dispatch dequeue flag for the new enqueue cycle.
+ * Only track these flags if ops.dequeue() is implemented.
+ */
+ if (SCX_HAS_OP(sch, dequeue)) {
+ p->scx.flags |= SCX_TASK_OPS_ENQUEUED;
+ p->scx.flags &= ~SCX_TASK_DISPATCH_DEQUEUED;
+ }
+
ddsp_taskp = this_cpu_ptr(&direct_dispatch_task);
WARN_ON_ONCE(*ddsp_taskp);
*ddsp_taskp = p;
@@ -1529,6 +1553,17 @@ static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags)
switch (opss & SCX_OPSS_STATE_MASK) {
case SCX_OPSS_NONE:
+ if (SCX_HAS_OP(sch, dequeue) &&
+ p->scx.flags & SCX_TASK_OPS_ENQUEUED) {
+ bool is_async_dequeue =
+ !(deq_flags & (DEQUEUE_SLEEP | SCX_DEQ_CORE_SCHED_EXEC));
+
+ if (is_async_dequeue)
+ SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq,
+ p, deq_flags | SCX_DEQ_ASYNC);
+ p->scx.flags &= ~(SCX_TASK_OPS_ENQUEUED |
+ SCX_TASK_DISPATCH_DEQUEUED);
+ }
break;
case SCX_OPSS_QUEUEING:
/*
@@ -1537,9 +1572,17 @@ static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags)
*/
BUG();
case SCX_OPSS_QUEUED:
- if (SCX_HAS_OP(sch, dequeue))
+ /*
+ * The task is still in the enqueued state, so this is a property
+ * change dequeue before dispatch completes. Notify the BPF
+ * scheduler with the %SCX_DEQ_ASYNC flag set.
+ */
+ if (SCX_HAS_OP(sch, dequeue)) {
SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq,
- p, deq_flags);
+ p, deq_flags | SCX_DEQ_ASYNC);
+ p->scx.flags &= ~(SCX_TASK_OPS_ENQUEUED |
+ SCX_TASK_DISPATCH_DEQUEUED);
+ }
if (atomic_long_try_cmpxchg(&p->scx.ops_state, &opss,
SCX_OPSS_NONE))
@@ -2113,6 +2156,22 @@ static void finish_dispatch(struct scx_sched *sch, struct rq *rq,
BUG_ON(!(p->scx.flags & SCX_TASK_QUEUED));
+ /*
+ * The task is about to be dispatched. If ops.enqueue() was called,
+ * notify the BPF scheduler by calling ops.dequeue().
+ *
+ * Keep %SCX_TASK_OPS_ENQUEUED set so that subsequent property
+ * changes can trigger ops.dequeue() with %SCX_DEQ_ASYNC. Mark that
+ * the dispatch dequeue has been called to distinguish from
+ * property change dequeues.
+ */
+ if (SCX_HAS_OP(sch, dequeue) && (p->scx.flags & SCX_TASK_OPS_ENQUEUED)) {
+ struct rq *task_rq = task_rq(p);
+
+ SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, task_rq, p, 0);
+ p->scx.flags |= SCX_TASK_DISPATCH_DEQUEUED;
+ }
+
dsq = find_dsq_for_dispatch(sch, this_rq(), dsq_id, p);
if (dsq->id == SCX_DSQ_LOCAL)
diff --git a/kernel/sched/ext_internal.h b/kernel/sched/ext_internal.h
index 386c677e4c9a0..068c7c2892a16 100644
--- a/kernel/sched/ext_internal.h
+++ b/kernel/sched/ext_internal.h
@@ -982,6 +982,12 @@ enum scx_deq_flags {
* it hasn't been dispatched yet. Dequeue from the BPF side.
*/
SCX_DEQ_CORE_SCHED_EXEC = 1LLU << 32,
+
+ /*
+ * The task is being dequeued due to an asynchronous event (e.g.,
+ * property change via sched_setaffinity(), priority change, etc.).
+ */
+ SCX_DEQ_ASYNC = 1LLU << 33,
};
enum scx_pick_idle_cpu_flags {
diff --git a/tools/sched_ext/include/scx/enum_defs.autogen.h b/tools/sched_ext/include/scx/enum_defs.autogen.h
index c2c33df9292c2..17d8f4324b856 100644
--- a/tools/sched_ext/include/scx/enum_defs.autogen.h
+++ b/tools/sched_ext/include/scx/enum_defs.autogen.h
@@ -21,6 +21,7 @@
#define HAVE_SCX_CPU_PREEMPT_UNKNOWN
#define HAVE_SCX_DEQ_SLEEP
#define HAVE_SCX_DEQ_CORE_SCHED_EXEC
+#define HAVE_SCX_DEQ_ASYNC
#define HAVE_SCX_DSQ_FLAG_BUILTIN
#define HAVE_SCX_DSQ_FLAG_LOCAL_ON
#define HAVE_SCX_DSQ_INVALID
@@ -48,6 +49,7 @@
#define HAVE_SCX_TASK_QUEUED
#define HAVE_SCX_TASK_RESET_RUNNABLE_AT
#define HAVE_SCX_TASK_DEQD_FOR_SLEEP
+#define HAVE_SCX_TASK_DISPATCH_DEQUEUED
#define HAVE_SCX_TASK_STATE_SHIFT
#define HAVE_SCX_TASK_STATE_BITS
#define HAVE_SCX_TASK_STATE_MASK
diff --git a/tools/sched_ext/include/scx/enums.autogen.bpf.h b/tools/sched_ext/include/scx/enums.autogen.bpf.h
index 2f8002bcc19ad..b3ecd6783d1e5 100644
--- a/tools/sched_ext/include/scx/enums.autogen.bpf.h
+++ b/tools/sched_ext/include/scx/enums.autogen.bpf.h
@@ -127,3 +127,5 @@ const volatile u64 __SCX_ENQ_CLEAR_OPSS __weak;
const volatile u64 __SCX_ENQ_DSQ_PRIQ __weak;
#define SCX_ENQ_DSQ_PRIQ __SCX_ENQ_DSQ_PRIQ
+const volatile u64 __SCX_DEQ_ASYNC __weak;
+#define SCX_DEQ_ASYNC __SCX_DEQ_ASYNC
diff --git a/tools/sched_ext/include/scx/enums.autogen.h b/tools/sched_ext/include/scx/enums.autogen.h
index fedec938584be..89359ab65cd3c 100644
--- a/tools/sched_ext/include/scx/enums.autogen.h
+++ b/tools/sched_ext/include/scx/enums.autogen.h
@@ -46,4 +46,5 @@
SCX_ENUM_SET(skel, scx_enq_flags, SCX_ENQ_LAST); \
SCX_ENUM_SET(skel, scx_enq_flags, SCX_ENQ_CLEAR_OPSS); \
SCX_ENUM_SET(skel, scx_enq_flags, SCX_ENQ_DSQ_PRIQ); \
+ SCX_ENUM_SET(skel, scx_deq_flags, SCX_DEQ_ASYNC); \
} while (0)
--
2.52.0