Message-ID: <20260201091318.178710-2-arighi@nvidia.com>
Date: Sun, 1 Feb 2026 10:08:04 +0100
From: Andrea Righi <arighi@...dia.com>
To: Tejun Heo <tj@...nel.org>,
David Vernet <void@...ifault.com>,
Changwoo Min <changwoo@...lia.com>
Cc: Kuba Piecuch <jpiecuch@...gle.com>,
Emil Tsalapatis <emil@...alapatis.com>,
Christian Loehle <christian.loehle@....com>,
Daniel Hodges <hodgesd@...a.com>,
sched-ext@...ts.linux.dev,
linux-kernel@...r.kernel.org
Subject: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics

Currently, ops.dequeue() is only invoked when the sched_ext core knows
that a task resides in BPF-managed data structures, which causes it to
miss scheduling property change events. In addition, ops.dequeue()
callbacks are completely skipped when tasks are dispatched to non-local
DSQs from ops.select_cpu(). As a result, BPF schedulers cannot reliably
track task state.

Fix this by guaranteeing that each task entering the BPF scheduler's
custody triggers exactly one ops.dequeue() call when it leaves that
custody, whether the exit is due to a dispatch (regular or via a core
scheduling pick) or to a scheduling property change (e.g.
sched_setaffinity(), sched_setscheduler(), set_user_nice(), NUMA
balancing, etc.).

BPF scheduler custody concept: a task is considered to be in "BPF
scheduler's custody" when it has been queued in BPF-managed data
structures and the BPF scheduler is responsible for its lifecycle.
Custody ends when the task is dispatched to a local DSQ, selected by
core scheduling, or removed due to a property change.

Tasks directly dispatched to local DSQs (via %SCX_DSQ_LOCAL or
%SCX_DSQ_LOCAL_ON) bypass the BPF scheduler entirely and are not in its
custody. As a result, ops.dequeue() is not invoked for these tasks.

To identify dequeues triggered by scheduling property changes, introduce
the new ops.dequeue() flag %SCX_DEQ_SCHED_CHANGE: when this flag is set,
the dequeue was caused by a scheduling property change.

New ops.dequeue() semantics:

 - ops.dequeue() is invoked exactly once when the task leaves the BPF
   scheduler's custody, in one of the following cases:

   a) regular dispatch: the task was dispatched to a non-local DSQ
      (global or user DSQ); ops.dequeue() is called without any special
      flags set

   b) core scheduling dispatch: core-sched picks the task before
      dispatch; ops.dequeue() is called with the
      %SCX_DEQ_CORE_SCHED_EXEC flag set

   c) property change: task properties are modified before dispatch;
      ops.dequeue() is called with the %SCX_DEQ_SCHED_CHANGE flag set

This allows BPF schedulers to:

 - reliably track task ownership and lifecycle,
 - maintain accurate accounting of managed tasks (see the sketch below),
 - update internal state when tasks change properties.
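
A minimal sketch of such accounting (illustrative only, not part of
this patch; example_enqueue/example_dequeue, nr_queued and MY_DSQ are
hypothetical, reusing the assumptions of the earlier snippet):

  s64 nr_queued;        /* tasks currently in the BPF scheduler's custody */

  void BPF_STRUCT_OPS(example_enqueue, struct task_struct *p, u64 enq_flags)
  {
          /* Queue on a non-local DSQ: the task enters custody. */
          scx_bpf_dsq_insert(p, MY_DSQ, SCX_SLICE_DFL, enq_flags);
          __sync_fetch_and_add(&nr_queued, 1);
  }

  void BPF_STRUCT_OPS(example_dequeue, struct task_struct *p, u64 deq_flags)
  {
          /* Exactly one call per custody period, so the count stays exact. */
          __sync_fetch_and_add(&nr_queued, -1);

          if (deq_flags & SCX_DEQ_SCHED_CHANGE) {
                  /*
                   * Affinity/priority/etc. changed while the task was
                   * still queued: invalidate any per-task state derived
                   * from the old properties.
                   */
          }
  }

Note that a scheduler which also direct-dispatches to a non-local DSQ
from ops.select_cpu() would need a matching increment on that path,
since ops.enqueue() is skipped there.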
Cc: Tejun Heo <tj@...nel.org>
Cc: Emil Tsalapatis <emil@...alapatis.com>
Cc: Kuba Piecuch <jpiecuch@...gle.com>
Signed-off-by: Andrea Righi <arighi@...dia.com>
---
 Documentation/scheduler/sched-ext.rst          |  76 +++++++++
 include/linux/sched/ext.h                      |   1 +
 kernel/sched/ext.c                             | 168 ++++++++++++++++++-
 kernel/sched/ext_internal.h                    |   7 +
 .../sched_ext/include/scx/enum_defs.autogen.h  |   1 +
 .../sched_ext/include/scx/enums.autogen.bpf.h  |   2 +
 tools/sched_ext/include/scx/enums.autogen.h    |   1 +
 7 files changed, 253 insertions(+), 3 deletions(-)

diff --git a/Documentation/scheduler/sched-ext.rst b/Documentation/scheduler/sched-ext.rst
index 404fe6126a769..6d9e82e6ca9d4 100644
--- a/Documentation/scheduler/sched-ext.rst
+++ b/Documentation/scheduler/sched-ext.rst
@@ -252,6 +252,80 @@ The following briefly shows how a waking task is scheduled and executed.
* Queue the task on the BPF side.
+ **Task State Tracking and ops.dequeue() Semantics**
+
+ Once ``ops.select_cpu()`` or ``ops.enqueue()`` is called, the task may
+ enter the "BPF scheduler's custody" depending on where it's dispatched:
+
+ * **Direct dispatch to local DSQs** (``SCX_DSQ_LOCAL`` or
+ ``SCX_DSQ_LOCAL_ON | cpu``): The task bypasses the BPF scheduler
+ entirely and goes straight to the CPU's local run queue. The task
+ never enters BPF custody, and ``ops.dequeue()`` will not be called.
+
+ * **Dispatch to non-local DSQs** (``SCX_DSQ_GLOBAL`` or custom DSQs):
+ the task enters the BPF scheduler's custody. When the task later
+ leaves BPF custody (dispatched to a local DSQ, picked by core-sched,
+ or dequeued for sleep/property changes), ``ops.dequeue()`` will be
+ called exactly once.
+
+ * **Queued on BPF side**: The task is in BPF data structures and in BPF
+ custody; ``ops.dequeue()`` will be called when it leaves.
+
+ The key principle: **ops.dequeue() is called when a task leaves the BPF
+ scheduler's custody**. A task is in BPF custody if it's on a non-local
+ DSQ or in BPF data structures. Once dispatched to a local DSQ or after
+ ops.dequeue() is called, the task is out of BPF custody and the BPF
+ scheduler no longer needs to track it.
+
+ This works correctly with the ``ops.select_cpu()`` direct dispatch
+ optimization: even though it skips ``ops.enqueue()`` invocation, if the
+ task is dispatched to a non-local DSQ, it enters BPF custody and will
+ get ``ops.dequeue()`` when it leaves. This provides the performance
+ benefit of avoiding the ``ops.enqueue()`` roundtrip while maintaining
+ correct state tracking.
+
+ The dequeue can happen for different reasons, distinguished by flags:
+
+ 1. **Regular dispatch workflow**: when the task is dispatched from a
+ non-local DSQ to a local DSQ (leaving BPF custody for execution),
+ ``ops.dequeue()`` is triggered without any special flags.
+
+ 2. **Core scheduling pick**: when ``CONFIG_SCHED_CORE`` is enabled and
+ core scheduling picks a task for execution while it's still in BPF
+ custody, ``ops.dequeue()`` is called with the
+ ``SCX_DEQ_CORE_SCHED_EXEC`` flag.
+
+ 3. **Scheduling property change**: when a task property changes (via
+ operations like ``sched_setaffinity()``, ``sched_setscheduler()``,
+ priority changes, CPU migrations, etc.) while the task is still in
+ BPF custody, ``ops.dequeue()`` is called with the
+ ``SCX_DEQ_SCHED_CHANGE`` flag set in ``deq_flags``.
+
+ **Important**: Once a task has left BPF custody (dispatched to local
+ DSQ), property changes will not trigger ``ops.dequeue()``, since the
+ task is no longer being managed by the BPF scheduler.
+
+ **Property Change Notifications for Running Tasks**:
+
+ For tasks that have left BPF custody (running or on local DSQs),
+ property changes can be intercepted through the dedicated callbacks:
+
+ * ``ops.set_cpumask()``: Called when a task's CPU affinity changes
+ (e.g., via ``sched_setaffinity()``). This callback is invoked for
+ all tasks regardless of their state or BPF custody.
+
+ * ``ops.set_weight()``: Called when a task's scheduling weight/priority
+ changes (e.g., via ``sched_setscheduler()`` or ``set_user_nice()``).
+ This callback is also invoked for all tasks.
+
+ These callbacks provide complete coverage for property changes,
+ complementing ``ops.dequeue()`` which only applies to tasks in BPF
+ custody.
+
+ BPF schedulers can choose not to implement ``ops.dequeue()`` if they
+ don't need to track these transitions. The sched_ext core will safely
+ handle all dequeue operations regardless.
+
3. When a CPU is ready to schedule, it first looks at its local DSQ. If
empty, it then looks at the global DSQ. If there still isn't a task to
run, ``ops.dispatch()`` is invoked which can use the following two
@@ -319,6 +393,8 @@ by a sched_ext scheduler:
/* Any usable CPU becomes available */
ops.dispatch(); /* Task is moved to a local DSQ */
+
+ ops.dequeue(); /* Exiting BPF scheduler */
}
ops.running(); /* Task starts running on its assigned CPU */
while (task->scx.slice > 0 && task is runnable)
diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h
index bcb962d5ee7d8..0d003d2845393 100644
--- a/include/linux/sched/ext.h
+++ b/include/linux/sched/ext.h
@@ -84,6 +84,7 @@ struct scx_dispatch_q {
/* scx_entity.flags */
enum scx_ent_flags {
SCX_TASK_QUEUED = 1 << 0, /* on ext runqueue */
+ SCX_TASK_OPS_ENQUEUED = 1 << 1, /* under ext scheduler's custody */
SCX_TASK_RESET_RUNNABLE_AT = 1 << 2, /* runnable_at should be reset */
SCX_TASK_DEQD_FOR_SLEEP = 1 << 3, /* last dequeue was for SLEEP */
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index afe28c04d5aa7..6d6f1253039d8 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -924,6 +924,19 @@ static void touch_core_sched(struct rq *rq, struct task_struct *p)
#endif
}
+/**
+ * is_local_dsq - Check if a DSQ ID represents a local DSQ
+ * @dsq_id: DSQ ID to check
+ *
+ * Returns true if @dsq_id is a local DSQ, false otherwise. Local DSQs are
+ * per-CPU queues where tasks go directly to execution.
+ */
+static inline bool is_local_dsq(u64 dsq_id)
+{
+ return dsq_id == SCX_DSQ_LOCAL ||
+ (dsq_id & SCX_DSQ_LOCAL_ON) == SCX_DSQ_LOCAL_ON;
+}
+
/**
* touch_core_sched_dispatch - Update core-sched timestamp on dispatch
* @rq: rq to read clock from, must be locked
@@ -1274,6 +1287,24 @@ static void mark_direct_dispatch(struct scx_sched *sch,
p->scx.ddsp_dsq_id = dsq_id;
p->scx.ddsp_enq_flags = enq_flags;
+
+ /*
+ * Mark the task as entering BPF scheduler's custody if it's being
+ * dispatched to a non-local DSQ. This handles the case where
+ * ops.select_cpu() directly dispatches to a non-local DSQ - even
+ * though ops.enqueue() won't be called, the task enters BPF
+ * custody and should get ops.dequeue() when it leaves.
+ *
+ * For local DSQs, clear the flag, since the task bypasses the BPF
+ * scheduler entirely. This also clears any flag that was set by
+ * do_enqueue_task() before we knew the dispatch destination.
+ */
+ if (SCX_HAS_OP(sch, dequeue)) {
+ if (!is_local_dsq(dsq_id))
+ p->scx.flags |= SCX_TASK_OPS_ENQUEUED;
+ else
+ p->scx.flags &= ~SCX_TASK_OPS_ENQUEUED;
+ }
}
static void direct_dispatch(struct scx_sched *sch, struct task_struct *p,
@@ -1287,6 +1318,40 @@ static void direct_dispatch(struct scx_sched *sch, struct task_struct *p,
p->scx.ddsp_enq_flags |= enq_flags;
+ /*
+ * The task is about to be dispatched, handle ops.dequeue() based
+ * on where the task is going.
+ *
+ * Key principle: ops.dequeue() is called when a task leaves the
+ * BPF scheduler's custody. A task is in BPF custody if it's on a
+ * non-local DSQ or in BPF data structures. Once dispatched to a
+ * local DSQ, it's out of BPF custody.
+ *
+ * Direct dispatch to local DSQs: task never enters BPF scheduler's
+ * custody, it goes straight to the CPU. Don't call ops.dequeue()
+ * and clear the flag so future property changes also won't trigger
+ * it.
+ *
+ * Direct dispatch to non-local DSQs: task enters BPF scheduler's
+ * custody. Mark the task as in BPF custody so that when it's later
+ * dispatched to a local DSQ or dequeued for property changes,
+ * ops.dequeue() will be called.
+ *
+ * This also handles the ops.select_cpu() direct dispatch to
+ * non-local DSQs: the shortcut skips ops.enqueue() invocation but
+ * the task still enters BPF custody if dispatched to a non-local
+ * DSQ, and thus needs ops.dequeue() when it leaves.
+ */
+ if (SCX_HAS_OP(sch, dequeue)) {
+ if (!is_local_dsq(dsq->id)) {
+ p->scx.flags |= SCX_TASK_OPS_ENQUEUED;
+ } else {
+ if (p->scx.flags & SCX_TASK_OPS_ENQUEUED)
+ SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq, p, 0);
+ p->scx.flags &= ~SCX_TASK_OPS_ENQUEUED;
+ }
+ }
+
/*
* We are in the enqueue path with @rq locked and pinned, and thus can't
* double lock a remote rq and enqueue to its local DSQ. For
@@ -1391,6 +1456,21 @@ static void do_enqueue_task(struct rq *rq, struct task_struct *p, u64 enq_flags,
WARN_ON_ONCE(atomic_long_read(&p->scx.ops_state) != SCX_OPSS_NONE);
atomic_long_set(&p->scx.ops_state, SCX_OPSS_QUEUEING | qseq);
+ /*
+ * Mark that ops.enqueue() is being called for this task. This
+ * indicates the task is entering the BPF scheduler's data
+ * structures (QUEUED state).
+ *
+ * However, if the task was already marked as in BPF custody by
+ * mark_direct_dispatch() (ops.select_cpu() direct dispatch to
+ * non-local DSQ), don't clear that - keep the flag set so
+ * ops.dequeue() will be called when appropriate.
+ *
+ * Only track this flag if ops.dequeue() is implemented.
+ */
+ if (SCX_HAS_OP(sch, dequeue))
+ p->scx.flags |= SCX_TASK_OPS_ENQUEUED;
+
ddsp_taskp = this_cpu_ptr(&direct_dispatch_task);
WARN_ON_ONCE(*ddsp_taskp);
*ddsp_taskp = p;
@@ -1523,6 +1603,30 @@ static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags)
switch (opss & SCX_OPSS_STATE_MASK) {
case SCX_OPSS_NONE:
+ /*
+ * Task is not in BPF data structures (either dispatched to
+ * a DSQ or running). Only call ops.dequeue() if the task
+ * is still in BPF scheduler's custody
+ * (%SCX_TASK_OPS_ENQUEUED is set).
+ *
+ * If the task has already been dispatched to a local DSQ
+ * (left BPF custody), the flag will be clear and we skip
+ * ops.dequeue()
+ *
+ * If this is a property change (not sleep/core-sched) and
+ * the task is still in BPF custody, set the
+ * %SCX_DEQ_SCHED_CHANGE flag.
+ */
+ if (SCX_HAS_OP(sch, dequeue) &&
+ p->scx.flags & SCX_TASK_OPS_ENQUEUED) {
+ u64 flags = deq_flags;
+
+ if (!(deq_flags & (DEQUEUE_SLEEP | SCX_DEQ_CORE_SCHED_EXEC)))
+ flags |= SCX_DEQ_SCHED_CHANGE;
+
+ SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq, p, flags);
+ p->scx.flags &= ~SCX_TASK_OPS_ENQUEUED;
+ }
break;
case SCX_OPSS_QUEUEING:
/*
@@ -1531,9 +1635,24 @@ static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags)
*/
BUG();
case SCX_OPSS_QUEUED:
- if (SCX_HAS_OP(sch, dequeue))
- SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq,
- p, deq_flags);
+ /*
+ * Task is still on the BPF scheduler (not dispatched yet).
+ * Call ops.dequeue() to notify. Add %SCX_DEQ_SCHED_CHANGE
+ * only for property changes, not for core-sched picks or
+ * sleep.
+ *
+ * Clear the flag after calling ops.dequeue(): the task is
+ * leaving BPF scheduler's custody.
+ */
+ if (SCX_HAS_OP(sch, dequeue)) {
+ u64 flags = deq_flags;
+
+ if (!(deq_flags & (DEQUEUE_SLEEP | SCX_DEQ_CORE_SCHED_EXEC)))
+ flags |= SCX_DEQ_SCHED_CHANGE;
+
+ SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq, p, flags);
+ p->scx.flags &= ~SCX_TASK_OPS_ENQUEUED;
+ }
if (atomic_long_try_cmpxchg(&p->scx.ops_state, &opss,
SCX_OPSS_NONE))
@@ -1630,6 +1749,7 @@ static void move_local_task_to_local_dsq(struct task_struct *p, u64 enq_flags,
struct scx_dispatch_q *src_dsq,
struct rq *dst_rq)
{
+ struct scx_sched *sch = scx_root;
struct scx_dispatch_q *dst_dsq = &dst_rq->scx.local_dsq;
/* @dsq is locked and @p is on @dst_rq */
@@ -1638,6 +1758,15 @@ static void move_local_task_to_local_dsq(struct task_struct *p, u64 enq_flags,
WARN_ON_ONCE(p->scx.holding_cpu >= 0);
+ /*
+ * Task is moving from a non-local DSQ to a local DSQ. Call
+ * ops.dequeue() if the task was in BPF custody.
+ */
+ if (SCX_HAS_OP(sch, dequeue) && (p->scx.flags & SCX_TASK_OPS_ENQUEUED)) {
+ SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, dst_rq, p, 0);
+ p->scx.flags &= ~SCX_TASK_OPS_ENQUEUED;
+ }
+
if (enq_flags & (SCX_ENQ_HEAD | SCX_ENQ_PREEMPT))
list_add(&p->scx.dsq_list.node, &dst_dsq->list);
else
@@ -2107,6 +2236,24 @@ static void finish_dispatch(struct scx_sched *sch, struct rq *rq,
BUG_ON(!(p->scx.flags & SCX_TASK_QUEUED));
+ /*
+ * Direct dispatch to local DSQs: call ops.dequeue() if task was in
+ * BPF custody, then clear the %SCX_TASK_OPS_ENQUEUED flag.
+ *
+ * Dispatch to non-local DSQs: task is in BPF scheduler's custody.
+ * Mark it so ops.dequeue() will be called when it leaves.
+ */
+ if (SCX_HAS_OP(sch, dequeue)) {
+ if (!is_local_dsq(dsq_id)) {
+ p->scx.flags |= SCX_TASK_OPS_ENQUEUED;
+ } else {
+ if (p->scx.flags & SCX_TASK_OPS_ENQUEUED)
+ SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, task_rq(p), p, 0);
+
+ p->scx.flags &= ~SCX_TASK_OPS_ENQUEUED;
+ }
+ }
+
dsq = find_dsq_for_dispatch(sch, this_rq(), dsq_id, p);
if (dsq->id == SCX_DSQ_LOCAL)
@@ -2894,6 +3041,14 @@ static void scx_enable_task(struct task_struct *p)
lockdep_assert_rq_held(rq);
+ /*
+ * Clear enqueue/dequeue tracking flags when enabling the task.
+ * This ensures a clean state when the task enters SCX. Only needed
+ * if ops.dequeue() is implemented.
+ */
+ if (SCX_HAS_OP(sch, dequeue))
+ p->scx.flags &= ~SCX_TASK_OPS_ENQUEUED;
+
/*
* Set the weight before calling ops.enable() so that the scheduler
* doesn't see a stale value if they inspect the task struct.
@@ -2925,6 +3080,13 @@ static void scx_disable_task(struct task_struct *p)
if (SCX_HAS_OP(sch, disable))
SCX_CALL_OP_TASK(sch, SCX_KF_REST, disable, rq, p);
scx_set_task_state(p, SCX_TASK_READY);
+
+ /*
+ * Clear enqueue/dequeue tracking flags when disabling the task.
+ * Only needed if ops.dequeue() is implemented.
+ */
+ if (SCX_HAS_OP(sch, dequeue))
+ p->scx.flags &= ~SCX_TASK_OPS_ENQUEUED;
}
static void scx_exit_task(struct task_struct *p)
diff --git a/kernel/sched/ext_internal.h b/kernel/sched/ext_internal.h
index 386c677e4c9a0..befa9a5d6e53f 100644
--- a/kernel/sched/ext_internal.h
+++ b/kernel/sched/ext_internal.h
@@ -982,6 +982,13 @@ enum scx_deq_flags {
* it hasn't been dispatched yet. Dequeue from the BPF side.
*/
SCX_DEQ_CORE_SCHED_EXEC = 1LLU << 32,
+
+ /*
+ * The task is being dequeued due to a property change (e.g.,
+ * sched_setaffinity(), sched_setscheduler(), set_user_nice(),
+ * etc.).
+ */
+ SCX_DEQ_SCHED_CHANGE = 1LLU << 33,
};
enum scx_pick_idle_cpu_flags {
diff --git a/tools/sched_ext/include/scx/enum_defs.autogen.h b/tools/sched_ext/include/scx/enum_defs.autogen.h
index c2c33df9292c2..dcc945304760f 100644
--- a/tools/sched_ext/include/scx/enum_defs.autogen.h
+++ b/tools/sched_ext/include/scx/enum_defs.autogen.h
@@ -21,6 +21,7 @@
#define HAVE_SCX_CPU_PREEMPT_UNKNOWN
#define HAVE_SCX_DEQ_SLEEP
#define HAVE_SCX_DEQ_CORE_SCHED_EXEC
+#define HAVE_SCX_DEQ_SCHED_CHANGE
#define HAVE_SCX_DSQ_FLAG_BUILTIN
#define HAVE_SCX_DSQ_FLAG_LOCAL_ON
#define HAVE_SCX_DSQ_INVALID
diff --git a/tools/sched_ext/include/scx/enums.autogen.bpf.h b/tools/sched_ext/include/scx/enums.autogen.bpf.h
index 2f8002bcc19ad..5da50f9376844 100644
--- a/tools/sched_ext/include/scx/enums.autogen.bpf.h
+++ b/tools/sched_ext/include/scx/enums.autogen.bpf.h
@@ -127,3 +127,5 @@ const volatile u64 __SCX_ENQ_CLEAR_OPSS __weak;
const volatile u64 __SCX_ENQ_DSQ_PRIQ __weak;
#define SCX_ENQ_DSQ_PRIQ __SCX_ENQ_DSQ_PRIQ
+const volatile u64 __SCX_DEQ_SCHED_CHANGE __weak;
+#define SCX_DEQ_SCHED_CHANGE __SCX_DEQ_SCHED_CHANGE
diff --git a/tools/sched_ext/include/scx/enums.autogen.h b/tools/sched_ext/include/scx/enums.autogen.h
index fedec938584be..fc9a7a4d9dea5 100644
--- a/tools/sched_ext/include/scx/enums.autogen.h
+++ b/tools/sched_ext/include/scx/enums.autogen.h
@@ -46,4 +46,5 @@
SCX_ENUM_SET(skel, scx_enq_flags, SCX_ENQ_LAST); \
SCX_ENUM_SET(skel, scx_enq_flags, SCX_ENQ_CLEAR_OPSS); \
SCX_ENUM_SET(skel, scx_enq_flags, SCX_ENQ_DSQ_PRIQ); \
+ SCX_ENUM_SET(skel, scx_deq_flags, SCX_DEQ_SCHED_CHANGE); \
} while (0)
--
2.52.0