Message-ID: <20260121123118.964704-2-arighi@nvidia.com>
Date: Wed, 21 Jan 2026 13:25:30 +0100
From: Andrea Righi <arighi@...dia.com>
To: Tejun Heo <tj@...nel.org>,
	David Vernet <void@...ifault.com>,
	Changwoo Min <changwoo@...lia.com>
Cc: Emil Tsalapatis <emil@...alapatis.com>,
	Daniel Hodges <hodgesd@...a.com>,
	sched-ext@...ts.linux.dev,
	linux-kernel@...r.kernel.org
Subject: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics

Currently, ops.dequeue() is only invoked when the sched_ext core knows
that a task resides in BPF-managed data structures, which causes it to
miss scheduling property change scenarios. As a result, BPF schedulers
cannot reliably track task state.

In addition, some ops.dequeue() callbacks can be skipped (e.g., during
direct dispatch), so ops.enqueue() calls are not always paired with a
corresponding ops.dequeue(), potentially breaking accounting logic.

Fix this by guaranteeing that every ops.enqueue() is matched with a
corresponding ops.dequeue(), and introduce the SCX_DEQ_ASYNC flag to
distinguish dequeues triggered by scheduling property changes from those
occurring in the normal dispatch workflow.

New semantics:
1. ops.enqueue() is called when a task enters the BPF scheduler
2. ops.dequeue() is called when the task leaves the BPF scheduler
   because it is dispatched to a DSQ (regular workflow)
3. ops.dequeue(SCX_DEQ_ASYNC) is called when the task leaves the BPF
   scheduler because a task property changes (sched_change)

The SCX_DEQ_ASYNC flag allows BPF schedulers to distinguish between the
regular dispatch workflow and task property changes (e.g.,
sched_setaffinity(), sched_setscheduler(), set_user_nice(), NUMA
balancing, CPU migrations, etc.).
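
For illustration, a minimal BPF-side sketch of how a scheduler might
consume the new flag in its ops.dequeue() callback (not part of this
patch; "example_dequeue"/"example_ops" are made-up names and the snippet
assumes the usual tools/sched_ext headers):

	#include <scx/common.bpf.h>

	char _license[] SEC("license") = "GPL";

	/*
	 * Hypothetical ops.dequeue(): tell regular dispatch dequeues apart
	 * from property-change dequeues by testing SCX_DEQ_ASYNC.
	 */
	void BPF_STRUCT_OPS(example_dequeue, struct task_struct *p, u64 deq_flags)
	{
		if (deq_flags & SCX_DEQ_ASYNC) {
			/* Property change (affinity, priority, migration, ...):
			 * drop any state tied to the current enqueue cycle. */
			bpf_printk("%s[%d]: async dequeue", p->comm, p->pid);
		} else {
			/* Regular workflow: the task has been dispatched to a
			 * DSQ and is leaving the BPF scheduler. */
			bpf_printk("%s[%d]: dispatch dequeue", p->comm, p->pid);
		}
	}

	SCX_OPS_DEFINE(example_ops,
		       .dequeue	= (void *)example_dequeue,
		       .name	= "example");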

This allows BPF schedulers to:
- reliably track task ownership and lifecycle,
- maintain accurate accounting of enqueue/dequeue pairs (see the
  accounting sketch after this list),
- update internal state when tasks change properties.
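
As a usage sketch of the accounting point above (illustrative only, not
part of this patch; "pair_ops", "nr_owned" and friends are hypothetical
names, and the snippet assumes the scx_bpf_dsq_insert() kfunc and task
local storage support available in recent trees), a scheduler can rely on
the enqueue/dequeue pairing to maintain a per-task ownership bit and a
count of tasks it currently owns:

	#include <scx/common.bpf.h>

	char _license[] SEC("license") = "GPL";

	/* Illustrative sketch: count tasks between ops.enqueue() and
	 * ops.dequeue(). */
	s64 nr_owned;	/* tasks currently owned by the BPF scheduler */

	struct task_ctx {
		bool owned;	/* set between ops.enqueue() and ops.dequeue() */
	};

	struct {
		__uint(type, BPF_MAP_TYPE_TASK_STORAGE);
		__uint(map_flags, BPF_F_NO_PREALLOC);
		__type(key, int);
		__type(value, struct task_ctx);
	} task_ctxs SEC(".maps");

	void BPF_STRUCT_OPS(pair_enqueue, struct task_struct *p, u64 enq_flags)
	{
		struct task_ctx *tctx;

		tctx = bpf_task_storage_get(&task_ctxs, p, NULL,
					    BPF_LOCAL_STORAGE_GET_F_CREATE);
		if (tctx && !tctx->owned) {
			tctx->owned = true;
			__sync_fetch_and_add(&nr_owned, 1);
		}

		/* Dispatching to a DSQ also triggers ops.dequeue() (case 2). */
		scx_bpf_dsq_insert(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, enq_flags);
	}

	void BPF_STRUCT_OPS(pair_dequeue, struct task_struct *p, u64 deq_flags)
	{
		struct task_ctx *tctx;

		tctx = bpf_task_storage_get(&task_ctxs, p, NULL, 0);
		if (tctx && tctx->owned) {
			tctx->owned = false;
			__sync_fetch_and_add(&nr_owned, -1);
		}
	}

	SCX_OPS_DEFINE(pair_ops,
		       .enqueue	= (void *)pair_enqueue,
		       .dequeue	= (void *)pair_dequeue,
		       .name	= "pair");

Guarding the counter with the per-task bit keeps the count consistent even
if a task sees more than one dequeue notification per enqueue cycle (e.g.,
a property change after it has already been dispatched).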

Cc: Tejun Heo <tj@...nel.org>
Cc: Emil Tsalapatis <emil@...alapatis.com>
Signed-off-by: Andrea Righi <arighi@...dia.com>
---
 Documentation/scheduler/sched-ext.rst         | 33 ++++++++++
 include/linux/sched/ext.h                     | 11 ++++
 kernel/sched/ext.c                            | 63 ++++++++++++++++++-
 kernel/sched/ext_internal.h                   |  6 ++
 .../sched_ext/include/scx/enum_defs.autogen.h |  2 +
 .../sched_ext/include/scx/enums.autogen.bpf.h |  2 +
 tools/sched_ext/include/scx/enums.autogen.h   |  1 +
 7 files changed, 116 insertions(+), 2 deletions(-)

diff --git a/Documentation/scheduler/sched-ext.rst b/Documentation/scheduler/sched-ext.rst
index 404fe6126a769..960125c1439ab 100644
--- a/Documentation/scheduler/sched-ext.rst
+++ b/Documentation/scheduler/sched-ext.rst
@@ -252,6 +252,37 @@ The following briefly shows how a waking task is scheduled and executed.
 
    * Queue the task on the BPF side.
 
+   Once ``ops.enqueue()`` is called, the task enters the "enqueued state".
+   The task remains in this state until ``ops.dequeue()`` is called, which
+   happens in two cases:
+
+   1. **Regular dispatch workflow**: when the task is successfully
+      dispatched to a DSQ (local, global, or user DSQ), ``ops.dequeue()``
+      is triggered immediately to notify the BPF scheduler.
+
+   2. **Scheduling property change**: when a task property changes (via
+      operations like ``sched_setaffinity()``, ``sched_setscheduler()``,
+      priority changes, CPU migrations, etc.), ``ops.dequeue()`` is called
+      with the ``SCX_DEQ_ASYNC`` flag set in ``deq_flags``.
+
+   **Important**: ``ops.dequeue()`` is called for *any* enqueued task,
+   regardless of whether the task is still on a BPF data structure, or it
+   has already been dispatched to a DSQ. This guarantees that every
+   ``ops.enqueue()`` will eventually be followed by a corresponding
+   ``ops.dequeue()``.
+
+   The ``SCX_DEQ_ASYNC`` flag allows BPF schedulers to distinguish the
+   normal dispatch workflow (task successfully dispatched to a DSQ) from
+   asynchronous dequeues caused by task property changes that require the
+   scheduler to update its internal state.
+
+   This allows BPF schedulers to reliably track the enqueued state and
+   maintain accurate accounting.
+
+   BPF schedulers can choose not to implement ``ops.dequeue()`` if they
+   don't need to track these transitions. The sched_ext core will safely
+   handle all dequeue operations regardless.
+
 3. When a CPU is ready to schedule, it first looks at its local DSQ. If
    empty, it then looks at the global DSQ. If there still isn't a task to
    run, ``ops.dispatch()`` is invoked which can use the following two
@@ -319,6 +350,8 @@ by a sched_ext scheduler:
                 /* Any usable CPU becomes available */
 
                 ops.dispatch(); /* Task is moved to a local DSQ */
+
+                ops.dequeue(); /* Exiting BPF scheduler */
             }
             ops.running();      /* Task starts running on its assigned CPU */
             while (task->scx.slice > 0 && task is runnable)
diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h
index bcb962d5ee7d8..f3094b4a72a56 100644
--- a/include/linux/sched/ext.h
+++ b/include/linux/sched/ext.h
@@ -84,8 +84,19 @@ struct scx_dispatch_q {
 /* scx_entity.flags */
 enum scx_ent_flags {
 	SCX_TASK_QUEUED		= 1 << 0, /* on ext runqueue */
+	/*
+	 * Set when ops.enqueue() is called; used to determine if ops.dequeue()
+	 * should be invoked when transitioning out of SCX_OPSS_NONE state.
+	 */
+	SCX_TASK_OPS_ENQUEUED	= 1 << 1,
 	SCX_TASK_RESET_RUNNABLE_AT = 1 << 2, /* runnable_at should be reset */
 	SCX_TASK_DEQD_FOR_SLEEP	= 1 << 3, /* last dequeue was for SLEEP */
+	/*
+	 * Set when ops.dequeue() is called after successful dispatch; used to
+	 * distinguish dispatch dequeues from async dequeues (property changes)
+	 * and to prevent duplicate dequeue calls.
+	 */
+	SCX_TASK_DISPATCH_DEQUEUED = 1 << 4,
 
 	SCX_TASK_STATE_SHIFT	= 8,	  /* bit 8 and 9 are used to carry scx_task_state */
 	SCX_TASK_STATE_BITS	= 2,
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 809f774183202..ac13115c463d2 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -1289,6 +1289,20 @@ static void direct_dispatch(struct scx_sched *sch, struct task_struct *p,
 
 	p->scx.ddsp_enq_flags |= enq_flags;
 
+	/*
+	 * The task is about to be dispatched. If ops.enqueue() was called,
+	 * notify the BPF scheduler by calling ops.dequeue().
+	 *
+	 * Keep %SCX_TASK_OPS_ENQUEUED set so that subsequent property
+	 * changes can trigger ops.dequeue() with %SCX_DEQ_ASYNC. Mark that
+	 * the dispatch dequeue has been called to distinguish from
+	 * property change dequeues.
+	 */
+	if (SCX_HAS_OP(sch, dequeue) && (p->scx.flags & SCX_TASK_OPS_ENQUEUED)) {
+		SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq, p, 0);
+		p->scx.flags |= SCX_TASK_DISPATCH_DEQUEUED;
+	}
+
 	/*
 	 * We are in the enqueue path with @rq locked and pinned, and thus can't
 	 * double lock a remote rq and enqueue to its local DSQ. For
@@ -1393,6 +1407,16 @@ static void do_enqueue_task(struct rq *rq, struct task_struct *p, u64 enq_flags,
 	WARN_ON_ONCE(atomic_long_read(&p->scx.ops_state) != SCX_OPSS_NONE);
 	atomic_long_set(&p->scx.ops_state, SCX_OPSS_QUEUEING | qseq);
 
+	/*
+	 * Mark that ops.enqueue() is being called for this task.
+	 * Clear the dispatch dequeue flag for the new enqueue cycle.
+	 * Only track these flags if ops.dequeue() is implemented.
+	 */
+	if (SCX_HAS_OP(sch, dequeue)) {
+		p->scx.flags |= SCX_TASK_OPS_ENQUEUED;
+		p->scx.flags &= ~SCX_TASK_DISPATCH_DEQUEUED;
+	}
+
 	ddsp_taskp = this_cpu_ptr(&direct_dispatch_task);
 	WARN_ON_ONCE(*ddsp_taskp);
 	*ddsp_taskp = p;
@@ -1529,6 +1553,17 @@ static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags)
 
 	switch (opss & SCX_OPSS_STATE_MASK) {
 	case SCX_OPSS_NONE:
+		if (SCX_HAS_OP(sch, dequeue) &&
+		    p->scx.flags & SCX_TASK_OPS_ENQUEUED) {
+			bool is_async_dequeue =
+				!(deq_flags & (DEQUEUE_SLEEP | SCX_DEQ_CORE_SCHED_EXEC));
+
+			if (is_async_dequeue)
+				SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq,
+						 p, deq_flags | SCX_DEQ_ASYNC);
+			p->scx.flags &= ~(SCX_TASK_OPS_ENQUEUED |
+					  SCX_TASK_DISPATCH_DEQUEUED);
+		}
 		break;
 	case SCX_OPSS_QUEUEING:
 		/*
@@ -1537,9 +1572,17 @@ static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags)
 		 */
 		BUG();
 	case SCX_OPSS_QUEUED:
-		if (SCX_HAS_OP(sch, dequeue))
+		/*
+		 * Task is in the enqueued state. This is a property change
+		 * dequeue before dispatch completes. Notify the BPF scheduler
+		 * with SCX_DEQ_ASYNC flag.
+		 */
+		if (SCX_HAS_OP(sch, dequeue)) {
 			SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq,
-					 p, deq_flags);
+					 p, deq_flags | SCX_DEQ_ASYNC);
+			p->scx.flags &= ~(SCX_TASK_OPS_ENQUEUED |
+					  SCX_TASK_DISPATCH_DEQUEUED);
+		}
 
 		if (atomic_long_try_cmpxchg(&p->scx.ops_state, &opss,
 					    SCX_OPSS_NONE))
@@ -2113,6 +2156,22 @@ static void finish_dispatch(struct scx_sched *sch, struct rq *rq,
 
 	BUG_ON(!(p->scx.flags & SCX_TASK_QUEUED));
 
+	/*
+	 * The task is about to be dispatched. If ops.enqueue() was called,
+	 * notify the BPF scheduler by calling ops.dequeue().
+	 *
+	 * Keep %SCX_TASK_OPS_ENQUEUED set so that subsequent property
+	 * changes can trigger ops.dequeue() with %SCX_DEQ_ASYNC. Mark that
+	 * the dispatch dequeue has been called to distinguish from
+	 * property change dequeues.
+	 */
+	if (SCX_HAS_OP(sch, dequeue) && (p->scx.flags & SCX_TASK_OPS_ENQUEUED)) {
+		struct rq *task_rq = task_rq(p);
+
+		SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, task_rq, p, 0);
+		p->scx.flags |= SCX_TASK_DISPATCH_DEQUEUED;
+	}
+
 	dsq = find_dsq_for_dispatch(sch, this_rq(), dsq_id, p);
 
 	if (dsq->id == SCX_DSQ_LOCAL)
diff --git a/kernel/sched/ext_internal.h b/kernel/sched/ext_internal.h
index 386c677e4c9a0..068c7c2892a16 100644
--- a/kernel/sched/ext_internal.h
+++ b/kernel/sched/ext_internal.h
@@ -982,6 +982,12 @@ enum scx_deq_flags {
 	 * it hasn't been dispatched yet. Dequeue from the BPF side.
 	 */
 	SCX_DEQ_CORE_SCHED_EXEC	= 1LLU << 32,
+
+	/*
+	 * The task is being dequeued due to an asynchronous event (e.g.,
+	 * property change via sched_setaffinity(), priority change, etc.).
+	 */
+	SCX_DEQ_ASYNC		= 1LLU << 33,
 };
 
 enum scx_pick_idle_cpu_flags {
diff --git a/tools/sched_ext/include/scx/enum_defs.autogen.h b/tools/sched_ext/include/scx/enum_defs.autogen.h
index c2c33df9292c2..17d8f4324b856 100644
--- a/tools/sched_ext/include/scx/enum_defs.autogen.h
+++ b/tools/sched_ext/include/scx/enum_defs.autogen.h
@@ -21,6 +21,7 @@
 #define HAVE_SCX_CPU_PREEMPT_UNKNOWN
 #define HAVE_SCX_DEQ_SLEEP
 #define HAVE_SCX_DEQ_CORE_SCHED_EXEC
+#define HAVE_SCX_DEQ_ASYNC
 #define HAVE_SCX_DSQ_FLAG_BUILTIN
 #define HAVE_SCX_DSQ_FLAG_LOCAL_ON
 #define HAVE_SCX_DSQ_INVALID
@@ -48,6 +49,7 @@
 #define HAVE_SCX_TASK_QUEUED
 #define HAVE_SCX_TASK_RESET_RUNNABLE_AT
 #define HAVE_SCX_TASK_DEQD_FOR_SLEEP
+#define HAVE_SCX_TASK_DISPATCH_DEQUEUED
 #define HAVE_SCX_TASK_STATE_SHIFT
 #define HAVE_SCX_TASK_STATE_BITS
 #define HAVE_SCX_TASK_STATE_MASK
diff --git a/tools/sched_ext/include/scx/enums.autogen.bpf.h b/tools/sched_ext/include/scx/enums.autogen.bpf.h
index 2f8002bcc19ad..b3ecd6783d1e5 100644
--- a/tools/sched_ext/include/scx/enums.autogen.bpf.h
+++ b/tools/sched_ext/include/scx/enums.autogen.bpf.h
@@ -127,3 +127,5 @@ const volatile u64 __SCX_ENQ_CLEAR_OPSS __weak;
 const volatile u64 __SCX_ENQ_DSQ_PRIQ __weak;
 #define SCX_ENQ_DSQ_PRIQ __SCX_ENQ_DSQ_PRIQ
 
+const volatile u64 __SCX_DEQ_ASYNC __weak;
+#define SCX_DEQ_ASYNC __SCX_DEQ_ASYNC
diff --git a/tools/sched_ext/include/scx/enums.autogen.h b/tools/sched_ext/include/scx/enums.autogen.h
index fedec938584be..89359ab65cd3c 100644
--- a/tools/sched_ext/include/scx/enums.autogen.h
+++ b/tools/sched_ext/include/scx/enums.autogen.h
@@ -46,4 +46,5 @@
 	SCX_ENUM_SET(skel, scx_enq_flags, SCX_ENQ_LAST); \
 	SCX_ENUM_SET(skel, scx_enq_flags, SCX_ENQ_CLEAR_OPSS); \
 	SCX_ENUM_SET(skel, scx_enq_flags, SCX_ENQ_DSQ_PRIQ); \
+	SCX_ENUM_SET(skel, scx_deq_flags, SCX_DEQ_ASYNC); \
 } while (0)
-- 
2.52.0

