Message-ID: <aYPE0GRJEu5gdBVl@slm.duckdns.org>
Date: Wed, 4 Feb 2026 12:14:40 -1000
From: Tejun Heo <tj@...nel.org>
To: Andrea Righi <arighi@...dia.com>
Cc: David Vernet <void@...ifault.com>, Changwoo Min <changwoo@...lia.com>,
Kuba Piecuch <jpiecuch@...gle.com>,
Emil Tsalapatis <emil@...alapatis.com>,
Christian Loehle <christian.loehle@....com>,
Daniel Hodges <hodgesd@...a.com>, sched-ext@...ts.linux.dev,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics
Hello,
On Wed, Feb 04, 2026 at 05:05:58PM +0100, Andrea Righi wrote:
> Currently, ops.dequeue() is only invoked when the sched_ext core knows
> that a task resides in BPF-managed data structures, which causes it to
> miss scheduling property change events. In addition, ops.dequeue()
> callbacks are completely skipped when tasks are dispatched to non-local
> DSQs from ops.select_cpu(). As a result, BPF schedulers cannot reliably
> track task state.
>
> Fix this by guaranteeing that each task entering the BPF scheduler's
> custody triggers exactly one ops.dequeue() call when it leaves that
> custody, whether the exit is due to a dispatch (regular or via a core
> scheduling pick) or to a scheduling property change (e.g.
> sched_setaffinity(), sched_setscheduler(), set_user_nice(), NUMA
> balancing, etc.).
>
> BPF scheduler custody concept: a task is considered to be in "BPF
> scheduler's custody" when it has been queued in user-created DSQs and
> the BPF scheduler is responsible for its lifecycle. Custody ends when
> the task is dispatched to a terminal DSQ (local DSQ or SCX_DSQ_GLOBAL),
> selected by core scheduling, or removed due to a property change.
>
> Tasks directly dispatched to terminal DSQs bypass the BPF scheduler
> entirely and are not in its custody. Terminal DSQs include:
> - Local DSQs (%SCX_DSQ_LOCAL or %SCX_DSQ_LOCAL_ON): per-CPU queues
> where tasks go directly to execution.
> - Global DSQ (%SCX_DSQ_GLOBAL): the built-in fallback queue where the
> BPF scheduler is considered "done" with the task.
>
> As a result, ops.dequeue() is not invoked for tasks dispatched to
> terminal DSQs, as the BPF scheduler no longer retains custody of them.
>
> To identify dequeues triggered by scheduling property changes, introduce
> the new ops.dequeue() flag %SCX_DEQ_SCHED_CHANGE: when this flag is set,
> the dequeue was caused by a scheduling property change.
>
...
> + **Property Change Notifications for Running Tasks**:
> +
> + For tasks that have left BPF custody (running or on terminal DSQs),
> + property changes can be intercepted through the dedicated callbacks:
I'm not sure this section is necessary. The way it's phrased makes it sound
like schedulers would use DEQ_SCHED_CHANGE to process property changes but
that's not the case. Relevant property changes will be notified in whatever
ways they're notified and a task being dequeued for SCHED_CHANGE doesn't
necessarily mean there will be an associated property change event either.
e.g. we don't do anything on sched_setnuma().
> @@ -1102,6 +1122,18 @@ static void dispatch_enqueue(struct scx_sched *sch, struct scx_dispatch_q *dsq,
> dsq_mod_nr(dsq, 1);
> p->scx.dsq = dsq;
>
> + /*
> + * Mark task as in BPF scheduler's custody if being queued to a
> + * non-builtin (user) DSQ. Builtin DSQs (local, global, bypass) are
> + * terminal: tasks on them have left BPF custody.
> + *
> + * Don't touch the flag if already set (e.g., by
> + * mark_direct_dispatch() or direct_dispatch()/finish_dispatch()
> + * for user DSQs).
> + */
> + if (SCX_HAS_OP(sch, dequeue) && !(dsq->id & SCX_DSQ_FLAG_BUILTIN))
> + p->scx.flags |= SCX_TASK_OPS_ENQUEUED;
given that this is tied to dequeue, maybe a more direct name would be less
confusing? e.g. something like SCX_TASK_NEED_DEQ?
> @@ -1274,6 +1306,24 @@ static void mark_direct_dispatch(struct scx_sched *sch,
>
> p->scx.ddsp_dsq_id = dsq_id;
> p->scx.ddsp_enq_flags = enq_flags;
> +
> + /*
> + * Mark the task as entering BPF scheduler's custody if it's being
> + * dispatched to a non-terminal DSQ (i.e., custom user DSQs). This
> + * handles the case where ops.select_cpu() directly dispatches - even
> + * though ops.enqueue() won't be called, the task enters BPF custody
> + * if dispatched to a user DSQ and should get ops.dequeue() when it
> + * leaves.
> + *
> + * For terminal DSQs (local DSQs and SCX_DSQ_GLOBAL), ensure the flag
> + * is clear since the BPF scheduler is done with the task.
> + */
> + if (SCX_HAS_OP(sch, dequeue)) {
> + if (!is_terminal_dsq(dsq_id))
> + p->scx.flags |= SCX_TASK_OPS_ENQUEUED;
> + else
> + p->scx.flags &= ~SCX_TASK_OPS_ENQUEUED;
> + }
Hmm... I'm a bit confused on why this needs to be in mark_direct_dispatch()
AND dispatch_enqueue(). The flag should be clear when off SCX. The only
places where it could be set is from the enqueue path - when a task is
direct dispatched to a non-terminal DSQ or BPF. Both cases can be reliably
captured in do_enqueue_task(), no?
> static void direct_dispatch(struct scx_sched *sch, struct task_struct *p,
> @@ -1287,6 +1337,41 @@ static void direct_dispatch(struct scx_sched *sch, struct task_struct *p,
...
> + if (SCX_HAS_OP(sch, dequeue)) {
> + if (!is_terminal_dsq(dsq->id)) {
> + p->scx.flags |= SCX_TASK_OPS_ENQUEUED;
> + } else {
> + if (p->scx.flags & SCX_TASK_OPS_ENQUEUED)
> + SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq, p, 0);
> + p->scx.flags &= ~SCX_TASK_OPS_ENQUEUED;
> + }
> + }
And when would direct_dispatch() need to call ops.dequeue()?
direct_dispatch() is only used from do_enqueue_task() and there can only be
one direct dispatch attempt on any given enqueue event. A task being
enqueued shouldn't have the OPS_ENQUEUED set and would get dispatched once
to either a terminal or non-terminal DSQ. If terminal, there's nothing to
do. If non-terminal, the flag would need to be set. Am I missing something?
> @@ -1523,6 +1608,31 @@ static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags)
...
> + if (SCX_HAS_OP(sch, dequeue) &&
> + p->scx.flags & SCX_TASK_OPS_ENQUEUED) {
nit: () around & expression.
> + u64 flags = deq_flags;
> +
> + if (!(deq_flags & (DEQUEUE_SLEEP | SCX_DEQ_CORE_SCHED_EXEC)))
> + flags |= SCX_DEQ_SCHED_CHANGE;
> +
> + SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq, p, flags);
> + p->scx.flags &= ~SCX_TASK_OPS_ENQUEUED;
> + }
> break;
> case SCX_OPSS_QUEUEING:
> /*
> @@ -1531,9 +1641,24 @@ static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags)
> */
> BUG();
> case SCX_OPSS_QUEUED:
> - if (SCX_HAS_OP(sch, dequeue))
> - SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq,
> - p, deq_flags);
> + /*
> + * Task is still on the BPF scheduler (not dispatched yet).
> + * Call ops.dequeue() to notify. Add %SCX_DEQ_SCHED_CHANGE
> + * only for property changes, not for core-sched picks or
> + * sleep.
> + *
> + * Clear the flag after calling ops.dequeue(): the task is
> + * leaving BPF scheduler's custody.
> + */
> + if (SCX_HAS_OP(sch, dequeue)) {
> + u64 flags = deq_flags;
> +
> + if (!(deq_flags & (DEQUEUE_SLEEP | SCX_DEQ_CORE_SCHED_EXEC)))
> + flags |= SCX_DEQ_SCHED_CHANGE;
> +
> + SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq, p, flags);
> + p->scx.flags &= ~SCX_TASK_OPS_ENQUEUED;
I wonder whether this and the above block can be factored somehow.
> @@ -1630,6 +1755,7 @@ static void move_local_task_to_local_dsq(struct task_struct *p, u64 enq_flags,
> struct scx_dispatch_q *src_dsq,
> struct rq *dst_rq)
> {
> + struct scx_sched *sch = scx_root;
> struct scx_dispatch_q *dst_dsq = &dst_rq->scx.local_dsq;
>
> /* @dsq is locked and @p is on @dst_rq */
> @@ -1638,6 +1764,15 @@ static void move_local_task_to_local_dsq(struct task_struct *p, u64 enq_flags,
>
> WARN_ON_ONCE(p->scx.holding_cpu >= 0);
>
> + /*
> + * Task is moving from a non-local DSQ to a local DSQ. Call
> + * ops.dequeue() if the task was in BPF custody.
> + */
> + if (SCX_HAS_OP(sch, dequeue) && (p->scx.flags & SCX_TASK_OPS_ENQUEUED)) {
> + SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, dst_rq, p, 0);
> + p->scx.flags &= ~SCX_TASK_OPS_ENQUEUED;
> + }
> +
> if (enq_flags & (SCX_ENQ_HEAD | SCX_ENQ_PREEMPT))
> list_add(&p->scx.dsq_list.node, &dst_dsq->list);
> else
> @@ -2107,6 +2242,36 @@ static void finish_dispatch(struct scx_sched *sch, struct rq *rq,
>
> BUG_ON(!(p->scx.flags & SCX_TASK_QUEUED));
>
> + /*
> + * Handle ops.dequeue() based on destination DSQ.
> + *
> + * Dispatch to terminal DSQs (local DSQs and SCX_DSQ_GLOBAL): the BPF
> + * scheduler is done with the task. Call ops.dequeue() if it was in
> + * BPF custody, then clear the %SCX_TASK_OPS_ENQUEUED flag.
> + *
> + * Dispatch to user DSQs: task is in BPF scheduler's custody.
> + * Mark it so ops.dequeue() will be called when it leaves.
> + */
> + if (SCX_HAS_OP(sch, dequeue)) {
> + if (!is_terminal_dsq(dsq_id)) {
> + p->scx.flags |= SCX_TASK_OPS_ENQUEUED;
> + } else {
Let's do "if (COND) { A } else { B }" instead of "if (!COND) { B } else { A
}". Continuing from earlier, I don't understand why we'd need to set
OPS_ENQUEUED here. Given that a transition to a terminal DSQ is terminal, I
can't think of conditions where we'd need to set OPS_ENQUEUED from
ops.dispatch().
> + /*
> + * Locking: we're holding the @rq lock (the
> + * dispatch CPU's rq), but not necessarily
> + * task_rq(p), since @p may be from a remote CPU.
> + *
> + * This is safe because SCX_OPSS_DISPATCHING state
> + * prevents racing dequeues, any concurrent
> + * ops_dequeue() will wait for this state to clear.
> + */
> + if (p->scx.flags & SCX_TASK_OPS_ENQUEUED)
> + SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq, p, 0);
> +
> + p->scx.flags &= ~SCX_TASK_OPS_ENQUEUED;
> + }
> + }
I'm not sure finish_dispatch() is the right place to do this. e.g.
scx_bpf_dsq_move() can also move tasks from a user DSQ to a terminal DSQ and
the above wouldn't cover it. Wouldn't it make more sense to do this in
dispatch_enqueue()?
> @@ -2894,6 +3059,14 @@ static void scx_enable_task(struct task_struct *p)
>
> lockdep_assert_rq_held(rq);
>
> + /*
> + * Clear enqueue/dequeue tracking flags when enabling the task.
> + * This ensures a clean state when the task enters SCX. Only needed
> + * if ops.dequeue() is implemented.
> + */
> + if (SCX_HAS_OP(sch, dequeue))
> + p->scx.flags &= ~SCX_TASK_OPS_ENQUEUED;
> +
> /*
> * Set the weight before calling ops.enable() so that the scheduler
> * doesn't see a stale value if they inspect the task struct.
> @@ -2925,6 +3098,13 @@ static void scx_disable_task(struct task_struct *p)
> if (SCX_HAS_OP(sch, disable))
> SCX_CALL_OP_TASK(sch, SCX_KF_REST, disable, rq, p);
> scx_set_task_state(p, SCX_TASK_READY);
> +
> + /*
> + * Clear enqueue/dequeue tracking flags when disabling the task.
> + * Only needed if ops.dequeue() is implemented.
> + */
> + if (SCX_HAS_OP(sch, dequeue))
> + p->scx.flags &= ~SCX_TASK_OPS_ENQUEUED;
If we make the flag transitions consistent, we shouldn't need these, right?
We can add WARN_ON_ONCE() at the head of enqueue maybe.
Thanks.
--
tejun