linux-kernel - Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <aYRiM5leoHcWiXw3@gpd4>
Date: Thu, 5 Feb 2026 10:26:11 +0100
From: Andrea Righi <arighi@...dia.com>
To: Tejun Heo <tj@...nel.org>
Cc: David Vernet <void@...ifault.com>, Changwoo Min <changwoo@...lia.com>,
	Kuba Piecuch <jpiecuch@...gle.com>,
	Emil Tsalapatis <emil@...alapatis.com>,
	Christian Loehle <christian.loehle@....com>,
	Daniel Hodges <hodgesd@...a.com>, sched-ext@...ts.linux.dev,
	linux-kernel@...r.kernel.org
Subject: Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics

Hi Tejun,

On Wed, Feb 04, 2026 at 12:14:40PM -1000, Tejun Heo wrote:
> Hello,
> 
> On Wed, Feb 04, 2026 at 05:05:58PM +0100, Andrea Righi wrote:
> > Currently, ops.dequeue() is only invoked when the sched_ext core knows
> > that a task resides in BPF-managed data structures, which causes it to
> > miss scheduling property change events. In addition, ops.dequeue()
> > callbacks are completely skipped when tasks are dispatched to non-local
> > DSQs from ops.select_cpu(). As a result, BPF schedulers cannot reliably
> > track task state.
> > 
> > Fix this by guaranteeing that each task entering the BPF scheduler's
> > custody triggers exactly one ops.dequeue() call when it leaves that
> > custody, whether the exit is due to a dispatch (regular or via a core
> > scheduling pick) or to a scheduling property change (e.g.
> > sched_setaffinity(), sched_setscheduler(), set_user_nice(), NUMA
> > balancing, etc.).
> > 
> > BPF scheduler custody concept: a task is considered to be in "BPF
> > scheduler's custody" when it has been queued in user-created DSQs and
> > the BPF scheduler is responsible for its lifecycle. Custody ends when
> > the task is dispatched to a terminal DSQ (local DSQ or SCX_DSQ_GLOBAL),
> > selected by core scheduling, or removed due to a property change.
> > 
> > Tasks directly dispatched to terminal DSQs bypass the BPF scheduler
> > entirely and are not in its custody. Terminal DSQs include:
> >  - Local DSQs (%SCX_DSQ_LOCAL or %SCX_DSQ_LOCAL_ON): per-CPU queues
> >    where tasks go directly to execution.
> >  - Global DSQ (%SCX_DSQ_GLOBAL): the built-in fallback queue where the
> >    BPF scheduler is considered "done" with the task.
> > 
> > As a result, ops.dequeue() is not invoked for tasks dispatched to
> > terminal DSQs, as the BPF scheduler no longer retains custody of them.
> > 
> > To identify dequeues triggered by scheduling property changes, introduce
> > the new ops.dequeue() flag %SCX_DEQ_SCHED_CHANGE: when this flag is set,
> > the dequeue was caused by a scheduling property change.
> > 
> ...
> > +   **Property Change Notifications for Running Tasks**:
> > +
> > +   For tasks that have left BPF custody (running or on terminal DSQs),
> > +   property changes can be intercepted through the dedicated callbacks:
> 
> I'm not sure this section is necessary. The way it's phrased makes it sound
> like schedulers would use DEQ_SCHED_CHANGE to process property changes but
> that's not the case. Relevant property changes will be notified in whatever
> ways they're notified and a task being dequeued for SCHED_CHANGE doesn't
> necessarily mean there will be an associated property change event either.
> e.g. We don't do anything re. on sched_setnuma().

Agreed, this section is a bit misleading, DEQ_SCHED_CHANGE is an
informational flag indicating the ops.dequeue() wasn't due to dispatch,
schedulers shouldn't use it to process property changes. I'll remove it.

> 
> > @@ -1102,6 +1122,18 @@ static void dispatch_enqueue(struct scx_sched *sch, struct scx_dispatch_q *dsq,
> >  	dsq_mod_nr(dsq, 1);
> >  	p->scx.dsq = dsq;
> >  
> > +	/*
> > +	 * Mark task as in BPF scheduler's custody if being queued to a
> > +	 * non-builtin (user) DSQ. Builtin DSQs (local, global, bypass) are
> > +	 * terminal: tasks on them have left BPF custody.
> > +	 *
> > +	 * Don't touch the flag if already set (e.g., by
> > +	 * mark_direct_dispatch() or direct_dispatch()/finish_dispatch()
> > +	 * for user DSQs).
> > +	 */
> > +	if (SCX_HAS_OP(sch, dequeue) && !(dsq->id & SCX_DSQ_FLAG_BUILTIN))
> > +		p->scx.flags |= SCX_TASK_OPS_ENQUEUED;
> 
> given that this is tied to dequeue, maybe a more direct name would be less
> confusing? e.g. something like SCX_TASK_NEED_DEQ?

Ack.

> 
> > @@ -1274,6 +1306,24 @@ static void mark_direct_dispatch(struct scx_sched *sch,
> >  
> >  	p->scx.ddsp_dsq_id = dsq_id;
> >  	p->scx.ddsp_enq_flags = enq_flags;
> > +
> > +	/*
> > +	 * Mark the task as entering BPF scheduler's custody if it's being
> > +	 * dispatched to a non-terminal DSQ (i.e., custom user DSQs). This
> > +	 * handles the case where ops.select_cpu() directly dispatches - even
> > +	 * though ops.enqueue() won't be called, the task enters BPF custody
> > +	 * if dispatched to a user DSQ and should get ops.dequeue() when it
> > +	 * leaves.
> > +	 *
> > +	 * For terminal DSQs (local DSQs and SCX_DSQ_GLOBAL), ensure the flag
> > +	 * is clear since the BPF scheduler is done with the task.
> > +	 */
> > +	if (SCX_HAS_OP(sch, dequeue)) {
> > +		if (!is_terminal_dsq(dsq_id))
> > +			p->scx.flags |= SCX_TASK_OPS_ENQUEUED;
> > +		else
> > +			p->scx.flags &= ~SCX_TASK_OPS_ENQUEUED;
> > +	}
> 
> Hmm... I'm a bit confused on why this needs to be in mark_direct_dispatch()
> AND dispatch_enqueue(). The flag should be clear when off SCX. The only
> places where it could be set is from the enqueue path - when a task is
> direct dispatched to a non-terminal DSQ or BPF. Both cases can be reliably
> captured in do_enqueue_task(), no?

You're right. I was incorrectly assuming we needed this in
mark_direct_dispatch() to catch direct dispatches to user DSQs from
ops.select_cpu(), but that's not true. All paths go through
do_enqueue_task() which funnels to dispatch_enqueue(), so we can handle it
all in one place.

> 
> >  static void direct_dispatch(struct scx_sched *sch, struct task_struct *p,
> > @@ -1287,6 +1337,41 @@ static void direct_dispatch(struct scx_sched *sch, struct task_struct *p,
> ...
> > +	if (SCX_HAS_OP(sch, dequeue)) {
> > +		if (!is_terminal_dsq(dsq->id)) {
> > +			p->scx.flags |= SCX_TASK_OPS_ENQUEUED;
> > +		} else {
> > +			if (p->scx.flags & SCX_TASK_OPS_ENQUEUED)
> > +				SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq, p, 0);
> > +			p->scx.flags &= ~SCX_TASK_OPS_ENQUEUED;
> > +		}
> > +	}
> 
> And when would direct_dispatch() need to call ops.dequeue()?
> direct_dispatch() is only used from do_enqueue_task() and there can only be
> one direct dispatch attempt on any given enqueue event. A task being
> enqueued shouldn't have the OPS_ENQUEUED set and would get dispatched once
> to either a terminal or non-terminal DSQ. If terminal, there's nothing to
> do. If non-terminal, the flag would need to be set. Am I missing something?

Nah, you're right, direct_dispatch() doesn't need to call ops.dequeue() or
manage the flag. I'll remove all the flag management from direct_dispatch()
and centralize it in dispatch_enqueue().

> 
> > @@ -1523,6 +1608,31 @@ static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags)
> ...
> > +		if (SCX_HAS_OP(sch, dequeue) &&
> > +		    p->scx.flags & SCX_TASK_OPS_ENQUEUED) {
> 
> nit: () around & expression.
> 
> > +			u64 flags = deq_flags;
> > +
> > +			if (!(deq_flags & (DEQUEUE_SLEEP | SCX_DEQ_CORE_SCHED_EXEC)))
> > +				flags |= SCX_DEQ_SCHED_CHANGE;
> > +
> > +			SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq, p, flags);
> > +			p->scx.flags &= ~SCX_TASK_OPS_ENQUEUED;
> > +		}
> >  		break;
> >  	case SCX_OPSS_QUEUEING:
> >  		/*
> > @@ -1531,9 +1641,24 @@ static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags)
> >  		 */
> >  		BUG();
> >  	case SCX_OPSS_QUEUED:
> > -		if (SCX_HAS_OP(sch, dequeue))
> > -			SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq,
> > -					 p, deq_flags);
> > +		/*
> > +		 * Task is still on the BPF scheduler (not dispatched yet).
> > +		 * Call ops.dequeue() to notify. Add %SCX_DEQ_SCHED_CHANGE
> > +		 * only for property changes, not for core-sched picks or
> > +		 * sleep.
> > +		 *
> > +		 * Clear the flag after calling ops.dequeue(): the task is
> > +		 * leaving BPF scheduler's custody.
> > +		 */
> > +		if (SCX_HAS_OP(sch, dequeue)) {
> > +			u64 flags = deq_flags;
> > +
> > +			if (!(deq_flags & (DEQUEUE_SLEEP | SCX_DEQ_CORE_SCHED_EXEC)))
> > +				flags |= SCX_DEQ_SCHED_CHANGE;
> > +
> > +			SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq, p, flags);
> > +			p->scx.flags &= ~SCX_TASK_OPS_ENQUEUED;
> 
> I wonder whether this and the above block can be factored somehow.

Ack, we can add a helper for this.

> 
> > @@ -1630,6 +1755,7 @@ static void move_local_task_to_local_dsq(struct task_struct *p, u64 enq_flags,
> >  					 struct scx_dispatch_q *src_dsq,
> >  					 struct rq *dst_rq)
> >  {
> > +	struct scx_sched *sch = scx_root;
> >  	struct scx_dispatch_q *dst_dsq = &dst_rq->scx.local_dsq;
> >  
> >  	/* @dsq is locked and @p is on @dst_rq */
> > @@ -1638,6 +1764,15 @@ static void move_local_task_to_local_dsq(struct task_struct *p, u64 enq_flags,
> >  
> >  	WARN_ON_ONCE(p->scx.holding_cpu >= 0);
> >  
> > +	/*
> > +	 * Task is moving from a non-local DSQ to a local DSQ. Call
> > +	 * ops.dequeue() if the task was in BPF custody.
> > +	 */
> > +	if (SCX_HAS_OP(sch, dequeue) && (p->scx.flags & SCX_TASK_OPS_ENQUEUED)) {
> > +		SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, dst_rq, p, 0);
> > +		p->scx.flags &= ~SCX_TASK_OPS_ENQUEUED;
> > +	}
> > +
> >  	if (enq_flags & (SCX_ENQ_HEAD | SCX_ENQ_PREEMPT))
> >  		list_add(&p->scx.dsq_list.node, &dst_dsq->list);
> >  	else
> > @@ -2107,6 +2242,36 @@ static void finish_dispatch(struct scx_sched *sch, struct rq *rq,
> >  
> >  	BUG_ON(!(p->scx.flags & SCX_TASK_QUEUED));
> >  
> > +	/*
> > +	 * Handle ops.dequeue() based on destination DSQ.
> > +	 *
> > +	 * Dispatch to terminal DSQs (local DSQs and SCX_DSQ_GLOBAL): the BPF
> > +	 * scheduler is done with the task. Call ops.dequeue() if it was in
> > +	 * BPF custody, then clear the %SCX_TASK_OPS_ENQUEUED flag.
> > +	 *
> > +	 * Dispatch to user DSQs: task is in BPF scheduler's custody.
> > +	 * Mark it so ops.dequeue() will be called when it leaves.
> > +	 */
> > +	if (SCX_HAS_OP(sch, dequeue)) {
> > +		if (!is_terminal_dsq(dsq_id)) {
> > +			p->scx.flags |= SCX_TASK_OPS_ENQUEUED;
> > +		} else {
> 
> Let's do "if (COND) { A } else { B }" instead of "if (!COND) { B } else { A
> }". Continuing from earlier, I don't understand why we'd need to set
> OPS_ENQUEUED here. Given that a transition to a terminal DSQ is terminal, I
> can't think of conditions where we'd need to set OPS_ENQUEUED from
> ops.dispatch().

Right, a task that reaches ops.dispatch() is already in QUEUED state, if
it's in a user DSQ the flag is already set from when it was enqueued, so
there's no need to set the flag in finish_dispatch().

> 
> > +			/*
> > +			 * Locking: we're holding the @rq lock (the
> > +			 * dispatch CPU's rq), but not necessarily
> > +			 * task_rq(p), since @p may be from a remote CPU.
> > +			 *
> > +			 * This is safe because SCX_OPSS_DISPATCHING state
> > +			 * prevents racing dequeues, any concurrent
> > +			 * ops_dequeue() will wait for this state to clear.
> > +			 */
> > +			if (p->scx.flags & SCX_TASK_OPS_ENQUEUED)
> > +				SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq, p, 0);
> > +
> > +			p->scx.flags &= ~SCX_TASK_OPS_ENQUEUED;
> > +		}
> > +	}
> 
> I'm not sure finish_dispatch() is the right place to do this. e.g.
> scx_bpf_dsq_move() can also move tasks from a user DSQ to a terminal DSQ and
> the above wouldn't cover it. Wouldn't it make more sense to do this in
> dispatch_enqueue()?

Agreed.

> 
> > @@ -2894,6 +3059,14 @@ static void scx_enable_task(struct task_struct *p)
> >  
> >  	lockdep_assert_rq_held(rq);
> >  
> > +	/*
> > +	 * Clear enqueue/dequeue tracking flags when enabling the task.
> > +	 * This ensures a clean state when the task enters SCX. Only needed
> > +	 * if ops.dequeue() is implemented.
> > +	 */
> > +	if (SCX_HAS_OP(sch, dequeue))
> > +		p->scx.flags &= ~SCX_TASK_OPS_ENQUEUED;
> > +
> >  	/*
> >  	 * Set the weight before calling ops.enable() so that the scheduler
> >  	 * doesn't see a stale value if they inspect the task struct.
> > @@ -2925,6 +3098,13 @@ static void scx_disable_task(struct task_struct *p)
> >  	if (SCX_HAS_OP(sch, disable))
> >  		SCX_CALL_OP_TASK(sch, SCX_KF_REST, disable, rq, p);
> >  	scx_set_task_state(p, SCX_TASK_READY);
> > +
> > +	/*
> > +	 * Clear enqueue/dequeue tracking flags when disabling the task.
> > +	 * Only needed if ops.dequeue() is implemented.
> > +	 */
> > +	if (SCX_HAS_OP(sch, dequeue))
> > +		p->scx.flags &= ~SCX_TASK_OPS_ENQUEUED;
> 
> If we make the flag transitions consistent, we shouldn't need these, right?
> We can add WARN_ON_ONCE() at the head of enqueue maybe.

Correct.

Thanks for the review! I'll post a new version.

-Andrea