Message-ID: <aYNnmI26oS7YNuMP@gpd4>
Date: Wed, 4 Feb 2026 16:36:56 +0100
From: Andrea Righi <arighi@...dia.com>
To: Kuba Piecuch <jpiecuch@...gle.com>
Cc: Tejun Heo <tj@...nel.org>, David Vernet <void@...ifault.com>,
	Changwoo Min <changwoo@...lia.com>,
	Christian Loehle <christian.loehle@....com>,
	Emil Tsalapatis <emil@...alapatis.com>,
	Daniel Hodges <hodgesd@...a.com>, sched-ext@...ts.linux.dev,
	linux-kernel@...r.kernel.org
Subject: Re: [PATCH] sched_ext: Invalidate dispatch decisions on CPU affinity
 changes

Hi Kuba,

On Wed, Feb 04, 2026 at 01:20:52PM +0000, Kuba Piecuch wrote:
> Hi Andrea,
> 
> On Tue Feb 3, 2026 at 11:06 PM UTC, Andrea Righi wrote:
> > A BPF scheduler may rely on p->cpus_ptr from ops.dispatch() to select a
> > target CPU. However, task affinity can change between the dispatch
> > decision and its finalization in finish_dispatch(). When this happens,
> > the scheduler may attempt to dispatch a task to a CPU that is no longer
> > allowed, resulting in fatal errors such as:
> >
> >  EXIT: runtime error (SCX_DSQ_LOCAL[_ON] target CPU 10 not allowed for stress-ng-race-[13565])
> >
> > This race exists because ops.dispatch() runs without holding the task's
> > run queue lock, allowing a concurrent set_cpus_allowed() to update
> > p->cpus_ptr while the BPF scheduler is still using it. The dispatch is
> > then finalized using stale affinity information.
> >
> > Example timeline:
> >
> >   CPU0                                      CPU1
> >   ----                                      ----
> >                                             task_rq_lock(p)
> >   if (cpumask_test_cpu(cpu, p->cpus_ptr))
> >                                             set_cpus_allowed_scx(p, new_mask)
> >                                             task_rq_unlock(p)
> >       scx_bpf_dsq_insert(p,
> >               SCX_DSQ_LOCAL_ON | cpu, 0)
> >
> > Fix this by extending the existing qseq invalidation mechanism to also
> > cover CPU affinity changes, in addition to task dequeues/re-enqueues,
> > occurring between dispatch decision and finalization.
> 
> I'm not convinced this will prevent the exact race scenario you describe.
> For the qseq increment inside set_cpus_allowed_scx() to be noticed, it needs
> to happen _after_ scx_bpf_dsq_insert() reads the qseq and stores it in the
> dispatch buffer entry.
> 
> We can still have the cpumask change and qseq increment on CPU1 happen between
> cpumask_test_cpu() and scx_bpf_dsq_insert() on CPU0, and it will not be
> detected because the dispatch buffer entry will contain the incremented value.

Yeah, I think you're right: it makes things better, but it's still racy.
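
Spelling out the interleaving you're describing (if I'm reading it
correctly), the window just moves a bit earlier:

  CPU0                                      CPU1
  ----                                      ----
  if (cpumask_test_cpu(cpu, p->cpus_ptr))
                                            task_rq_lock(p)
                                            set_cpus_allowed_scx(p, new_mask)
                                              -> qseq incremented
                                            task_rq_unlock(p)
  scx_bpf_dsq_insert(p,
          SCX_DSQ_LOCAL_ON | cpu, 0)
    -> records the already-incremented
       qseq in the dispatch buffer
  finish_dispatch()
    -> qseq matches, so the stale
       dispatch is not dropped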

> 
> >
> > When finish_dispatch() detects a qseq mismatch, the dispatch is dropped
> > and the task is returned to the SCX_OPSS_QUEUED state, allowing it to be
> > re-dispatched using up-to-date affinity information.
> 
> How will the scheduler know that the dispatch was dropped? Is the scheduler
> expected to infer it from the ops.enqueue() that follows set_cpus_allowed_scx()
> on CPU1?

The idea was that, if the dispatch is dropped, we'll see another
ops.enqueue() for the task, so at least the task is not "lost" and the
BPF scheduler gets another chance to decide what to do with it. In this
case it'd be useful to set SCX_ENQ_REENQ (or a dedicated special flag) to
indicate that the enqueue resulted from a dropped dispatch.
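
On the BPF side it could then look something like this (just a sketch,
assuming we reuse SCX_ENQ_REENQ; scheduler and DSQ names are made up):

  void BPF_STRUCT_OPS(foo_enqueue, struct task_struct *p, u64 enq_flags)
  {
          if (enq_flags & SCX_ENQ_REENQ) {
                  /*
                   * A previous dispatch of @p was dropped (e.g. its
                   * affinity changed under us): re-pick the target CPU
                   * using the now up-to-date p->cpus_ptr.
                   */
          }
          scx_bpf_dsq_insert(p, SHARED_DSQ, SCX_SLICE_DFL, enq_flags);
  }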

> 
> >
> > Signed-off-by: Andrea Righi <arighi@...dia.com>
> > ---
> >  kernel/sched/ext.c | 58 +++++++++++++++++++++++++++++++++++++---------
> >  1 file changed, 47 insertions(+), 11 deletions(-)
> >
> > diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> > index 0ab994180f655..6128a2529a7c7 100644
> > --- a/kernel/sched/ext.c
> > +++ b/kernel/sched/ext.c
> > @@ -1828,8 +1828,9 @@ static bool consume_remote_task(struct rq *this_rq, struct task_struct *p,
> >   * will change. As @p's task_rq is locked, this function doesn't need to use the
> >   * holding_cpu mechanism.
> >   *
> > - * On return, @src_dsq is unlocked and only @p's new task_rq, which is the
> > - * return value, is locked.
> > + * On success, @src_dsq is unlocked and only @p's new task_rq, which is the
> > + * return value, is locked. On failure (affinity change invalidated the move),
> > + * returns NULL with @src_dsq unlocked and task remaining in @src_dsq.
> >   */
> >  static struct rq *move_task_between_dsqs(struct scx_sched *sch,
> >  					 struct task_struct *p, u64 enq_flags,
> > @@ -1845,9 +1846,13 @@ static struct rq *move_task_between_dsqs(struct scx_sched *sch,
> >  	if (dst_dsq->id == SCX_DSQ_LOCAL) {
> >  		dst_rq = container_of(dst_dsq, struct rq, scx.local_dsq);
> >  		if (src_rq != dst_rq &&
> > -		    unlikely(!task_can_run_on_remote_rq(sch, p, dst_rq, true))) {
> > -			dst_dsq = find_global_dsq(sch, p);
> > -			dst_rq = src_rq;
> > +		    unlikely(!task_can_run_on_remote_rq(sch, p, dst_rq, false))) {
> > +			/*
> > +			 * Task affinity changed after dispatch decision:
> > +			 * drop the dispatch, caller will handle returning
> > +			 * the task to its original DSQ.
> > +			 */
> > +			return NULL;
> >  		}
> >  	} else {
> >  		/* no need to migrate if destination is a non-local DSQ */
> > @@ -1974,9 +1979,15 @@ static void dispatch_to_local_dsq(struct scx_sched *sch, struct rq *rq,
> >  	}
> >  
> >  	if (src_rq != dst_rq &&
> > -	    unlikely(!task_can_run_on_remote_rq(sch, p, dst_rq, true))) {
> > -		dispatch_enqueue(sch, find_global_dsq(sch, p), p,
> > -				 enq_flags | SCX_ENQ_CLEAR_OPSS);
> > +	    unlikely(!task_can_run_on_remote_rq(sch, p, dst_rq, false))) {
> > +		/*
> > +		 * Task affinity changed after dispatch decision: drop the
> > +		 * dispatch, task remains in its current state and will be
> > +		 * dispatched again in a future cycle.
> > +		 */
> > +		atomic_long_set_release(&p->scx.ops_state, SCX_OPSS_QUEUED |
> > +					(atomic_long_read(&p->scx.ops_state) &
> > +					 SCX_OPSS_QSEQ_MASK));
> >  		return;
> >  	}
> >  
> > @@ -2616,12 +2627,30 @@ static void set_cpus_allowed_scx(struct task_struct *p,
> >  				 struct affinity_context *ac)
> >  {
> >  	struct scx_sched *sch = scx_root;
> > +	struct rq *rq = task_rq(p);
> > +
> > +	lockdep_assert_rq_held(rq);
> >  
> >  	set_cpus_allowed_common(p, ac);
> >  
> >  	if (unlikely(!sch))
> >  		return;
> >  
> > +	/*
> > +	 * Affinity changes invalidate any pending dispatch decisions made
> > +	 * with the old affinity. Increment the runqueue's ops_qseq and
> > +	 * update the task's qseq to invalidate in-flight dispatches.
> > +	 */
> > +	if (p->scx.flags & SCX_TASK_QUEUED) {
> > +		unsigned long opss;
> > +
> > +		rq->scx.ops_qseq++;
> > +		opss = atomic_long_read(&p->scx.ops_state);
> > +		atomic_long_set(&p->scx.ops_state,
> > +				(opss & SCX_OPSS_STATE_MASK) |
> > +				(rq->scx.ops_qseq << SCX_OPSS_QSEQ_SHIFT));
> > +	}
> 
> Will we ever enter this if statement? IIUC set_cpus_allowed_scx() is always
> called under `scoped_guard (sched_change, p, DEQUEUE_SAVE)`, so assuming the
> task is on a runqueue, set_cpus_allowed_scx() will always be preceded by
> dequeue_task_scx(), which clears SCX_TASK_QUEUED.

I think you're right; we can probably remove this.
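
For reference, the path I had in mind is roughly (from memory, so the
exact call chain may differ):

  /* __do_set_cpus_allowed(), more or less: */
  scoped_guard (sched_change, p, DEQUEUE_SAVE) {
          /* guard entry: dequeue_task() clears SCX_TASK_QUEUED */
          p->sched_class->set_cpus_allowed(p, ctx);  /* -> set_cpus_allowed_scx() */
  }       /* guard exit: enqueue_task() re-enqueues the task */

so by the time set_cpus_allowed_scx() runs, SCX_TASK_QUEUED has already
been cleared and the new check never fires.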

> 
> > +
> >  	/*
> >  	 * The effective cpumask is stored in @p->cpus_ptr which may temporarily
> >  	 * differ from the configured one in @p->cpus_mask. Always tell the bpf
> > @@ -6013,14 +6042,21 @@ static bool scx_dsq_move(struct bpf_iter_scx_dsq_kern *kit,
> >  
> >  	/* execute move */
> >  	locked_rq = move_task_between_dsqs(sch, p, enq_flags, src_dsq, dst_dsq);
> > -	dispatched = true;
> > +	if (locked_rq) {
> > +		dispatched = true;
> > +	} else {
> > +		/* Move failed: task stays in src_dsq */
> > +		raw_spin_unlock(&src_dsq->lock);
> > +		locked_rq = in_balance ? this_rq : NULL;
> > +	}
> >  out:
> >  	if (in_balance) {
> >  		if (this_rq != locked_rq) {
> > -			raw_spin_rq_unlock(locked_rq);
> > +			if (locked_rq)
> > +				raw_spin_rq_unlock(locked_rq);
> >  			raw_spin_rq_lock(this_rq);
> >  		}
> > -	} else {
> > +	} else if (locked_rq) {
> >  		raw_spin_rq_unlock_irqrestore(locked_rq, flags);
> >  	}
> >  
> 
> Thanks,
> Kuba
> 

Thanks,
-Andrea
