linux-kernel - Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <aYzc7ZVkSrtdeHnA@slm.duckdns.org>
Date: Wed, 11 Feb 2026 09:47:57 -1000
From: Tejun Heo <tj@...nel.org>
To: Andrea Righi <arighi@...dia.com>
Cc: David Vernet <void@...ifault.com>, Changwoo Min <changwoo@...lia.com>,
	Kuba Piecuch <jpiecuch@...gle.com>,
	Emil Tsalapatis <emil@...alapatis.com>,
	Christian Loehle <christian.loehle@....com>,
	Daniel Hodges <hodgesd@...a.com>, sched-ext@...ts.linux.dev,
	linux-kernel@...r.kernel.org
Subject: Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics

Hello,

On Wed, Feb 11, 2026 at 05:06:20PM +0100, Andrea Righi wrote:
...
> > Please use () do clarify ordering between & and &&. It's just visually
> > confusing. I wonder whether it'd be cleaner to make it take @dsq instead of
> > @dsq_id and then it can just do:
> > 
> >         return dsq->id == SCX_DSQ_LOCAL || dsq->id == SCX_DSQ_GLOBAL;
> > 
> > because SCX_DSQ_LOCAL_ON is only used as the designator not as actual DSQ
> > id, and the above code positively identifies what's terminal.
> 
> Ok, but we also need to include SCX_DSQ_BYPASS, in that case maybe checking
> SCX_DSQ_FLAG_BUILTIN is more generic?

Ah, forgot about that. Hmm... we can do:

        switch (dsq->id) {
                case SCX_DSQ_LOCAL:
                case SCX_DSQ_GLOBAL:
                case SCX_DSQ_BYPASS:
                        return true;
                default:
                        return false;
        }

I just feel iffy about not being specific. Easier to make mistakes in the
future and more difficult to notice after doing so, but I think this point
is kinda moot. If we break up LOCAL and GLOBAL/BYPASS handling into separate
paths in dispatch_enqueue(), we won't need this function anyway.

> > > @@ -1524,6 +1590,12 @@ static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags)
> > >  
> > >  	switch (opss & SCX_OPSS_STATE_MASK) {
> > >  	case SCX_OPSS_NONE:
> > > +		/*
> > > +		 * If the task is still in BPF scheduler's custody
> > > +		 * (%SCX_TASK_IN_CUSTODY is set) call ops.dequeue().
> > > +		 */
> > > +		if (p->scx.flags & SCX_TASK_IN_CUSTODY)
> > > +			call_task_dequeue(sch, rq, p, deq_flags, true);
> > 
> > Hmm... why is this path necessary? Shouldn't the one that cleared OPSS be
> > responsible for clearing IN_CUSTODY too?
> 
> The path that clears OPSS to NONE doesn't always clear IN_CUSTODY: in
> dispatch_to_local_dsq(), when we're moving a task that was in DISPATCHING
> to a remote CPU's local DSQ, we only set ops_state to NONE, so a concurrent
> dequeue can proceed, but we only clear IN_CUSTODY when we later enqueue or
> move the task. So we can see NONE + IN_CUSTODY here and need to handle it.
> And we can't clear IN_CUSTODY at the same time we set NONE there, because
> we don't hold the task's rq lock yet and we can't trigger ops.dequeue().

I see. Can you please add a comment with the above?

...
> > I think a better place to put this would be inside local_dsq_post_enq() so
> > that dispatch_enqueue() and move_local_task_to_local_dsq() can share the
> > path. This would mean breaking out local and global cases in
> > dispatch_enqueue(). ie. at the end of dispatch_enqueue():
> > 
> >         if (is_local) {
> >                 local_dsq_post_enq(...);
> >         } else {
> >                 if (dsq->id == SCX_DSQ_GLOBAL)
> >                         global_dsq_post_enq(...);       /* or open code with comment */
> >                 raw_spin_unlock(&dsq->lock);
> >         }
> 
> Agreed, I'll move this into local_dsq_post_enq() and introduce
> a global_dsq_post_enq().

Yeah, and as you pointed out, BYPASS.

> > > +static bool consume_remote_task(struct scx_sched *sch, struct rq *this_rq,
> > > +			       struct task_struct *p,
> > > +			       struct scx_dispatch_q *dsq, struct rq *src_rq)
> > >  {
> > >  	raw_spin_rq_unlock(this_rq);
> > >  
> > >  	if (unlink_dsq_and_lock_src_rq(p, dsq, src_rq)) {
> > > +		/*
> > > +		 * Task is moving from a non-local DSQ to a local (terminal) DSQ.
> > > +		 * Call ops.dequeue() if the task was in BPF custody.
> > > +		 */
> > > +		if (p->scx.flags & SCX_TASK_IN_CUSTODY)
> > > +			call_task_dequeue(sch, src_rq, p, 0, false);
> > 
> > and this shouldn't be necessary. move_remote_task_to_local_dsq() deactivates
> > and reactivates the task. The deactivation invokes ops_dequeue() but that
> > should suppress dequeue invocation as that's internal transfer (this is
> > discernable from p->on_rq being set to TASK_ON_RQ_MIGRATING) and when it
> > gets enqueued on the target CPU, dispatch_enqueue() on the local DSQ should
> > trigger dequeue invocation, right?
> 
> Should we trigger ops.dequeue() when the task is dequeued inside
> move_remote_task_to_local_dsq() (in ops_dequeue() on the path triggered by
> deactivate_task() there) instead of suppressing it and invoking on the
> target in local_dsq_post_enq()?
> 
> That way the BPF sees dequeue on the source and then enqueue on the target,
> we avoid special-casing SCX_TASK_IN_CUSTODY in do_enqueue_task() and the
> "when to call dequeue" logic stays consistent in ops_dequeue and the
> terminal local/global post_enq paths.
> 
> Does it make sense or would you rather suppress it and only invoke on the
> target when the task lands on the local DSQ??

The end result is about the same because whenever we migrate we're sending
it to the local DSQ of the destination CPU, so whether we generate the event
on deactivation of the source CPU or activation on the destination doesn't
make *whole* lot of difference. However, conceptually, migrations are
internal events. There isn't anything actionable for the BPF scheduler. The
reason why ops.dequeue() should be emitted is not because the task is
changing CPUs (which caused the deactivation) but the fact that it ends up
in a local DSQ afterwards. I think it'll be cleaner both conceptually and
code-wise to emit ops.dequeue() only from dispatch_enqueue() and dequeue
paths.

Thanks.

-- 
tejun