linux-kernel - Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics

Message-ID: <aYMbZcRNR5AUNiUt@gpd4>
Date: Wed, 4 Feb 2026 11:11:49 +0100
From: Andrea Righi <arighi@...dia.com>
To: Kuba Piecuch <jpiecuch@...gle.com>
Cc: Tejun Heo <tj@...nel.org>, David Vernet <void@...ifault.com>,
	Changwoo Min <changwoo@...lia.com>,
	Emil Tsalapatis <emil@...alapatis.com>,
	Christian Loehle <christian.loehle@....com>,
	Daniel Hodges <hodgesd@...a.com>, sched-ext@...ts.linux.dev,
	linux-kernel@...r.kernel.org
Subject: Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics

On Mon, Feb 02, 2026 at 11:56:43AM +0000, Kuba Piecuch wrote:
> Hi Andrea,
> 
> Looks good overall, but we need to settle on the global DSQ semantics, plus
> some edge cases that need clearing up.

On this one I think we settled on the assumption that SCX_DSQ_GLOBAL can be
considered a "terminal DSQ", so we won't trigger ops.dequeue() for tasks
dispatched there.
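
As a rough illustration of the "terminal DSQ" idea, here is a user-space
sketch, not kernel code. All DEMO_* names and values are hypothetical
stand-ins chosen for this example; the assumption is that every built-in DSQ
(local, local-on, global) is terminal and only user-created DSQs keep the
task in BPF custody.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical DSQ ID encoding for this sketch, NOT the kernel's definitions. */
#define DEMO_DSQ_FLAG_BUILTIN	(1ULL << 63)
#define DEMO_DSQ_FLAG_LOCAL_ON	(1ULL << 62)
#define DEMO_DSQ_GLOBAL		(DEMO_DSQ_FLAG_BUILTIN | 1)
#define DEMO_DSQ_LOCAL		(DEMO_DSQ_FLAG_BUILTIN | 2)
#define DEMO_DSQ_LOCAL_ON	(DEMO_DSQ_FLAG_BUILTIN | DEMO_DSQ_FLAG_LOCAL_ON)

/*
 * Assumption: a DSQ is "terminal" when it is built-in, i.e. the BPF
 * scheduler gives up control of the task once it is inserted there.
 */
static bool demo_dsq_is_terminal(uint64_t dsq_id)
{
	return (dsq_id & DEMO_DSQ_FLAG_BUILTIN) != 0;
}

/* A task enters BPF custody only when inserted into a non-terminal DSQ. */
static bool demo_enters_bpf_custody(uint64_t dsq_id)
{
	return !demo_dsq_is_terminal(dsq_id);
}
```

With this model, demo_enters_bpf_custody(DEMO_DSQ_GLOBAL) is false, so no
ops.dequeue() would be expected for a task sitting on the global DSQ, while
a user DSQ ID such as 42 yields true.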

> 
> On Sun Feb 1, 2026 at 9:08 AM UTC, Andrea Righi wrote:
> > diff --git a/Documentation/scheduler/sched-ext.rst b/Documentation/scheduler/sched-ext.rst
> > index 404fe6126a769..6d9e82e6ca9d4 100644
> > --- a/Documentation/scheduler/sched-ext.rst
> > +++ b/Documentation/scheduler/sched-ext.rst
> > @@ -252,6 +252,80 @@ The following briefly shows how a waking task is scheduled and executed.
> >  
> >     * Queue the task on the BPF side.
> >  
> > +   **Task State Tracking and ops.dequeue() Semantics**
> > +
> > +   Once ``ops.select_cpu()`` or ``ops.enqueue()`` is called, the task may
> > +   enter the "BPF scheduler's custody" depending on where it's dispatched:
> > +
> > +   * **Direct dispatch to local DSQs** (``SCX_DSQ_LOCAL`` or
> > +     ``SCX_DSQ_LOCAL_ON | cpu``): The task bypasses the BPF scheduler
> > +     entirely and goes straight to the CPU's local run queue. The task
> > +     never enters BPF custody, and ``ops.dequeue()`` will not be called.
> > +
> > +   * **Dispatch to non-local DSQs** (``SCX_DSQ_GLOBAL`` or custom DSQs):
> > +     the task enters the BPF scheduler's custody. When the task later
> > +     leaves BPF custody (dispatched to a local DSQ, picked by core-sched,
> > +     or dequeued for sleep/property changes), ``ops.dequeue()`` will be
> > +     called exactly once.
> > +
> > +   * **Queued on BPF side**: The task is in BPF data structures and in BPF
> > +     custody, ``ops.dequeue()`` will be called when it leaves.
> > +
> > +   The key principle: **ops.dequeue() is called when a task leaves the BPF
> > +   scheduler's custody**. A task is in BPF custody if it's on a non-local
> > +   DSQ or in BPF data structures. Once dispatched to a local DSQ or after
> > +   ops.dequeue() is called, the task is out of BPF custody and the BPF
> > +   scheduler no longer needs to track it.
> > +
> > +   This works correctly with the ``ops.select_cpu()`` direct dispatch
> > +   optimization: even though it skips ``ops.enqueue()`` invocation, if the
> > +   task is dispatched to a non-local DSQ, it enters BPF custody and will
> > +   get ``ops.dequeue()`` when it leaves. This provides the performance
> > +   benefit of avoiding the ``ops.enqueue()`` roundtrip while maintaining
> > +   correct state tracking.
> > +
> > +   The dequeue can happen for different reasons, distinguished by flags:
> > +
> > +   1. **Regular dispatch workflow**: when the task is dispatched from a
> > +      non-local DSQ to a local DSQ (leaving BPF custody for execution),
> > +      ``ops.dequeue()`` is triggered without any special flags.
> 
> Maybe add a note that this can happen asynchronously, without the BPF
> scheduler explicitly dispatching the task to a local DSQ, when the task
> is on a global DSQ? Or maybe make that case into a separate dequeue reason
> with its own flag, e.g. SCX_DEQ_PICKED_FROM_GLOBAL_DSQ?

And I guess we don't need this if we consider SCX_DSQ_GLOBAL as a terminal
DSQ, because we won't trigger ops.dequeue().

> 
> > diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h
> > index bcb962d5ee7d8..0d003d2845393 100644
> > --- a/include/linux/sched/ext.h
> > +++ b/include/linux/sched/ext.h
> > @@ -84,6 +84,7 @@ struct scx_dispatch_q {
> >  /* scx_entity.flags */
> >  enum scx_ent_flags {
> >  	SCX_TASK_QUEUED		= 1 << 0, /* on ext runqueue */
> > +	SCX_TASK_OPS_ENQUEUED	= 1 << 1, /* under ext scheduler's custody */
> 
> Nit: I think "in BPF scheduler's custody" would be a bit clearer, as
> "ext scheduler" could potentially be interpreted to mean SCHED_CLASS_EXT
> as a whole.

Ack. Will change that.

> 
> > @@ -1523,6 +1603,30 @@ static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags)
> >  
> >  	switch (opss & SCX_OPSS_STATE_MASK) {
> >  	case SCX_OPSS_NONE:
> > +		/*
> > +		 * Task is not in BPF data structures (either dispatched to
> > +		 * a DSQ or running). Only call ops.dequeue() if the task
> > +		 * is still in BPF scheduler's custody
> > +		 * (%SCX_TASK_OPS_ENQUEUED is set).
> > +		 *
> > +		 * If the task has already been dispatched to a local DSQ
> > +		 * (left BPF custody), the flag will be clear and we skip
> > +		 * ops.dequeue()
> > +		 *
> > +		 * If this is a property change (not sleep/core-sched) and
> > +		 * the task is still in BPF custody, set the
> > +		 * %SCX_DEQ_SCHED_CHANGE flag.
> > +		 */
> > +		if (SCX_HAS_OP(sch, dequeue) &&
> > +		    p->scx.flags & SCX_TASK_OPS_ENQUEUED) {
> > +			u64 flags = deq_flags;
> > +
> > +			if (!(deq_flags & (DEQUEUE_SLEEP | SCX_DEQ_CORE_SCHED_EXEC)))
> > +				flags |= SCX_DEQ_SCHED_CHANGE;
> 
> I think this logic will result in ops.dequeue(SCHED_CHANGE) being called for
> tasks being picked from a global DSQ being migrated from a remote rq to the
> local rq, which, while technically correct since the task is migrating rqs,
> may be confusing, since it fits two cases in the documentation:
> 
> * Since the task is leaving BPF custody for execution, ops.dequeue() should be
>   called without any special flags.
> * Since the task is being migrated between rqs, ops.dequeue() should be called
>   with SCX_DEQ_SCHED_CHANGE.

This should also be fixed with the new logic, because a task dispatched to a
global DSQ is considered outside of the BPF scheduler's custody, so
ops.dequeue() is not invoked at all.
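
To make the intended behavior concrete, here is a minimal user-space model
of the SCX_OPSS_NONE branch under that assumption. It is only a sketch: the
demo_* names, the struct layout and the flag values are hypothetical, and
in_bpf_custody merely stands in for SCX_TASK_OPS_ENQUEUED.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag values for this model only. */
enum demo_deq_flags {
	DEMO_DEQ_SLEEP			= 1u << 0,
	DEMO_DEQ_CORE_SCHED_EXEC	= 1u << 1,
	DEMO_DEQ_SCHED_CHANGE		= 1u << 2,
};

enum demo_dsq_kind { DEMO_KIND_LOCAL, DEMO_KIND_GLOBAL, DEMO_KIND_USER };

struct demo_task {
	bool in_bpf_custody;		/* models SCX_TASK_OPS_ENQUEUED */
	int dequeue_calls;		/* how often ops.dequeue() fired */
	uint64_t last_deq_flags;	/* flags seen by the last call */
};

/* Stand-in for the BPF scheduler's ops.dequeue() callback. */
static void demo_ops_dequeue(struct demo_task *p, uint64_t flags)
{
	p->dequeue_calls++;
	p->last_deq_flags = flags;
}

/* Local and global DSQs are treated as terminal: no custody. */
static void demo_insert(struct demo_task *p, enum demo_dsq_kind kind)
{
	p->in_bpf_custody = (kind == DEMO_KIND_USER);
}

/* Mirrors the quoted hunk: notify only while the task is in custody. */
static void demo_dequeue(struct demo_task *p, uint64_t deq_flags)
{
	uint64_t flags = deq_flags;

	if (!p->in_bpf_custody)
		return;

	/* Neither sleep nor core-sched pick: treat as a property change. */
	if (!(deq_flags & (DEMO_DEQ_SLEEP | DEMO_DEQ_CORE_SCHED_EXEC)))
		flags |= DEMO_DEQ_SCHED_CHANGE;

	demo_ops_dequeue(p, flags);
	p->in_bpf_custody = false;
}
```

In this model, demo_insert(&t, DEMO_KIND_GLOBAL) followed by
demo_dequeue(&t, 0) never reaches demo_ops_dequeue(), so the ambiguous
"picked from global DSQ vs. SCHED_CHANGE" case cannot arise.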

I'll post a new patch set later today, so we can better discuss if all
these assumptions have been addressed properly. :)

> 
> > +
> > +			SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq, p, flags);
> > +			p->scx.flags &= ~SCX_TASK_OPS_ENQUEUED;
> > +		}
> >  		break;
> >  	case SCX_OPSS_QUEUEING:
> >  		/*
> 
> Thanks,
> Kuba

Thanks,
-Andrea