Message-ID: <aXxfBnzs5Xnwm8PS@gpd4>
Date: Fri, 30 Jan 2026 08:34:30 +0100
From: Andrea Righi <arighi@...dia.com>
To: Kuba Piecuch <jpiecuch@...gle.com>
Cc: Tejun Heo <tj@...nel.org>, David Vernet <void@...ifault.com>,
Changwoo Min <changwoo@...lia.com>,
Christian Loehle <christian.loehle@....com>,
Daniel Hodges <hodgesd@...a.com>, sched-ext@...ts.linux.dev,
linux-kernel@...r.kernel.org,
Emil Tsalapatis <emil@...alapatis.com>
Subject: Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics
Hi Kuba,
On Tue, Jan 27, 2026 at 04:41:43PM +0000, Kuba Piecuch wrote:
...
> > diff --git a/Documentation/scheduler/sched-ext.rst b/Documentation/scheduler/sched-ext.rst
> > index 404fe6126a769..ed6bf7d9e6e8c 100644
> > --- a/Documentation/scheduler/sched-ext.rst
> > +++ b/Documentation/scheduler/sched-ext.rst
> > @@ -252,6 +252,37 @@ The following briefly shows how a waking task is scheduled and executed.
> >
> > * Queue the task on the BPF side.
> >
> > + Once ``ops.enqueue()`` is called, the task enters the "enqueued state".
> > + The task remains in this state until ``ops.dequeue()`` is called, which
> > + happens in the following cases:
> > +
> > + 1. **Regular dispatch workflow**: when the task is successfully
> > + dispatched to a DSQ (local, global, or user DSQ), ``ops.dequeue()``
> > + is triggered immediately to notify the BPF scheduler.
> > +
> > + 2. **Core scheduling pick**: when ``CONFIG_SCHED_CORE`` is enabled and
> > + core scheduling picks a task for execution before it has been
> > + dispatched, ``ops.dequeue()`` is called with the
> > + ``SCX_DEQ_CORE_SCHED_EXEC`` flag.
> > +
> > + 3. **Scheduling property change**: when a task property changes (via
> > + operations like ``sched_setaffinity()``, ``sched_setscheduler()``,
> > + priority changes, CPU migrations, etc.), ``ops.dequeue()`` is called
> > + with the ``SCX_DEQ_SCHED_CHANGE`` flag set in ``deq_flags``.
> > +
> > + **Important**: ``ops.dequeue()`` is called for *any* enqueued task,
> > + regardless of whether the task is still on a BPF data structure, or it
> > + has already been dispatched to a DSQ. This guarantees that every
> > + ``ops.enqueue()`` will eventually be followed by a corresponding
> > + ``ops.dequeue()``.
>
> Not sure I follow this paragraph, specifically the first sentence
> (starting with ``ops.dequeue()`` is called ...).
> It seems to imply that a task that has already been dispatched to a DSQ still
> counts as enqueued, but the preceding text contradicts that by saying that
> a task is in an "enqueued state" from the time ops.enqueue() is called until
> (among other things) it's successfully dispatched to a DSQ.
>
> This would make sense if this paragraph used "enqueued" in the SCX_TASK_QUEUED
> sense, while the first paragraph used the SCX_OPSS_QUEUED sense, but if that's
> the case, it's quite confusing and should be clarified IMO.
Good point, the confusion is on my side: the documentation overloads the
term "enqueued" and doesn't clearly distinguish the different contexts.
In that paragraph, "enqueued" refers to the ops lifecycle (i.e., a task for
which ops.enqueue() has been called and whose scheduler-visible state is
being tracked), not to the task being queued on a DSQ or having
SCX_TASK_QUEUED set.
The intent is to treat ops.enqueue() and ops.dequeue() as the boundaries of
a scheduler-visible lifecycle, regardless of whether the task is eventually
queued on a DSQ or dispatched directly.
And as noted by Tejun in his last email, skipping ops.dequeue() for direct
dispatches also makes sense, since in that case no new ops lifecycle is
established (direct dispatch in ops.select_cpu() or ops.enqueue() can be
seen as a shortcut to bypass the scheduler).
I'll update the patch and documentation accordingly to make this
distinction more explicit.
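For illustration, the intended pairing as seen from the BPF side would be
something like this (hypothetical minimal scheduler with made-up names,
untested). Note that with this patch the dispatch dequeue fires right
after the DSQ insert below, so the counter effectively tracks the time a
task spends between ops.enqueue() and dispatch:

	static u64 nr_queued;	/* tasks currently inside an ops lifecycle */

	void BPF_STRUCT_OPS(sample_enqueue, struct task_struct *p, u64 enq_flags)
	{
		/* ops.enqueue() opens the lifecycle */
		__sync_fetch_and_add(&nr_queued, 1);
		scx_bpf_dsq_insert(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, enq_flags);
	}

	void BPF_STRUCT_OPS(sample_dequeue, struct task_struct *p, u64 deq_flags)
	{
		/*
		 * ops.dequeue() closes it. A SCX_DEQ_SCHED_CHANGE dequeue
		 * may follow the dispatch dequeue for the same cycle, so
		 * only balance the counter on the first notification.
		 */
		if (!(deq_flags & SCX_DEQ_SCHED_CHANGE))
			__sync_fetch_and_sub(&nr_queued, 1);
	}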
>
> > +
> > + This makes it reliable for BPF schedulers to track the enqueued state
> > + and maintain accurate accounting.
> > +
> > + BPF schedulers can choose not to implement ``ops.dequeue()`` if they
> > + don't need to track these transitions. The sched_ext core will safely
> > + handle all dequeue operations regardless.
> > +
> > 3. When a CPU is ready to schedule, it first looks at its local DSQ. If
> > empty, it then looks at the global DSQ. If there still isn't a task to
> > run, ``ops.dispatch()`` is invoked which can use the following two
> > @@ -319,6 +350,8 @@ by a sched_ext scheduler:
> > /* Any usable CPU becomes available */
> >
> > ops.dispatch(); /* Task is moved to a local DSQ */
> > +
> > + ops.dequeue(); /* Exiting BPF scheduler */
> > }
> > ops.running(); /* Task starts running on its assigned CPU */
> > while (task->scx.slice > 0 && task is runnable)
> > diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h
> > index bcb962d5ee7d8..59446cd0373fa 100644
> > --- a/include/linux/sched/ext.h
> > +++ b/include/linux/sched/ext.h
> > @@ -84,8 +84,19 @@ struct scx_dispatch_q {
> > /* scx_entity.flags */
> > enum scx_ent_flags {
> > SCX_TASK_QUEUED = 1 << 0, /* on ext runqueue */
> > + /*
> > + * Set when ops.enqueue() is called; used to determine if ops.dequeue()
> > + * should be invoked when transitioning out of SCX_OPSS_NONE state.
> > + */
> > + SCX_TASK_OPS_ENQUEUED = 1 << 1,
> > SCX_TASK_RESET_RUNNABLE_AT = 1 << 2, /* runnable_at should be reset */
> > SCX_TASK_DEQD_FOR_SLEEP = 1 << 3, /* last dequeue was for SLEEP */
> > + /*
> > + * Set when ops.dequeue() is called after successful dispatch; used to
> > + * distinguish dispatch dequeues from property change dequeues and
> > + * prevent duplicate dequeue calls.
> > + */
>
> What counts as a duplicate dequeue call? Looking at the code, we can clearly
> have ops.dequeue(SCHED_CHANGE) called after ops.dequeue(0) without an
> intervening call to ops.enqueue().
Yeah, SCHED_CHANGE dequeues are the exception, and it's acceptable to have
ops.dequeue(0) + ops.dequeue(SCHED_CHANGE). The idea is to catch potential
duplicate dispatch dequeues. I'll clarify this.
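Roughly, the per-cycle flag transitions with the patch as posted are:

	ops.enqueue()		OPS_ENQUEUED set, DISPATCH_DEQUEUED cleared
	dispatch to a DSQ	ops.dequeue(0), DISPATCH_DEQUEUED set
	property change		ops.dequeue(SCX_DEQ_SCHED_CHANGE),
	(after dispatch)	both flags cleared

and a second dispatch dequeue within the same cycle is exactly what the
DISPATCH_DEQUEUED test is there to suppress.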
>
> > + SCX_TASK_DISPATCH_DEQUEUED = 1 << 4,
> >
> > SCX_TASK_STATE_SHIFT = 8, /* bit 8 and 9 are used to carry scx_task_state */
> > SCX_TASK_STATE_BITS = 2,
> > diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> > index afe28c04d5aa7..18bca2b83f5c5 100644
> > --- a/kernel/sched/ext.c
> > +++ b/kernel/sched/ext.c
> > @@ -1287,6 +1287,20 @@ static void direct_dispatch(struct scx_sched *sch, struct task_struct *p,
> >
> > p->scx.ddsp_enq_flags |= enq_flags;
> >
> > + /*
> > + * The task is about to be dispatched. If ops.enqueue() was called,
> > + * notify the BPF scheduler by calling ops.dequeue().
> > + *
> > + * Keep %SCX_TASK_OPS_ENQUEUED set so that subsequent property
> > + * changes can trigger ops.dequeue() with %SCX_DEQ_SCHED_CHANGE.
> > + * Mark that the dispatch dequeue has been called to distinguish
> > + * from property change dequeues.
> > + */
> > + if (SCX_HAS_OP(sch, dequeue) && (p->scx.flags & SCX_TASK_OPS_ENQUEUED)) {
> > + SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq, p, 0);
> > + p->scx.flags |= SCX_TASK_DISPATCH_DEQUEUED;
> > + }
> > +
> > /*
> > * We are in the enqueue path with @rq locked and pinned, and thus can't
> > * double lock a remote rq and enqueue to its local DSQ. For
> > @@ -1391,6 +1405,16 @@ static void do_enqueue_task(struct rq *rq, struct task_struct *p, u64 enq_flags,
> > WARN_ON_ONCE(atomic_long_read(&p->scx.ops_state) != SCX_OPSS_NONE);
> > atomic_long_set(&p->scx.ops_state, SCX_OPSS_QUEUEING | qseq);
> >
> > + /*
> > + * Mark that ops.enqueue() is being called for this task.
> > + * Clear the dispatch dequeue flag for the new enqueue cycle.
> > + * Only track these flags if ops.dequeue() is implemented.
> > + */
> > + if (SCX_HAS_OP(sch, dequeue)) {
> > + p->scx.flags |= SCX_TASK_OPS_ENQUEUED;
> > + p->scx.flags &= ~SCX_TASK_DISPATCH_DEQUEUED;
> > + }
> > +
> > ddsp_taskp = this_cpu_ptr(&direct_dispatch_task);
> > WARN_ON_ONCE(*ddsp_taskp);
> > *ddsp_taskp = p;
> > @@ -1523,6 +1547,34 @@ static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags)
> >
> > switch (opss & SCX_OPSS_STATE_MASK) {
> > case SCX_OPSS_NONE:
> > + if (SCX_HAS_OP(sch, dequeue) &&
> > + p->scx.flags & SCX_TASK_OPS_ENQUEUED) {
> > + /*
> > + * Task was already dispatched. Only call ops.dequeue()
> > + * if it hasn't been called yet (check DISPATCH_DEQUEUED).
> > + * This can happen when:
> > + * 1. Core-sched picks a task that was dispatched
> > + * 2. Property changes occur after dispatch
> > + */
> > + if (!(p->scx.flags & SCX_TASK_DISPATCH_DEQUEUED)) {
> > + /*
> > + * ops.dequeue() wasn't called during dispatch.
> > + * This shouldn't normally happen, but call it now.
> > + */
>
> Should we add a warning here?
Good idea, I'll add a WARN_ON_ONCE().
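Something along these lines (untested):

	if (!(p->scx.flags & SCX_TASK_DISPATCH_DEQUEUED)) {
		/*
		 * The dispatch-time ops.dequeue() should have already
		 * run; flag the inconsistency but still notify the BPF
		 * scheduler to keep the enqueue/dequeue pairing
		 * balanced.
		 */
		WARN_ON_ONCE(1);
		SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq,
				 p, deq_flags);
	}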
>
> > + SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq,
> > + p, deq_flags);
> > + } else if (!(deq_flags & (DEQUEUE_SLEEP | SCX_DEQ_CORE_SCHED_EXEC))) {
> > + /*
> > + * This is a property change after
> > + * dispatch. Call ops.dequeue() again with
> > + * %SCX_DEQ_SCHED_CHANGE.
> > + */
> > + SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq,
> > + p, deq_flags | SCX_DEQ_SCHED_CHANGE);
> > + }
> > + p->scx.flags &= ~(SCX_TASK_OPS_ENQUEUED |
> > + SCX_TASK_DISPATCH_DEQUEUED);
> > + }
>
> If I understand the logic correctly, ops.dequeue(SCHED_CHANGE) will be called
> for a task at most once between it being dispatched and taken off the CPU,
> even if its properties are changed multiple times while it's on CPU.
> Is that intentional? I don't see it documented.
>
> To illustrate, assume we have a task p that has been enqueued, dispatched, and
> is currently running on the CPU, so we have both SCX_TASK_OPS_ENQUEUED and
> SCX_TASK_DISPATCH_DEQUEUED set in p->scx.flags.
>
> When a property of p is changed while it runs on the CPU,
> the sequence of calls is:
> dequeue_task_scx(p, DEQUEUE_SAVE) => put_prev_task_scx(p) =>
> (change property) => enqueue_task_scx(p, ENQUEUE_RESTORE) =>
> set_next_task_scx(p).
>
> dequeue_task_scx(p, DEQUEUE_SAVE) calls ops_dequeue() which calls
> ops.dequeue(p, ... | SCHED_CHANGE) and clears
> SCX_TASK_{OPS_ENQUEUED,DISPATCH_DEQUEUED} from p->scx.flags.
>
> put_prev_task_scx(p) doesn't do much because SCX_TASK_QUEUED was cleared by
> dequeue_task_scx().
>
> enqueue_task_scx(p, ENQUEUE_RESTORE) sets sticky_cpu because the task is
> currently running and ENQUEUE_RESTORE is set. This causes do_enqueue_task() to
> jump straight to local_norefill, skipping the call to ops.enqueue(), leaving
> SCX_TASK_OPS_ENQUEUED unset, and then enqueueing the task on the local DSQ.
>
> set_next_task_scx(p) calls ops_dequeue(p, SCX_DEQ_CORE_SCHED_EXEC) even though
> this is not a core-sched pick, but it won't do much because the ops_state is
> SCX_OPSS_NONE and SCX_TASK_OPS_ENQUEUED is unset. It also calls
> dispatch_dequeue(p), which then removes the task from the local DSQ it was just
> inserted into.
>
>
> So, we end up in a state where any subsequent property change while the task is
> still on CPU will not result in ops.dequeue(p, ... | SCHED_CHANGE) being
> called, because both SCX_TASK_OPS_ENQUEUED and SCX_TASK_DISPATCH_DEQUEUED are
> unset in p->scx.flags.
>
> I really hope I didn't mess anything up when tracing the code, but of course
> I'm happy to be corrected.
Correct. And the enqueue/dequeue balancing is preserved here. In the
scenario you describe, subsequent property changes while the task remains
running go through ENQUEUE_RESTORE, which intentionally skips
ops.enqueue(). Since no new enqueue cycle is started, there is no
corresponding ops.dequeue() to deliver either.
In other words, SCX_DEQ_SCHED_CHANGE is associated with invalidating the
scheduler state established by the last ops.enqueue(), not with every
individual property change. Multiple property changes while the task stays
on CPU are coalesced and the enqueue/dequeue pairing remains balanced.
I agree this distinction isn't obvious from the current documentation; I'll
clarify that SCX_DEQ_SCHED_CHANGE is edge-triggered per enqueue/run cycle,
not per property change.
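From the BPF side this means SCX_DEQ_SCHED_CHANGE should be read as "the
decision made at the last ops.enqueue() is stale", not as a per-change
notification. In ops.dequeue(), something like (hypothetical helper,
untested):

	if (deq_flags & SCX_DEQ_SCHED_CHANGE) {
		/* may stand for several coalesced property changes */
		invalidate_cached_hint(p);	/* hypothetical */
	}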
Do you see any practical use case where it'd be beneficial to tie
individual ops.dequeue() calls to every property change, as opposed to the
current coalesced behavior?
Thanks,
-Andrea