Message-ID: <aVKt-cmkOxPe3MI6@gpd4>
Date: Mon, 29 Dec 2025 17:36:09 +0100
From: Andrea Righi <arighi@...dia.com>
To: Emil Tsalapatis <emil@...alapatis.com>
Cc: Tejun Heo <tj@...nel.org>, David Vernet <void@...ifault.com>,
Changwoo Min <changwoo@...lia.com>,
Daniel Hodges <hodgesd@...a.com>, sched-ext@...ts.linux.dev,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics
Hi Emil,
On Sat, Dec 27, 2025 at 10:20:06PM -0500, Emil Tsalapatis wrote:
> On Fri Dec 19, 2025 at 5:43 PM EST, Andrea Righi wrote:
> > Properly implement ops.dequeue() to ensure every ops.enqueue() is
> > balanced by a corresponding ops.dequeue() call, regardless of whether
> > the task is on a BPF data structure or already dispatched to a DSQ.
> >
> > A task is considered enqueued when it is owned by the BPF scheduler.
> > This ownership persists until the task is either dispatched (moved to a
> > local DSQ for execution) or removed from the BPF scheduler, such as when
> > it blocks waiting for an event or when its properties (for example, CPU
> > affinity or priority) are updated.
> >
> > When the task enters the BPF scheduler, ops.enqueue() is invoked; when
> > it leaves BPF scheduler ownership, ops.dequeue() is invoked.
> >
> > This allows BPF schedulers to reliably track task ownership and maintain
> > accurate accounting.
> >
> > Cc: Emil Tsalapatis <emil@...alapatis.com>
> > Signed-off-by: Andrea Righi <arighi@...dia.com>
> > ---
>
>
> Hi Andrea,
>
> This change looks reasonable to me. Some comments inline:
>
> > Documentation/scheduler/sched-ext.rst | 22 ++++++++++++++++++++++
> > include/linux/sched/ext.h | 1 +
> > kernel/sched/ext.c | 27 ++++++++++++++++++++++++++-
> > 3 files changed, 49 insertions(+), 1 deletion(-)
> >
> > diff --git a/Documentation/scheduler/sched-ext.rst b/Documentation/scheduler/sched-ext.rst
> > index 404fe6126a769..3ed4be53f97da 100644
> > --- a/Documentation/scheduler/sched-ext.rst
> > +++ b/Documentation/scheduler/sched-ext.rst
> > @@ -252,6 +252,26 @@ The following briefly shows how a waking task is scheduled and executed.
> >
> > * Queue the task on the BPF side.
> >
> > + Once ``ops.enqueue()`` is called, the task is considered "enqueued" and
> > + is owned by the BPF scheduler. Ownership is retained until the task is
> > + either dispatched (moved to a local DSQ for execution) or dequeued
> > + (removed from the scheduler due to a blocking event, or to modify a
> > + property, like CPU affinity, priority, etc.). When the task leaves the
> > + BPF scheduler, ``ops.dequeue()`` is invoked.
> > +
>
> Can we say "leaves the scx class" instead? On direct dispatch we
> technically never insert the task into the BPF scheduler.
Hmm, I agree that'd be more accurate, but it might also be slightly
misleading, as it could be interpreted as the task being moved to a
different scheduling class. How about saying "leaves the enqueued state"
instead, where "enqueued" means that ops.enqueue() has been called? I can't
find a better name for this state; something like "ops_enqueued" would be
even more confusing. :)
>
> > + **Important**: ``ops.dequeue()`` is called for *any* enqueued task,
> > + regardless of whether the task is still on a BPF data structure or has
> > + already been dispatched to a DSQ (global, local, or user DSQ).
> > +
> > + This guarantees that every ``ops.enqueue()`` will eventually be followed
> > + by an ``ops.dequeue()``. This makes it reliable for BPF schedulers to
> > + track task ownership and maintain accurate accounting, such as per-DSQ
> > + queued runtime sums.
> > +
> > + BPF schedulers can choose not to implement ``ops.dequeue()`` if they
> > + don't need to track these transitions. The sched_ext core will safely
> > + handle all dequeue operations regardless.
> > +
> > 3. When a CPU is ready to schedule, it first looks at its local DSQ. If
> > empty, it then looks at the global DSQ. If there still isn't a task to
> > run, ``ops.dispatch()`` is invoked which can use the following two
> > @@ -319,6 +339,8 @@ by a sched_ext scheduler:
> > /* Any usable CPU becomes available */
> >
> > ops.dispatch(); /* Task is moved to a local DSQ */
> > +
> > + ops.dequeue(); /* Exiting BPF scheduler */
> > }
> > ops.running(); /* Task starts running on its assigned CPU */
> > while (task->scx.slice > 0 && task is runnable)
> > diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h
> > index bcb962d5ee7d8..334c3692a9c62 100644
> > --- a/include/linux/sched/ext.h
> > +++ b/include/linux/sched/ext.h
> > @@ -84,6 +84,7 @@ struct scx_dispatch_q {
> > /* scx_entity.flags */
> > enum scx_ent_flags {
> > SCX_TASK_QUEUED = 1 << 0, /* on ext runqueue */
> > + SCX_TASK_OPS_ENQUEUED = 1 << 1, /* ops.enqueue() was called */
>
> Can we rename this flag? For direct dispatch we never got enqueued.
> Something like "DEQ_ON_DISPATCH" would show the purpose of the
> flag more clearly.
Good point. However, ops.dequeue() isn't only called on dispatch; it can
also be triggered when a task property is changed.
So the flag should represent the "enqueued state" in the sense that
ops.enqueue() has been called and a corresponding ops.dequeue() is
expected. This is a lifecycle state, not an indication that the task is in
any queue.
Would a more descriptive comment clarify this? Something like:
SCX_TASK_OPS_ENQUEUED = 1 << 1, /* Task in enqueued state: ops.enqueue()
called, ops.dequeue() will be called
when task leaves this state. */
>
> > SCX_TASK_RESET_RUNNABLE_AT = 1 << 2, /* runnable_at should be reset */
> > SCX_TASK_DEQD_FOR_SLEEP = 1 << 3, /* last dequeue was for SLEEP */
> >
> > diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> > index 94164f2dec6dc..985d75d374385 100644
> > --- a/kernel/sched/ext.c
> > +++ b/kernel/sched/ext.c
> > @@ -1390,6 +1390,9 @@ static void do_enqueue_task(struct rq *rq, struct task_struct *p, u64 enq_flags,
> > WARN_ON_ONCE(atomic_long_read(&p->scx.ops_state) != SCX_OPSS_NONE);
> > atomic_long_set(&p->scx.ops_state, SCX_OPSS_QUEUEING | qseq);
> >
> > + /* Mark that ops.enqueue() is being called for this task */
> > + p->scx.flags |= SCX_TASK_OPS_ENQUEUED;
> > +
>
> Can we avoid setting this flag when we have no .dequeue() method?
> Otherwise it stays set forever AFAICT, even after the task has been
> sent to the runqueues.
Good catch! We definitely don't need to set this for schedulers that don't
implement ops.dequeue().
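Something along these lines (untested):

```c
	/*
	 * Only track the enqueued state if the scheduler implements
	 * ops.dequeue(), otherwise the flag would never be cleared.
	 */
	if (SCX_HAS_OP(sch, dequeue))
		p->scx.flags |= SCX_TASK_OPS_ENQUEUED;
```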
>
> > ddsp_taskp = this_cpu_ptr(&direct_dispatch_task);
> > WARN_ON_ONCE(*ddsp_taskp);
> > *ddsp_taskp = p;
> > @@ -1522,6 +1525,21 @@ static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags)
> >
> > switch (opss & SCX_OPSS_STATE_MASK) {
> > case SCX_OPSS_NONE:
> > + /*
> > + * Task is not currently being enqueued or queued on the BPF
> > + * scheduler. Check if ops.enqueue() was called for this task.
> > + */
> > + if ((p->scx.flags & SCX_TASK_OPS_ENQUEUED) &&
> > + SCX_HAS_OP(sch, dequeue)) {
> > + /*
> > + * ops.enqueue() was called and the task was dispatched.
> > + * Call ops.dequeue() to notify the BPF scheduler that
> > + * the task is leaving.
> > + */
> > + SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq,
> > + p, deq_flags);
> > + p->scx.flags &= ~SCX_TASK_OPS_ENQUEUED;
> > + }
> > break;
> > case SCX_OPSS_QUEUEING:
> > /*
> > @@ -1530,9 +1548,16 @@ static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags)
> > */
> > BUG();
> > case SCX_OPSS_QUEUED:
> > - if (SCX_HAS_OP(sch, dequeue))
> > + /*
> > + * Task is owned by the BPF scheduler. Call ops.dequeue()
> > + * to notify the BPF scheduler that the task is being
> > + * removed.
> > + */
> > + if (SCX_HAS_OP(sch, dequeue)) {
>
> Edge case, but if we have a .dequeue() method but not an .enqueue() we
> still make this call. Can we add flags & SCX_TASK_OPS_ENQUEUED as an
> extra condition to be consistent with the SCX_OPSS_NONE case above?
Also good catch. Will add that.
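i.e., something like (untested):

```c
	case SCX_OPSS_QUEUED:
		/*
		 * Only notify the BPF scheduler if ops.enqueue() was
		 * actually called for this task.
		 */
		if ((p->scx.flags & SCX_TASK_OPS_ENQUEUED) &&
		    SCX_HAS_OP(sch, dequeue)) {
			SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq,
					 p, deq_flags);
			p->scx.flags &= ~SCX_TASK_OPS_ENQUEUED;
		}
```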
>
> > SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq,
> > p, deq_flags);
> > + p->scx.flags &= ~SCX_TASK_OPS_ENQUEUED;
> > + }
> >
> > if (atomic_long_try_cmpxchg(&p->scx.ops_state, &opss,
> > SCX_OPSS_NONE))
>
Thanks,
-Andrea