Message-Id: <DFAX1LUAMTP9.J043T7TQQKXX@etsalapatis.com>
Date: Mon, 29 Dec 2025 13:35:48 -0500
From: "Emil Tsalapatis" <emil@...alapatis.com>
To: "Andrea Righi" <arighi@...dia.com>
Cc: "Tejun Heo" <tj@...nel.org>, "David Vernet" <void@...ifault.com>,
 "Changwoo Min" <changwoo@...lia.com>, "Daniel Hodges" <hodgesd@...a.com>,
 <sched-ext@...ts.linux.dev>, <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics

On Mon Dec 29, 2025 at 11:36 AM EST, Andrea Righi wrote:
> Hi Emil,
>
> On Sat, Dec 27, 2025 at 10:20:06PM -0500, Emil Tsalapatis wrote:
>> On Fri Dec 19, 2025 at 5:43 PM EST, Andrea Righi wrote:
>> > Properly implement ops.dequeue() to ensure every ops.enqueue() is
>> > balanced by a corresponding ops.dequeue() call, regardless of whether
>> > the task is on a BPF data structure or already dispatched to a DSQ.
>> >
>> > A task is considered enqueued when it is owned by the BPF scheduler.
>> > This ownership persists until the task is either dispatched (moved to a
>> > local DSQ for execution) or removed from the BPF scheduler, such as when
>> > it blocks waiting for an event or when its properties (for example, CPU
>> > affinity or priority) are updated.
>> >
>> > When the task enters the BPF scheduler, ops.enqueue() is invoked; when it
>> > leaves BPF scheduler ownership, ops.dequeue() is invoked.
>> >
>> > This allows BPF schedulers to reliably track task ownership and maintain
>> > accurate accounting.
>> >
>> > Cc: Emil Tsalapatis <emil@...alapatis.com>
>> > Signed-off-by: Andrea Righi <arighi@...dia.com>
>> > ---
>> 
>> 
>> Hi Andrea,
>> 
>> 	This change looks reasonable to me. Some comments inline:
>> 
>> >  Documentation/scheduler/sched-ext.rst | 22 ++++++++++++++++++++++
>> >  include/linux/sched/ext.h             |  1 +
>> >  kernel/sched/ext.c                    | 27 ++++++++++++++++++++++++++-
>> >  3 files changed, 49 insertions(+), 1 deletion(-)
>> >
>> > diff --git a/Documentation/scheduler/sched-ext.rst b/Documentation/scheduler/sched-ext.rst
>> > index 404fe6126a769..3ed4be53f97da 100644
>> > --- a/Documentation/scheduler/sched-ext.rst
>> > +++ b/Documentation/scheduler/sched-ext.rst
>> > @@ -252,6 +252,26 @@ The following briefly shows how a waking task is scheduled and executed.
>> >  
>> >     * Queue the task on the BPF side.
>> >  
>> > +   Once ``ops.enqueue()`` is called, the task is considered "enqueued" and
>> > +   is owned by the BPF scheduler. Ownership is retained until the task is
>> > +   either dispatched (moved to a local DSQ for execution) or dequeued
>> > +   (removed from the scheduler due to a blocking event, or to modify a
>> > +   property, like CPU affinity, priority, etc.). When the task leaves the
>> > +   BPF scheduler ``ops.dequeue()`` is invoked.
>> > +
>> 
>> Can we say "leaves the scx class" instead? On direct dispatch we
>> technically never insert the task into the BPF scheduler.
>
> Hm.. I agree that'd be more accurate, but it might also be slightly
> misleading, as it could be interpreted as the task being moved to a
> different scheduling class. How about saying "leaves the enqueued state"
> instead, where enqueued means ops.enqueue() being called... I can't find a
> better name for this state, like "ops_enqueued", but that'd be even more
> confusing. :)
>

I like "leaves the enqueued state", it implies that the task has no
state in the scx scheduler.
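
For what it's worth, here is a minimal sketch of the kind of accounting
this guarantee enables (purely illustrative, assuming the usual scx BPF
common headers; the scheduler and variable names are made up):

	/* Tasks currently in the enqueued state, i.e. owned by the scheduler. */
	static u64 nr_owned;

	void BPF_STRUCT_OPS(example_enqueue, struct task_struct *p, u64 enq_flags)
	{
		__sync_fetch_and_add(&nr_owned, 1);
		scx_bpf_dsq_insert(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, enq_flags);
	}

	void BPF_STRUCT_OPS(example_dequeue, struct task_struct *p, u64 deq_flags)
	{
		/* Balanced against enqueue even if the task already hit a DSQ. */
		__sync_fetch_and_sub(&nr_owned, 1);
	}

With ops.dequeue() firing even after the task has been dispatched, the
counter cannot leak.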

>> 
>> > +   **Important**: ``ops.dequeue()`` is called for *any* enqueued task,
>> > +   regardless of whether the task is still on a BPF data structure, or it
>> > +   is already dispatched to a DSQ (global, local, or user DSQ)
>> > +
>> > +   This guarantees that every ``ops.enqueue()`` will eventually be followed
>> > +   by a ``ops.dequeue()``. This makes it reliable for BPF schedulers to
>> > +   track task ownership and maintain accurate accounting, such as per-DSQ
>> > +   queued runtime sums.
>> > +
>> > +   BPF schedulers can choose not to implement ``ops.dequeue()`` if they
>> > +   don't need to track these transitions. The sched_ext core will safely
>> > +   handle all dequeue operations regardless.
>> > +
>> >  3. When a CPU is ready to schedule, it first looks at its local DSQ. If
>> >     empty, it then looks at the global DSQ. If there still isn't a task to
>> >     run, ``ops.dispatch()`` is invoked which can use the following two
>> > @@ -319,6 +339,8 @@ by a sched_ext scheduler:
>> >                  /* Any usable CPU becomes available */
>> >  
>> >                  ops.dispatch(); /* Task is moved to a local DSQ */
>> > +
>> > +                ops.dequeue(); /* Exiting BPF scheduler */
>> >              }
>> >              ops.running();      /* Task starts running on its assigned CPU */
>> >              while (task->scx.slice > 0 && task is runnable)
>> > diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h
>> > index bcb962d5ee7d8..334c3692a9c62 100644
>> > --- a/include/linux/sched/ext.h
>> > +++ b/include/linux/sched/ext.h
>> > @@ -84,6 +84,7 @@ struct scx_dispatch_q {
>> >  /* scx_entity.flags */
>> >  enum scx_ent_flags {
>> >  	SCX_TASK_QUEUED		= 1 << 0, /* on ext runqueue */
>> > +	SCX_TASK_OPS_ENQUEUED	= 1 << 1, /* ops.enqueue() was called */
>> 
>> Can we rename this flag? For direct dispatch we never got enqueued.
>> Something like "DEQ_ON_DISPATCH" would show the purpose of the
>> flag more clearly.
>
> Good point. However, ops.dequeue() isn't only called on dispatch, it can
> also be triggered when a task property is changed.
>
> So the flag should represent the "enqueued state" in the sense that
> ops.enqueue() has been called and a corresponding ops.dequeue() is
> expected. This is a lifecycle state, not an indication that the task is in
> any queue.
>
> Would a more descriptive comment clarify this? Something like:
>
>   SCX_TASK_OPS_ENQUEUED = 1 << 1, /* Task in enqueued state: ops.enqueue()
>                                      called, ops.dequeue() will be called
>                                      when task leaves this state. */
>

That makes sense. My reasoning was that what we actually use the flag for
is not whether the task is enqueued, but rather whether we need to call
the dequeue callback when dequeuing from the SCX_OPSS_NONE state. Can the
comment maybe explain this more concretely?
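
For the enqueue side, something like the below is roughly what I have in
mind (rough sketch against this patch, untested), in line with my comment
further down about not setting the flag when there is no .dequeue():

	/*
	 * Set the flag only when a matching ops.dequeue() call will
	 * actually be made later, so it cannot stay set forever on
	 * schedulers that don't implement .dequeue().
	 */
	if (SCX_HAS_OP(sch, dequeue))
		p->scx.flags |= SCX_TASK_OPS_ENQUEUED;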

As an aside, I think this change makes it so we can remove the _OPSS_ state 
machine with some more refactoring. 

>> 
>> >  	SCX_TASK_RESET_RUNNABLE_AT = 1 << 2, /* runnable_at should be reset */
>> >  	SCX_TASK_DEQD_FOR_SLEEP	= 1 << 3, /* last dequeue was for SLEEP */
>> >  
>> > diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
>> > index 94164f2dec6dc..985d75d374385 100644
>> > --- a/kernel/sched/ext.c
>> > +++ b/kernel/sched/ext.c
>> > @@ -1390,6 +1390,9 @@ static void do_enqueue_task(struct rq *rq, struct task_struct *p, u64 enq_flags,
>> >  	WARN_ON_ONCE(atomic_long_read(&p->scx.ops_state) != SCX_OPSS_NONE);
>> >  	atomic_long_set(&p->scx.ops_state, SCX_OPSS_QUEUEING | qseq);
>> >  
>> > +	/* Mark that ops.enqueue() is being called for this task */
>> > +	p->scx.flags |= SCX_TASK_OPS_ENQUEUED;
>> > +
>> 
>> Can we avoid setting this flag when we have no .dequeue() method?
>> Otherwise it stays set forever AFAICT, even after the task has been
>> sent to the runqueues.
>
> Good catch! Definitely we don't need to set this for schedulers that don't
> implement ops.dequeue().
>
>> 
>> >  	ddsp_taskp = this_cpu_ptr(&direct_dispatch_task);
>> >  	WARN_ON_ONCE(*ddsp_taskp);
>> >  	*ddsp_taskp = p;
>> > @@ -1522,6 +1525,21 @@ static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags)
>> >  
>> >  	switch (opss & SCX_OPSS_STATE_MASK) {
>> >  	case SCX_OPSS_NONE:
>> > +		/*
>> > +		 * Task is not currently being enqueued or queued on the BPF
>> > +		 * scheduler. Check if ops.enqueue() was called for this task.
>> > +		 */
>> > +		if ((p->scx.flags & SCX_TASK_OPS_ENQUEUED) &&
>> > +		    SCX_HAS_OP(sch, dequeue)) {
>> > +			/*
>> > +			 * ops.enqueue() was called and the task was dispatched.
>> > +			 * Call ops.dequeue() to notify the BPF scheduler that
>> > +			 * the task is leaving.
>> > +			 */
>> > +			SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq,
>> > +					 p, deq_flags);
>> > +			p->scx.flags &= ~SCX_TASK_OPS_ENQUEUED;
>> > +		}
>> >  		break;
>> >  	case SCX_OPSS_QUEUEING:
>> >  		/*
>> > @@ -1530,9 +1548,16 @@ static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags)
>> >  		 */
>> >  		BUG();
>> >  	case SCX_OPSS_QUEUED:
>> > -		if (SCX_HAS_OP(sch, dequeue))
>> > +		/*
>> > +		 * Task is owned by the BPF scheduler. Call ops.dequeue()
>> > +		 * to notify the BPF scheduler that the task is being
>> > +		 * removed.
>> > +		 */
>> > +		if (SCX_HAS_OP(sch, dequeue)) {
>> 
>> Edge case, but if we have a .dequeue() method but not an .enqueue() we
>> still make this call. Can we add flags & SCX_TASK_OPS_ENQUEUED as an 
>> extra condition to be consistent with the SCX_OPSS_NONE case above?
>
> Also good catch. Will add that.
>
>> 
>> >  			SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq,
>> >  					 p, deq_flags);
>> > +			p->scx.flags &= ~SCX_TASK_OPS_ENQUEUED;
>> > +		}
>> >  
>> >  		if (atomic_long_try_cmpxchg(&p->scx.ops_state, &opss,
>> >  					    SCX_OPSS_NONE))
>> 
>
> Thanks,
> -Andrea

