Message-ID: <aYUMeueUpqakv8lR@gpd4>
Date: Thu, 5 Feb 2026 22:32:42 +0100
From: Andrea Righi <arighi@...dia.com>
To: Kuba Piecuch <jpiecuch@...gle.com>
Cc: Tejun Heo <tj@...nel.org>, David Vernet <void@...ifault.com>,
Changwoo Min <changwoo@...lia.com>,
Emil Tsalapatis <emil@...alapatis.com>,
Christian Loehle <christian.loehle@....com>,
Daniel Hodges <hodgesd@...a.com>, sched-ext@...ts.linux.dev,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics

Hi Kuba,

On Thu, Feb 05, 2026 at 07:29:42PM +0000, Kuba Piecuch wrote:
> Hi Andrea,
>
> On Thu Feb 5, 2026 at 3:32 PM UTC, Andrea Righi wrote:
> > Currently, ops.dequeue() is only invoked when the sched_ext core knows
> > that a task resides in BPF-managed data structures, which causes it to
> > miss scheduling property change events. In addition, ops.dequeue()
> > callbacks are completely skipped when tasks are dispatched to non-local
> > DSQs from ops.select_cpu(). As a result, BPF schedulers cannot reliably
> > track task state.
> >
> > Fix this by guaranteeing that each task entering the BPF scheduler's
> > custody triggers exactly one ops.dequeue() call when it leaves that
> > custody, whether the exit is due to a dispatch (regular or via a core
> > scheduling pick) or to a scheduling property change (e.g.
> > sched_setaffinity(), sched_setscheduler(), set_user_nice(), NUMA
> > balancing, etc.).
> >
> > BPF scheduler custody concept: a task is considered to be in "BPF
> > scheduler's custody" when it has been queued in user-created DSQs and
> > the BPF scheduler is responsible for its lifecycle. Custody ends when
> > the task is dispatched to a terminal DSQ (local DSQ or SCX_DSQ_GLOBAL),
> > selected by core scheduling, or removed due to a property change.
>
> Strictly speaking, a task in BPF scheduler custody doesn't have to be queued
> in a user-created DSQ. It could just reside on some custom data structure.
Yeah... we definitely need to consider internal BPF queues.
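For example (purely illustrative and untested, not part of this patch), a
scheduler could park tasks in a BPF queue map from ops.enqueue() and only
dispatch them from ops.dispatch(); those tasks never sit on a user-created
DSQ, yet they are in the BPF scheduler's custody the whole time and still
need the ops.dequeue() notification:

  /*
   * Hypothetical sketch: tasks are parked in an internal BPF queue map,
   * no user-created DSQ involved. Error handling is mostly elided.
   */
  #include <scx/common.bpf.h>

  char _license[] SEC("license") = "GPL";

  struct {
          __uint(type, BPF_MAP_TYPE_QUEUE);
          __uint(max_entries, 4096);
          __type(value, s32);                     /* parked task pids */
  } parked SEC(".maps");

  void BPF_STRUCT_OPS(iq_enqueue, struct task_struct *p, u64 enq_flags)
  {
          s32 pid = p->pid;

          /* Fall back to the global DSQ if the internal queue is full. */
          if (bpf_map_push_elem(&parked, &pid, 0))
                  scx_bpf_dsq_insert(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL,
                                     enq_flags);
  }

  void BPF_STRUCT_OPS(iq_dispatch, s32 cpu, struct task_struct *prev)
  {
          struct task_struct *p;
          s32 pid;

          if (bpf_map_pop_elem(&parked, &pid))
                  return;

          p = bpf_task_from_pid(pid);
          if (!p)
                  return;
          scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, 0);
          bpf_task_release(p);
  }

  void BPF_STRUCT_OPS(iq_dequeue, struct task_struct *p, u64 deq_flags)
  {
          /*
           * With the proposed semantics this fires when a parked task
           * leaves custody, so its stale pid can simply be skipped when
           * popped later.
           */
  }

  SEC(".struct_ops.link")
  struct sched_ext_ops internal_queue_ops = {
          .enqueue        = (void *)iq_enqueue,
          .dispatch       = (void *)iq_dispatch,
          .dequeue        = (void *)iq_dequeue,
          .name           = "internal_queue",
  };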
>
> >
> > Tasks directly dispatched to terminal DSQs bypass the BPF scheduler
> > entirely and are not in its custody. Terminal DSQs include:
> > - Local DSQs (%SCX_DSQ_LOCAL or %SCX_DSQ_LOCAL_ON): per-CPU queues
> > where tasks go directly to execution.
> > - Global DSQ (%SCX_DSQ_GLOBAL): the built-in fallback queue where the
> > BPF scheduler is considered "done" with the task.
> >
> > As a result, ops.dequeue() is not invoked for tasks dispatched to
> > terminal DSQs, as the BPF scheduler no longer retains custody of them.
>
> Shouldn't it be "directly dispatched to terminal DSQs"?
Ack.
>
> >
> > To identify dequeues triggered by scheduling property changes, introduce
> > the new ops.dequeue() flag %SCX_DEQ_SCHED_CHANGE: when this flag is set,
> > the dequeue was caused by a scheduling property change.
> >
> > New ops.dequeue() semantics:
> > - ops.dequeue() is invoked exactly once when the task leaves the BPF
> > scheduler's custody, in one of the following cases:
> > a) regular dispatch: a task dispatched to a user DSQ is moved to a
> > terminal DSQ (ops.dequeue() called without any special flags set),
>
> I don't think the task has to be on a user DSQ. How about just "a task in BPF
> scheduler's custody is dispatched to a terminal DSQ from ops.dispatch()"?
Right.
>
> > b) core scheduling dispatch: core-sched picks task before dispatch,
> > ops.dequeue() called with %SCX_DEQ_CORE_SCHED_EXEC flag set,
> > c) property change: task properties modified before dispatch,
> > ops.dequeue() called with %SCX_DEQ_SCHED_CHANGE flag set.
> >
> > This allows BPF schedulers to:
> > - reliably track task ownership and lifecycle,
> > - maintain accurate accounting of managed tasks,
> > - update internal state when tasks change properties.
> >
> ...
> > diff --git a/Documentation/scheduler/sched-ext.rst b/Documentation/scheduler/sched-ext.rst
> > index 404fe6126a769..ccd1fad3b3b92 100644
> > --- a/Documentation/scheduler/sched-ext.rst
> > +++ b/Documentation/scheduler/sched-ext.rst
> > @@ -252,6 +252,57 @@ The following briefly shows how a waking task is scheduled and executed.
> >
> > * Queue the task on the BPF side.
> >
> > + **Task State Tracking and ops.dequeue() Semantics**
> > +
> > + Once ``ops.select_cpu()`` or ``ops.enqueue()`` is called, the task may
> > + enter the "BPF scheduler's custody" depending on where it's dispatched:
> > +
> > + * **Direct dispatch to terminal DSQs** (``SCX_DSQ_LOCAL``,
> > + ``SCX_DSQ_LOCAL_ON | cpu``, or ``SCX_DSQ_GLOBAL``): The BPF scheduler
> > + is done with the task - it either goes straight to a CPU's local run
> > + queue or to the global DSQ as a fallback. The task never enters (or
> > + exits) BPF custody, and ``ops.dequeue()`` will not be called.
> > +
> > + * **Dispatch to user-created DSQs** (custom DSQs): the task enters the
> > + BPF scheduler's custody. When the task later leaves BPF custody
> > + (dispatched to a terminal DSQ, picked by core-sched, or dequeued for
> > + sleep/property changes), ``ops.dequeue()`` will be called exactly once.
> > +
> > + * **Queued on BPF side**: The task is in BPF data structures and in BPF
> > + custody; ``ops.dequeue()`` will be called when it leaves.
> > +
> > + The key principle: **ops.dequeue() is called when a task leaves the BPF
> > + scheduler's custody**.
> > +
> > + This also works with the ``ops.select_cpu()`` direct dispatch
> > + optimization: even though it skips ``ops.enqueue()`` invocation, if the
> > + task is dispatched to a user-created DSQ, it enters BPF custody and will
> > + get ``ops.dequeue()`` when it leaves. If dispatched to a terminal DSQ,
> > + the BPF scheduler is done with it immediately. This provides the
> > + performance benefit of avoiding the ``ops.enqueue()`` roundtrip while
> > + maintaining correct state tracking.
> > +
> > + The dequeue can happen for different reasons, distinguished by flags:
> > +
> > + 1. **Regular dispatch workflow**: when the task is dispatched from a
> > + user-created DSQ to a terminal DSQ (leaving BPF custody for execution),
> > + ``ops.dequeue()`` is triggered without any special flags.
>
> There's no requirement for the task to be on a user-created DSQ.
Ditto.
>
> > +
> > + 2. **Core scheduling pick**: when ``CONFIG_SCHED_CORE`` is enabled and
> > + core scheduling picks a task for execution while it's still in BPF
> > + custody, ``ops.dequeue()`` is called with the
> > + ``SCX_DEQ_CORE_SCHED_EXEC`` flag.
> > +
> > + 3. **Scheduling property change**: when a task property changes (via
> > + operations like ``sched_setaffinity()``, ``sched_setscheduler()``,
> > + priority changes, CPU migrations, etc.) while the task is still in
> > + BPF custody, ``ops.dequeue()`` is called with the
> > + ``SCX_DEQ_SCHED_CHANGE`` flag set in ``deq_flags``.
> > +
> > + **Important**: Once a task has left BPF custody (dispatched to a
> > + terminal DSQ), property changes will not trigger ``ops.dequeue()``,
> > + since the task is no longer being managed by the BPF scheduler.
> > +
> > 3. When a CPU is ready to schedule, it first looks at its local DSQ. If
> > empty, it then looks at the global DSQ. If there still isn't a task to
> > run, ``ops.dispatch()`` is invoked which can use the following two
> ...
> > diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h
> > index bcb962d5ee7d8..35a88942810b4 100644
> > --- a/include/linux/sched/ext.h
> > +++ b/include/linux/sched/ext.h
> > @@ -84,6 +84,7 @@ struct scx_dispatch_q {
> > /* scx_entity.flags */
> > enum scx_ent_flags {
> > SCX_TASK_QUEUED = 1 << 0, /* on ext runqueue */
> > + SCX_TASK_NEED_DEQ = 1 << 1, /* task needs ops.dequeue() */
>
> I think this could use a comment that connects this flag to the concept of
> BPF custody, so how about something like "task is in BPF custody, needs
> ops.dequeue() when leaving it"?
Ack.
>
> > SCX_TASK_RESET_RUNNABLE_AT = 1 << 2, /* runnable_at should be reset */
> > SCX_TASK_DEQD_FOR_SLEEP = 1 << 3, /* last dequeue was for SLEEP */
> >
> > diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> > index 0bb8fa927e9e9..9ebca357196b4 100644
> > --- a/kernel/sched/ext.c
> > +++ b/kernel/sched/ext.c
> ...
> > @@ -1103,6 +1125,27 @@ static void dispatch_enqueue(struct scx_sched *sch, struct scx_dispatch_q *dsq,
> > dsq_mod_nr(dsq, 1);
> > p->scx.dsq = dsq;
> >
> > + /*
> > + * Handle ops.dequeue() and custody tracking.
> > + *
> > + * Builtin DSQs (local, global, bypass) are terminal: the BPF
> > + * scheduler is done with the task. If it was in BPF custody, call
> > + * ops.dequeue() and clear the flag.
> > + *
> > + * User DSQs: Task is in BPF scheduler's custody. Set the flag so
> > + * ops.dequeue() will be called when it leaves.
> > + */
> > + if (SCX_HAS_OP(sch, dequeue)) {
> > + if (is_terminal_dsq(dsq->id)) {
> > + if (p->scx.flags & SCX_TASK_NEED_DEQ)
> > + SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue,
> > + rq, p, 0);
> > + p->scx.flags &= ~SCX_TASK_NEED_DEQ;
> > + } else {
> > + p->scx.flags |= SCX_TASK_NEED_DEQ;
> > + }
> > + }
> > +
>
> This is the only place where I see SCX_TASK_NEED_DEQ being set, which means
> it won't be set if the enqueued task is queued on the BPF scheduler's internal
> data structures rather than dispatched to a user-created DSQ. I don't think
> that's the behavior we're aiming for.
Right, I'll implement the right behavior (calling ops.dequeue()) for tasks
stored in internal BPF queues.
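Roughly, the idea would be to mark custody when the task is handed to
ops.enqueue(), instead of only when it lands on a user DSQ in
dispatch_enqueue(). Sketch only, the exact placement may look different
in v2:

  /*
   * Rough sketch, placement illustrative: set the custody flag before
   * ops.enqueue() runs, so it also covers tasks that the BPF scheduler
   * keeps in internal structures. dispatch_enqueue() keeps clearing it
   * once the task reaches a terminal DSQ.
   */
  if (SCX_HAS_OP(sch, dequeue))
          p->scx.flags |= SCX_TASK_NEED_DEQ;

  SCX_CALL_OP_TASK(sch, SCX_KF_ENQUEUE, enqueue, rq, p, enq_flags);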
>
> > @@ -1524,6 +1579,24 @@ static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags)
> >
> > switch (opss & SCX_OPSS_STATE_MASK) {
> > case SCX_OPSS_NONE:
> > + /*
> > + * Task is not in BPF data structures (either dispatched to
> > + * a DSQ or running). Only call ops.dequeue() if the task
> > + * is still in BPF scheduler's custody (%SCX_TASK_NEED_DEQ
> > + * is set).
> > + *
> > + * If the task has already been dispatched to a terminal
> > + * DSQ (local DSQ or %SCX_DSQ_GLOBAL), it has left the BPF
> > + * scheduler's custody and the flag will be clear, so we
> > + * skip ops.dequeue().
> > + *
> > + * If this is a property change (not sleep/core-sched) and
> > + * the task is still in BPF custody, set the
> > + * %SCX_DEQ_SCHED_CHANGE flag.
> > + */
> > + if (SCX_HAS_OP(sch, dequeue) &&
> > + (p->scx.flags & SCX_TASK_NEED_DEQ))
> > + call_task_dequeue(sch, rq, p, deq_flags);
> > break;
> > case SCX_OPSS_QUEUEING:
> > /*
> > @@ -1532,9 +1605,14 @@ static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags)
> > */
> > BUG();
> > case SCX_OPSS_QUEUED:
> > + /*
> > + * Task is still on the BPF scheduler (not dispatched yet).
> > + * Call ops.dequeue() to notify. Add %SCX_DEQ_SCHED_CHANGE
> > + * only for property changes, not for core-sched picks or
> > + * sleep.
> > + */
>
> The part of the comment about SCX_DEQ_SCHED_CHANGE looks like it belongs in
> call_task_dequeue(), not here.
Ack.
>
> > if (SCX_HAS_OP(sch, dequeue))
> > - SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq,
> > - p, deq_flags);
> > + call_task_dequeue(sch, rq, p, deq_flags);
>
> How about adding WARN_ON_ONCE(!(p->scx.flags & SCX_TASK_NEED_DEQ)) here or in
> call_task_dequeue()?
Ack.
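Something along these lines in call_task_dequeue() (sketch only: the
helper's body isn't in the quoted context, so the SCX_DEQ_SCHED_CHANGE
handling below is inferred from the comments above):

  static void call_task_dequeue(struct scx_sched *sch, struct rq *rq,
                                struct task_struct *p, u64 deq_flags)
  {
          /* Every task reaching here must still be in BPF custody. */
          WARN_ON_ONCE(!(p->scx.flags & SCX_TASK_NEED_DEQ));

          /*
           * Anything that is neither a sleep nor a core-sched pick is
           * treated as a scheduling property change.
           */
          if (!(deq_flags & (SCX_DEQ_SLEEP | SCX_DEQ_CORE_SCHED_EXEC)))
                  deq_flags |= SCX_DEQ_SCHED_CHANGE;

          SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq, p, deq_flags);
          p->scx.flags &= ~SCX_TASK_NEED_DEQ;
  }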
Thanks for the review!

-Andrea