linux-kernel - Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics

Message-ID: <DFV0FQUGMEVB.321XK903AC0B9@google.com>
Date: Thu, 22 Jan 2026 09:28:39 +0000
From: Kuba Piecuch <jpiecuch@...gle.com>
To: Andrea Righi <arighi@...dia.com>, Tejun Heo <tj@...nel.org>, 
	David Vernet <void@...ifault.com>, Changwoo Min <changwoo@...lia.com>
Cc: Emil Tsalapatis <emil@...alapatis.com>, Daniel Hodges <hodgesd@...a.com>, <sched-ext@...ts.linux.dev>, 
	<linux-kernel@...r.kernel.org>
Subject: Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics

[Resending with reply-all, messed up the first time, apologies.]

Hi Andrea,

On Wed Jan 21, 2026 at 12:25 PM UTC, Andrea Righi wrote:
> Currently, ops.dequeue() is only invoked when the sched_ext core knows
> that a task resides in BPF-managed data structures, which causes it to
> miss scheduling property change scenarios. As a result, BPF schedulers
> cannot reliably track task state.
>
> In addition, some ops.dequeue() callbacks can be skipped (e.g., during
> direct dispatch), so ops.enqueue() calls are not always paired with a
> corresponding ops.dequeue(), potentially breaking accounting logic.
>
> Fix this by guaranteeing that every ops.enqueue() is matched with a
> corresponding ops.dequeue(), and introduce the SCX_DEQ_ASYNC flag to
> distinguish dequeues triggered by scheduling property changes from those
> occurring in the normal dispatch workflow.
>
> New semantics:
> 1. ops.enqueue() is called when a task enters the BPF scheduler
> 2. ops.dequeue() is called when the task leaves the BPF scheduler,
>    because it is dispatched to a DSQ (regular workflow)
> 3. ops.dequeue(SCX_DEQ_ASYNC) is called when the task leaves the BPF
>    scheduler, because a task property is changed (sched_change)

What about the case where ops.dequeue() is called due to core-sched picking the
task through sched_core_find()? If I understand core-sched correctly, it can
happen without prior dispatch, so it doesn't fit case 2, and we're not changing
task properties, so it doesn't fit case 3 either.
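
To make the gap concrete, here is a tiny userspace sketch of how I read the
three listed cases. The function and parameter names are made up for
illustration and are not part of the patch; the return value is just the case
number from the list above, with 0 meaning "no listed case applies".

```c
#include <stdbool.h>

/*
 * Hypothetical model of the proposed semantics, not kernel code.
 * Returns the numbered case (2 or 3) that describes why the task
 * left the BPF scheduler, or 0 if none of the listed cases fits.
 */
static int model_semantics_case(bool dispatched_to_dsq, bool property_changed)
{
	if (property_changed)
		return 3;	/* ops.dequeue(SCX_DEQ_ASYNC) */
	if (dispatched_to_dsq)
		return 2;	/* regular ops.dequeue() */
	return 0;		/* not covered by the description */
}
```

If my understanding of core-sched is right, a sched_core_find() pick
corresponds to model_semantics_case(false, false), which falls through to 0.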

> +     /*
> +      * Set when ops.dequeue() is called after successful dispatch; used to
> +      * distinguish dispatch dequeues from async dequeues (property changes)
> +      * and to prevent duplicate dequeue calls.
> +      */
> +     SCX_TASK_DISPATCH_DEQUEUED = 1 << 4,

I see this flag being set and cleared in several places, but I don't see it
actually being read. Is that intentional?
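
For reference, this is roughly the kind of read I expected to find, given that
the comment says the flag is used to prevent duplicate dequeue calls. It is a
userspace sketch under my own assumptions: the struct, the helper, and the
MODEL_TASK_OPS_ENQUEUED bit value are hypothetical, and only the 1 << 4 value
is taken from the patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Placeholder bits for this model; only 1 << 4 comes from the patch. */
#define MODEL_TASK_OPS_ENQUEUED		(1u << 0)
#define MODEL_TASK_DISPATCH_DEQUEUED	(1u << 4)

struct model_task {
	uint32_t flags;
	int dequeue_calls;	/* modeled ops.dequeue() invocations */
};

/*
 * Hypothetical helper: notify the BPF scheduler at most once per
 * enqueue by testing the flag before setting it.
 */
static bool model_dispatch_dequeue(struct model_task *p)
{
	if (!(p->flags & MODEL_TASK_OPS_ENQUEUED))
		return false;	/* task never entered the BPF scheduler */
	if (p->flags & MODEL_TASK_DISPATCH_DEQUEUED)
		return false;	/* the read I could not find in the patch */

	p->flags |= MODEL_TASK_DISPATCH_DEQUEUED;
	p->dequeue_calls++;
	return true;
}
```

Without a test like the second one, setting and clearing the bit has no
observable effect in this model.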

> @@ -1529,6 +1553,17 @@ static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags)
> 
>       switch (opss & SCX_OPSS_STATE_MASK) {
>       case SCX_OPSS_NONE:
> +             if (SCX_HAS_OP(sch, dequeue) &&
> +                 p->scx.flags & SCX_TASK_OPS_ENQUEUED) {
> +                     bool is_async_dequeue =
> +                             !(deq_flags & (DEQUEUE_SLEEP | SCX_DEQ_CORE_SCHED_EXEC));
> +
> +                     if (is_async_dequeue)
> +                             SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq,
> +                                              p, deq_flags | SCX_DEQ_ASYNC);
> +                     p->scx.flags &= ~(SCX_TASK_OPS_ENQUEUED |
> +                                       SCX_TASK_DISPATCH_DEQUEUED);
> +             }
>               break;
>       case SCX_OPSS_QUEUEING:
>               /*
> @@ -1537,9 +1572,17 @@ static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags)
>                */
>               BUG();
>       case SCX_OPSS_QUEUED:
> -             if (SCX_HAS_OP(sch, dequeue))
> +             /*
> +              * Task is in the enqueued state. This is a property change
> +              * dequeue before dispatch completes. Notify the BPF scheduler
> +              * with SCX_DEQ_ASYNC flag.
> +              */
> +             if (SCX_HAS_OP(sch, dequeue)) {
>                       SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq,
> -                                      p, deq_flags);
> +                                      p, deq_flags | SCX_DEQ_ASYNC);
> +                     p->scx.flags &= ~(SCX_TASK_OPS_ENQUEUED |
> +                                       SCX_TASK_DISPATCH_DEQUEUED);
> +             }
> 

A core-sched pick of a task queued on the BPF scheduler will result in entering
the SCX_OPSS_QUEUED case, which in turn will call ops.dequeue(SCX_DEQ_ASYNC).
This seems to conflict with the is_async_dequeue check above, which treats
SCX_DEQ_CORE_SCHED_EXEC as a synchronous dequeue.
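
Here is a userspace model of just the two quoted branches, to show the
mismatch I mean. The flag bit values and all model_* names are placeholders I
picked for the sketch, not the kernel's definitions, and the model ignores
everything else ops_dequeue() does.

```c
#include <stdbool.h>
#include <stdint.h>

/* Placeholder bit values for this model only. */
#define DEQUEUE_SLEEP			(1ULL << 0)
#define SCX_DEQ_CORE_SCHED_EXEC		(1ULL << 1)
#define SCX_DEQ_ASYNC			(1ULL << 2)

enum model_opss { MODEL_OPSS_NONE, MODEL_OPSS_QUEUED };

struct model_result {
	bool called;		/* would ops.dequeue() be invoked? */
	uint64_t flags;		/* flags it would receive */
};

/* Mirrors only the SCX_OPSS_NONE and SCX_OPSS_QUEUED branches above. */
static struct model_result model_ops_dequeue(enum model_opss opss,
					     bool ops_enqueued,
					     uint64_t deq_flags)
{
	struct model_result res = { false, 0 };

	switch (opss) {
	case MODEL_OPSS_NONE:
		/* Core-sched and sleep dequeues are filtered out here. */
		if (ops_enqueued &&
		    !(deq_flags & (DEQUEUE_SLEEP | SCX_DEQ_CORE_SCHED_EXEC))) {
			res.called = true;
			res.flags = deq_flags | SCX_DEQ_ASYNC;
		}
		break;
	case MODEL_OPSS_QUEUED:
		/* No filtering: every dequeue is tagged as async. */
		res.called = true;
		res.flags = deq_flags | SCX_DEQ_ASYNC;
		break;
	}
	return res;
}
```

With deq_flags == SCX_DEQ_CORE_SCHED_EXEC, the QUEUED branch reports a call
with SCX_DEQ_ASYNC set, while the NONE branch treats the same flag as
non-async and reports no call at all.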

Thanks,
Kuba