Message-ID: <DFV0FQUGMEVB.321XK903AC0B9@google.com>
Date: Thu, 22 Jan 2026 09:28:39 +0000
From: Kuba Piecuch <jpiecuch@...gle.com>
To: Andrea Righi <arighi@...dia.com>, Tejun Heo <tj@...nel.org>,
David Vernet <void@...ifault.com>, Changwoo Min <changwoo@...lia.com>
Cc: Emil Tsalapatis <emil@...alapatis.com>, Daniel Hodges <hodgesd@...a.com>, <sched-ext@...ts.linux.dev>,
<linux-kernel@...r.kernel.org>
Subject: Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics

[Resending with reply-all, messed up the first time, apologies.]

Hi Andrea,

On Wed Jan 21, 2026 at 12:25 PM UTC, Andrea Righi wrote:
> Currently, ops.dequeue() is only invoked when the sched_ext core knows
> that a task resides in BPF-managed data structures, which causes it to
> miss scheduling property change scenarios. As a result, BPF schedulers
> cannot reliably track task state.
>
> In addition, some ops.dequeue() callbacks can be skipped (e.g., during
> direct dispatch), so ops.enqueue() calls are not always paired with a
> corresponding ops.dequeue(), potentially breaking accounting logic.
>
> Fix this by guaranteeing that every ops.enqueue() is matched with a
> corresponding ops.dequeue(), and introduce the SCX_DEQ_ASYNC flag to
> distinguish dequeues triggered by scheduling property changes from those
> occurring in the normal dispatch workflow.
>
> New semantics:
>  1. ops.enqueue() is called when a task enters the BPF scheduler
>  2. ops.dequeue() is called when the task leaves the BPF scheduler,
>     because it is dispatched to a DSQ (regular workflow)
>  3. ops.dequeue(SCX_DEQ_ASYNC) is called when the task leaves the BPF
>     scheduler, because a task property is changed (sched_change)
What about the case where ops.dequeue() is called due to core-sched picking the
task through sched_core_find()? If I understand core-sched correctly, it can
happen without prior dispatch, so it doesn't fit case 2, and we're not changing
task properties, so it doesn't fit case 3 either.
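
To make the gap concrete, here is a rough userspace sketch of how a BPF scheduler might have to classify dequeues. It is not code from the patch: the deq_reason enum and classify_dequeue() are made-up names, and the bit positions are placeholders (SCX_DEQ_ASYNC is new in this series, so its real value may differ). The point is only that a core-sched pick needs a third bucket next to cases 2 and 3.

```c
#include <assert.h>
#include <stdint.h>

/* Placeholder bit values for illustration; the real kernel values may differ. */
#define SCX_DEQ_CORE_SCHED_EXEC (1ULL << 32)
#define SCX_DEQ_ASYNC           (1ULL << 33)

/* Hypothetical reasons a BPF scheduler might want to tell apart. */
enum deq_reason {
	DEQ_DISPATCHED,		/* case 2: left via dispatch to a DSQ */
	DEQ_PROPERTY_CHANGE,	/* case 3: sched_change */
	DEQ_CORE_SCHED_PICK,	/* picked by core-sched, fits neither 2 nor 3 */
};

static enum deq_reason classify_dequeue(uint64_t deq_flags)
{
	/*
	 * Test the core-sched bit first, so a core-sched pick is not
	 * mistaken for a property change if SCX_DEQ_ASYNC is also set.
	 */
	if (deq_flags & SCX_DEQ_CORE_SCHED_EXEC)
		return DEQ_CORE_SCHED_PICK;
	if (deq_flags & SCX_DEQ_ASYNC)
		return DEQ_PROPERTY_CHANGE;
	return DEQ_DISPATCHED;
}

/* Small self-check of the three buckets. */
static void classify_dequeue_check(void)
{
	assert(classify_dequeue(0) == DEQ_DISPATCHED);
	assert(classify_dequeue(SCX_DEQ_ASYNC) == DEQ_PROPERTY_CHANGE);
	assert(classify_dequeue(SCX_DEQ_CORE_SCHED_EXEC) == DEQ_CORE_SCHED_PICK);
}
```

Without an explicit rule for the core-sched case, a scheduler that only looks at SCX_DEQ_ASYNC would file such a dequeue under one of the other two reasons.
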
> +	/*
> +	 * Set when ops.dequeue() is called after successful dispatch; used to
> +	 * distinguish dispatch dequeues from async dequeues (property changes)
> +	 * and to prevent duplicate dequeue calls.
> +	 */
> +	SCX_TASK_DISPATCH_DEQUEUED = 1 << 4,
I see this flag being set and cleared in several places, but I don't see it
actually being read anywhere. Is that intentional?
> @@ -1529,6 +1553,17 @@ static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags)
>
>  	switch (opss & SCX_OPSS_STATE_MASK) {
>  	case SCX_OPSS_NONE:
> +		if (SCX_HAS_OP(sch, dequeue) &&
> +		    p->scx.flags & SCX_TASK_OPS_ENQUEUED) {
> +			bool is_async_dequeue =
> +				!(deq_flags & (DEQUEUE_SLEEP | SCX_DEQ_CORE_SCHED_EXEC));
> +
> +			if (is_async_dequeue)
> +				SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq,
> +						 p, deq_flags | SCX_DEQ_ASYNC);
> +			p->scx.flags &= ~(SCX_TASK_OPS_ENQUEUED |
> +					  SCX_TASK_DISPATCH_DEQUEUED);
> +		}
>  		break;
>  	case SCX_OPSS_QUEUEING:
>  		/*
> @@ -1537,9 +1572,17 @@ static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags)
>  		 */
>  		BUG();
>  	case SCX_OPSS_QUEUED:
> -		if (SCX_HAS_OP(sch, dequeue))
> +		/*
> +		 * Task is in the enqueued state. This is a property change
> +		 * dequeue before dispatch completes. Notify the BPF scheduler
> +		 * with SCX_DEQ_ASYNC flag.
> +		 */
> +		if (SCX_HAS_OP(sch, dequeue)) {
>  			SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq,
> -					 p, deq_flags);
> +					 p, deq_flags | SCX_DEQ_ASYNC);
> +			p->scx.flags &= ~(SCX_TASK_OPS_ENQUEUED |
> +					  SCX_TASK_DISPATCH_DEQUEUED);
> +		}
>
A core-sched pick of a task queued on the BPF scheduler will result in entering
the SCX_OPSS_QUEUED case, which in turn will call ops.dequeue(SCX_DEQ_ASYNC).
This seems to conflict with the is_async_dequeue check above, which treats
SCX_DEQ_CORE_SCHED_EXEC as a synchronous dequeue.
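
To illustrate the mismatch, here is a rough userspace model of the two branches in the hunk above. It is only a sketch: the model_* names are made up, the flag values are placeholders rather than the kernel's, and it assumes ops.dequeue() is implemented and SCX_TASK_OPS_ENQUEUED is set on the task.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Placeholder bit values for illustration; the real kernel values may differ. */
#define DEQUEUE_SLEEP           (1ULL << 0)
#define SCX_DEQ_CORE_SCHED_EXEC (1ULL << 32)
#define SCX_DEQ_ASYNC           (1ULL << 33)

/* Hypothetical stand-ins for the two ops states handled in the hunk. */
enum model_opss { MODEL_OPSS_NONE, MODEL_OPSS_QUEUED };

struct model_result {
	bool called;	/* would ops.dequeue() be invoked? */
	uint64_t flags;	/* flags it would receive (meaningful only if called) */
};

/*
 * Mirrors the branch structure of the quoted hunk, assuming ops.dequeue()
 * is implemented and SCX_TASK_OPS_ENQUEUED is set on the task.
 */
static struct model_result model_ops_dequeue(enum model_opss opss,
					     uint64_t deq_flags)
{
	struct model_result res = { .called = false, .flags = 0 };

	switch (opss) {
	case MODEL_OPSS_NONE:
		/* Only "async" dequeues reach the callback in this state. */
		if (!(deq_flags & (DEQUEUE_SLEEP | SCX_DEQ_CORE_SCHED_EXEC))) {
			res.called = true;
			res.flags = deq_flags | SCX_DEQ_ASYNC;
		}
		break;
	case MODEL_OPSS_QUEUED:
		/* Every dequeue in this state is tagged as async. */
		res.called = true;
		res.flags = deq_flags | SCX_DEQ_ASYNC;
		break;
	}
	return res;
}

/* Same core-sched pick, two different outcomes depending on the ops state. */
static void model_core_sched_pick_check(void)
{
	struct model_result none =
		model_ops_dequeue(MODEL_OPSS_NONE, SCX_DEQ_CORE_SCHED_EXEC);
	struct model_result queued =
		model_ops_dequeue(MODEL_OPSS_QUEUED, SCX_DEQ_CORE_SCHED_EXEC);

	assert(!none.called);				/* NONE: not async */
	assert(queued.called);
	assert((queued.flags & SCX_DEQ_ASYNC) != 0);	/* QUEUED: async */
}
```

In this model the same SCX_DEQ_CORE_SCHED_EXEC dequeue is excluded from the async path in the NONE case but tagged SCX_DEQ_ASYNC in the QUEUED case.
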
Thanks,
Kuba