Message-ID: <20260204160710.1475802-1-arighi@nvidia.com>
Date: Wed, 4 Feb 2026 17:05:57 +0100
From: Andrea Righi <arighi@...dia.com>
To: Tejun Heo <tj@...nel.org>,
David Vernet <void@...ifault.com>,
Changwoo Min <changwoo@...lia.com>
Cc: Kuba Piecuch <jpiecuch@...gle.com>,
Emil Tsalapatis <emil@...alapatis.com>,
Christian Loehle <christian.loehle@....com>,
Daniel Hodges <hodgesd@...a.com>,
sched-ext@...ts.linux.dev,
linux-kernel@...r.kernel.org
Subject: [PATCHSET v5] sched_ext: Fix ops.dequeue() semantics
The callback ops.dequeue() is provided to let BPF schedulers observe when a
task leaves the scheduler, either because it is dispatched or due to a task
property change. However, this callback is currently unreliable and not
invoked systematically, which can result in missed ops.dequeue() events.
In particular, once a task is removed from the scheduler (whether for
dispatch or due to a property change) the BPF scheduler loses visibility of
the task and the sched_ext core may not always trigger ops.dequeue().
This breaks accurate accounting (e.g., per-DSQ queued runtime sums) and
prevents reliable tracking of task lifecycle transitions.
This patch set fixes the semantics of ops.dequeue() by guaranteeing that
each task entering the BPF scheduler's custody triggers exactly one
ops.dequeue() call when it leaves that custody, whether the exit is due to
a dispatch (regular or via a core scheduling pick) or to a scheduling
property change (e.g. sched_setaffinity(), sched_setscheduler(),
set_user_nice(), NUMA balancing, etc.).
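The exactly-once guarantee can be pictured with a small user-space model. This is only an illustrative sketch, not the kernel implementation: struct demo_task, demo_enqueue(), demo_leave() and demo_custody_run() are hypothetical names, and the model assumes that every exit path (dispatch, core-sched pick, property change) goes through a single "leave custody" step.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical user-space model; none of these names are sched_ext APIs. */
struct demo_task {
	bool in_custody;	/* true while the BPF scheduler holds the task */
	int dequeue_calls;	/* how often the modeled ops.dequeue() ran */
};

/* A direct dispatch to a local DSQ never enters BPF scheduler custody. */
static void demo_enqueue(struct demo_task *t, bool direct_to_local_dsq)
{
	t->in_custody = !direct_to_local_dsq;
}

/* Assumed single funnel for dispatch, core-sched pick and property change. */
static void demo_leave(struct demo_task *t)
{
	if (!t->in_custody)
		return;		/* already out of custody: nothing is reported */
	t->in_custody = false;
	t->dequeue_calls++;	/* the one modeled ops.dequeue() call */
}

/* Walk one task through both cases; return the total modeled calls. */
static int demo_custody_run(void)
{
	struct demo_task t = { 0 };

	demo_enqueue(&t, false);
	demo_leave(&t);		/* leaves custody: one dequeue reported */
	demo_leave(&t);		/* later event: not reported again */
	assert(t.dequeue_calls == 1);

	demo_enqueue(&t, true);
	demo_leave(&t);		/* was never in custody: no dequeue */
	return t.dequeue_calls;
}
```

In this model demo_custody_run() reports a single dequeue for the task that entered custody and none for the directly dispatched one.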
To identify property-change dequeues, a new ops.dequeue() flag is
introduced: %SCX_DEQ_SCHED_CHANGE.
Together, these changes allow BPF schedulers to reliably track task
ownership and maintain accurate accounting.
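As a rough illustration of the accounting this enables, the user-space sketch below keeps a per-DSQ queued runtime sum and counts the two kinds of dequeue separately. It is a model under stated assumptions, not BPF scheduler code: DEMO_DEQ_SCHED_CHANGE is a placeholder bit (the real SCX_DEQ_SCHED_CHANGE value comes from the patch), and struct demo_dsq_stats, demo_on_enqueue(), demo_on_dequeue() and demo_accounting_run() are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* Placeholder bit; the real SCX_DEQ_SCHED_CHANGE value is set by the patch. */
#define DEMO_DEQ_SCHED_CHANGE	(1ULL << 0)

/* Hypothetical per-DSQ statistics kept by a scheduler. */
struct demo_dsq_stats {
	uint64_t queued_runtime;	/* runtime sum of tasks still queued */
	uint64_t nr_dispatch_deq;	/* dequeues caused by dispatch */
	uint64_t nr_change_deq;		/* dequeues caused by property change */
};

/* Modeled enqueue path: the task's runtime joins the queued sum. */
static void demo_on_enqueue(struct demo_dsq_stats *s, uint64_t runtime)
{
	s->queued_runtime += runtime;
}

/* Modeled dequeue path: undo the enqueue and classify by flag. */
static void demo_on_dequeue(struct demo_dsq_stats *s, uint64_t runtime,
			    uint64_t deq_flags)
{
	s->queued_runtime -= runtime;
	if (deq_flags & DEMO_DEQ_SCHED_CHANGE)
		s->nr_change_deq++;
	else
		s->nr_dispatch_deq++;
}

/* Two tasks enter and leave; the sum returns to zero only if no
 * dequeue event is missed. */
static uint64_t demo_accounting_run(void)
{
	struct demo_dsq_stats s = { 0 };

	demo_on_enqueue(&s, 300);
	demo_on_enqueue(&s, 500);
	demo_on_dequeue(&s, 300, 0);			 /* dispatched */
	demo_on_dequeue(&s, 500, DEMO_DEQ_SCHED_CHANGE); /* property change */

	assert(s.nr_dispatch_deq == 1 && s.nr_change_deq == 1);
	return s.queued_runtime;
}
```

If either dequeue were skipped, queued_runtime would stay inflated, which is the kind of drift described above.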
Changes in v5:
- Introduce the concept of "terminal DSQ" (when a task is dispatched to a
  terminal DSQ, the task leaves the BPF scheduler's custody)
- Consider SCX_DSQ_GLOBAL as a terminal DSQ
- Link to v4: https://lore.kernel.org/all/20260201091318.178710-1-arighi@nvidia.com
Changes in v4:
- Introduce the concept of "BPF scheduler custody"
- Do not trigger ops.dequeue() for direct dispatches to local DSQs
- Trigger ops.dequeue() only once; after the task leaves BPF scheduler
  custody, further dequeue events are not reported
- Link to v3: https://lore.kernel.org/all/20260126084258.3798129-1-arighi@nvidia.com
Changes in v3:
- Rename SCX_DEQ_ASYNC to SCX_DEQ_SCHED_CHANGE
- Handle core-sched dequeues (Kuba)
- Link to v2: https://lore.kernel.org/all/20260121123118.964704-1-arighi@nvidia.com
Changes in v2:
- Distinguish between "dispatch" dequeues and "property change" dequeues
  (flag SCX_DEQ_ASYNC)
- Link to v1: https://lore.kernel.org/all/20251219224450.2537941-1-arighi@nvidia.com
Andrea Righi (2):
  sched_ext: Fix ops.dequeue() semantics
  selftests/sched_ext: Add test to validate ops.dequeue() semantics

 Documentation/scheduler/sched-ext.rst           |  74 +++++++
 include/linux/sched/ext.h                       |   1 +
 kernel/sched/ext.c                              | 186 +++++++++++++++-
 kernel/sched/ext_internal.h                     |   7 +
 tools/sched_ext/include/scx/enum_defs.autogen.h |   1 +
 tools/sched_ext/include/scx/enums.autogen.bpf.h |   2 +
 tools/sched_ext/include/scx/enums.autogen.h     |   1 +
 tools/testing/selftests/sched_ext/Makefile      |   1 +
 tools/testing/selftests/sched_ext/dequeue.bpf.c | 269 ++++++++++++++++++++++++
 tools/testing/selftests/sched_ext/dequeue.c     | 207 ++++++++++++++++++
 10 files changed, 746 insertions(+), 3 deletions(-)
 create mode 100644 tools/testing/selftests/sched_ext/dequeue.bpf.c
 create mode 100644 tools/testing/selftests/sched_ext/dequeue.c