Message-ID: <20260201091318.178710-1-arighi@nvidia.com>
Date: Sun, 1 Feb 2026 10:08:03 +0100
From: Andrea Righi <arighi@...dia.com>
To: Tejun Heo <tj@...nel.org>,
David Vernet <void@...ifault.com>,
Changwoo Min <changwoo@...lia.com>
Cc: Kuba Piecuch <jpiecuch@...gle.com>,
Emil Tsalapatis <emil@...alapatis.com>,
Christian Loehle <christian.loehle@....com>,
Daniel Hodges <hodgesd@...a.com>,
sched-ext@...ts.linux.dev,
linux-kernel@...r.kernel.org
Subject: [PATCHSET v4 sched_ext/for-6.20] sched_ext: Fix ops.dequeue() semantics

The ops.dequeue() callback lets BPF schedulers observe when a task leaves
the scheduler, either because it is dispatched or because one of its
scheduling properties changes. However, the callback is currently not
invoked consistently, so some ops.dequeue() events are missed.

In particular, once a task is removed from the scheduler (whether for
dispatch or due to a property change), the BPF scheduler loses visibility
of the task and the sched_ext core may not trigger ops.dequeue().

This breaks accurate accounting (e.g., per-DSQ queued runtime sums) and
prevents reliable tracking of task lifecycle transitions.

This patch set fixes the semantics of ops.dequeue() by guaranteeing that
each task entering the BPF scheduler's custody triggers exactly one
ops.dequeue() call when it leaves that custody, whether the exit is due to
a dispatch (regular or via a core scheduling pick) or to a scheduling
property change (e.g. sched_setaffinity(), sched_setscheduler(),
set_user_nice(), NUMA balancing).

To identify property change dequeues, a new ops.dequeue() flag is
introduced: %SCX_DEQ_SCHED_CHANGE.

Together, these changes allow BPF schedulers to reliably track task
ownership and maintain accurate accounting.
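
As a rough illustration of why the "exactly one ops.dequeue() per custody
period" guarantee matters for accounting, here is a small user-space model
in plain C. It is not code from the patch set and does not use the
sched_ext API: the names model_task, model_stats, model_enqueue(),
model_dequeue() and the MODEL_DEQ_SCHED_CHANGE bit value are all
hypothetical stand-ins for whatever a BPF scheduler would do in its
ops.enqueue() and ops.dequeue() callbacks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical flag bit for this model only. The real value of
 * SCX_DEQ_SCHED_CHANGE is defined by the kernel headers.
 */
#define MODEL_DEQ_SCHED_CHANGE (1ULL << 0)

struct model_task {
	uint64_t runtime;	/* amount accounted while the task is queued */
	bool in_custody;	/* true between enqueue and the matching dequeue */
};

struct model_stats {
	uint64_t queued_runtime;	/* sum over all tasks currently queued */
	uint64_t nr_dispatch_deq;	/* dequeues caused by a dispatch */
	uint64_t nr_change_deq;		/* dequeues caused by a property change */
};

/* What a scheduler might do when a task enters its custody. */
static void model_enqueue(struct model_stats *st, struct model_task *t)
{
	assert(!t->in_custody);
	t->in_custody = true;
	st->queued_runtime += t->runtime;
}

/*
 * What a scheduler might do on the matching dequeue. The sum stays
 * correct only if this runs exactly once per model_enqueue().
 */
static void model_dequeue(struct model_stats *st, struct model_task *t,
			  uint64_t deq_flags)
{
	assert(t->in_custody);
	t->in_custody = false;
	st->queued_runtime -= t->runtime;

	if (deq_flags & MODEL_DEQ_SCHED_CHANGE)
		st->nr_change_deq++;
	else
		st->nr_dispatch_deq++;
}
```

In this model a missed dequeue leaves queued_runtime permanently too high,
and a duplicated one trips the assertion, which is the kind of drift the
cover letter describes.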

= Open issues =

A few refinements are still pending, but I'm sending a new version so that
we can discuss them against the latest agreed-upon semantics.

Open issues that still need agreement:

 - Should we trigger ops.dequeue() for tasks dispatched to SCX_DSQ_GLOBAL,
   i.e. do we treat SCX_DSQ_GLOBAL as a local DSQ or as a user DSQ? In the
   current implementation SCX_DSQ_GLOBAL is treated like a built-in user
   DSQ, so ops.dequeue() is invoked for tasks dispatched to it.

Changes in v4:
 - Introduce the concept of "BPF scheduler custody"
 - Do not trigger ops.dequeue() for tasks directly dispatched to local DSQs
 - Trigger ops.dequeue() only once; after the task leaves BPF scheduler
   custody, further dequeue events are not reported
 - Link to v3: https://lore.kernel.org/all/20260126084258.3798129-1-arighi@nvidia.com
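
The custody rules in the v4 list above can be sketched as a tiny state
model in plain C. This is a simplified reading of the changelog, not the
kernel implementation: custody_task, custody_dsq_kind, custody_insert() and
custody_leave() are hypothetical names, and treating SCX_DSQ_GLOBAL like a
user DSQ follows the current behavior described under "Open issues", which
may still change.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical classification of the DSQ a task is dispatched to. */
enum custody_dsq_kind {
	CUSTODY_DSQ_LOCAL,	/* direct dispatch to a local DSQ */
	CUSTODY_DSQ_GLOBAL,	/* SCX_DSQ_GLOBAL, treated like a user DSQ here */
	CUSTODY_DSQ_USER,	/* DSQ created by the BPF scheduler */
};

struct custody_task {
	bool in_custody;		/* is the BPF scheduler tracking it? */
	unsigned int nr_dequeue_calls;	/* modeled ops.dequeue() invocations */
};

/*
 * Model a dispatch decision. Returns true if the task is now in BPF
 * scheduler custody. Direct dispatch to a local DSQ never enters custody,
 * so no ops.dequeue() is expected for it later.
 */
static bool custody_insert(struct custody_task *t, enum custody_dsq_kind kind)
{
	assert(!t->in_custody);
	t->in_custody = (kind != CUSTODY_DSQ_LOCAL);
	return t->in_custody;
}

/*
 * Model the task leaving (dispatch or property change). Returns true if
 * ops.dequeue() would be reported. After custody ends, further dequeue
 * events are silently ignored.
 */
static bool custody_leave(struct custody_task *t)
{
	if (!t->in_custody)
		return false;

	t->in_custody = false;
	t->nr_dequeue_calls++;
	return true;
}
```

Calling custody_leave() twice after one custody_insert() into a user DSQ
reports the first event and drops the second, matching the "only once"
rule.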

Changes in v3:
 - Rename SCX_DEQ_ASYNC to SCX_DEQ_SCHED_CHANGE
 - Handle core-sched dequeues (Kuba)
 - Link to v2: https://lore.kernel.org/all/20260121123118.964704-1-arighi@nvidia.com

Changes in v2:
 - Distinguish between "dispatch" dequeues and "property change" dequeues
   (flag SCX_DEQ_ASYNC)
 - Link to v1: https://lore.kernel.org/all/20251219224450.2537941-1-arighi@nvidia.com

Andrea Righi (2):
  sched_ext: Fix ops.dequeue() semantics
  selftests/sched_ext: Add test to validate ops.dequeue() semantics

 Documentation/scheduler/sched-ext.rst           |  76 ++++++++
 include/linux/sched/ext.h                       |   1 +
 kernel/sched/ext.c                              | 168 ++++++++++++++++-
 kernel/sched/ext_internal.h                     |   7 +
 tools/sched_ext/include/scx/enum_defs.autogen.h |   1 +
 tools/sched_ext/include/scx/enums.autogen.bpf.h |   2 +
 tools/sched_ext/include/scx/enums.autogen.h     |   1 +
 tools/testing/selftests/sched_ext/Makefile      |   1 +
 tools/testing/selftests/sched_ext/dequeue.bpf.c | 234 ++++++++++++++++++++++++
 tools/testing/selftests/sched_ext/dequeue.c     | 198 ++++++++++++++++++++
 10 files changed, 686 insertions(+), 3 deletions(-)
 create mode 100644 tools/testing/selftests/sched_ext/dequeue.bpf.c
 create mode 100644 tools/testing/selftests/sched_ext/dequeue.c