[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-ID: <20250903095008.162049-1-arighi@nvidia.com>
Date: Wed, 3 Sep 2025 11:33:26 +0200
From: Andrea Righi <arighi@...dia.com>
To: Ingo Molnar <mingo@...hat.com>,
Peter Zijlstra <peterz@...radead.org>,
Juri Lelli <juri.lelli@...hat.com>,
Vincent Guittot <vincent.guittot@...aro.org>,
Dietmar Eggemann <dietmar.eggemann@....com>,
Steven Rostedt <rostedt@...dmis.org>,
Ben Segall <bsegall@...gle.com>,
Mel Gorman <mgorman@...e.de>,
Valentin Schneider <vschneid@...hat.com>,
Joel Fernandes <joelagnelf@...dia.com>,
Tejun Heo <tj@...nel.org>,
David Vernet <void@...ifault.com>,
Changwoo Min <changwoo@...lia.com>,
Shuah Khan <shuah@...nel.org>
Cc: sched-ext@...ts.linux.dev,
bpf@...r.kernel.org,
linux-kernel@...r.kernel.org
Subject: [PATCHSET v8 sched_ext/for-6.18] Add a deadline server for sched_ext tasks
sched_ext tasks can be starved by long-running RT tasks, especially since
RT throttling was replaced by deadline servers to boost only SCHED_NORMAL
tasks.
Several users in the community have reported issues with RT stalling
sched_ext tasks. This is fairly common on distributions or environments
where applications like video compositors, audio services, etc. run as RT
tasks by default.
Example trace (showing a per-CPU kthread stalled due to the sway Wayland
compositor running as an RT task):
runnable task stall (kworker/0:0[106377] failed to run for 5.043s)
...
CPU 0 : nr_run=3 flags=0xd cpu_rel=0 ops_qseq=20646200 pnt_seq=45388738
curr=sway[994] class=rt_sched_class
R kworker/0:0[106377] -5043ms
scx_state/flags=3/0x1 dsq_flags=0x0 ops_state/qseq=0/0
sticky/holding_cpu=-1/-1 dsq_id=0x8000000000000002 dsq_vtime=0 slice=20000000
cpus=01
This is often perceived as a bug in the BPF schedulers, but in reality they
can't do much: RT tasks run outside their control and can potentially
consume 100% of the CPU bandwidth.
Fix this by adding a sched_ext deadline server as well so that sched_ext
tasks are also boosted and do not suffer starvation.
Two kselftests are also provided to verify the starvation fixes and
bandwidth allocation is correct.
This patchset is also available in the following git branch:
git://git.kernel.org/pub/scm/linux/kernel/git/arighi/linux.git scx-dl-server
Changes in v8:
- Add tj's patch to de-couple balance and pick_task and avoid changing
sched/core callbacks to propagate @rf
- Simplify dl_se->dl_server check (suggested by PeterZ)
- Small coding style fixes in the kselftests
- Link to v7: https://lore.kernel.org/all/20250809184800.129831-1-joelagnelf@nvidia.com/
Changes in v7:
- Rebased to Linus master
- Link to v6: https://lore.kernel.org/all/20250702232944.3221001-1-joelagnelf@nvidia.com/
Changes in v6:
- Added Acks to few patches
- Fixes to few nits suggested by Tejun
- Link to v5: https://lore.kernel.org/all/20250620203234.3349930-1-joelagnelf@nvidia.com/
Changes in v5:
- Added a kselftest (total_bw) to sched_ext to verify bandwidth values
from debugfs
- Address comment from Andrea about redundant rq clock invalidation
- Link to v4: https://lore.kernel.org/all/20250617200523.1261231-1-joelagnelf@nvidia.com/
Changes in v4:
- Fixed issues with hotplugged CPUs having their DL server bandwidth
altered due to loading SCX
- Fixed other issues
- Rebased on Linus master
- All sched_ext kselftests reliably pass now, also verified that the
total_bw in debugfs (CONFIG_SCHED_DEBUG) is conserved with these patches
- Link to v3: https://lore.kernel.org/all/20250613051734.4023260-1-joelagnelf@nvidia.com/
Changes in v3:
- Removed code duplication in debugfs. Made ext interface separate
- Fixed issue where rq_lock_irqsave was not used in the relinquish patch
- Fixed running bw accounting issue in dl_server_remove_params
- Link to v2: https://lore.kernel.org/all/20250602180110.816225-1-joelagnelf@nvidia.com/
Changes in v2:
- Fixed a hang related to using rq_lock instead of rq_lock_irqsave
- Added support to remove BW of DL servers when they are switched to/from EXT
- Link to v1: https://lore.kernel.org/all/20250315022158.2354454-1-joelagnelf@nvidia.com/
Andrea Righi (6):
sched_ext: Exit early on hotplug events during attach
sched/deadline: Add support to remove DL server's bandwidth contribution
sched/deadline: Account ext server bandwidth
sched/deadline: Allow to initialize DL server when needed
sched_ext: Selectively enable ext and fair DL servers
selftests/sched_ext: Add test for sched_ext dl_server
Joel Fernandes (9):
sched/debug: Fix updating of ppos on server write ops
sched/debug: Stop and start server based on if it was active
sched/deadline: Clear the defer params
sched/deadline: Return EBUSY if dl_bw_cpus is zero
sched: Add a server arg to dl_server_update_idle_time()
sched_ext: Add a DL server for sched_ext tasks
sched/debug: Add support to change sched_ext server params
sched/deadline: Fix DL server crash in inactive_timer callback
selftests/sched_ext: Add test for DL server total_bw consistency
Tejun Heo (1):
sched/deadline: De-couple balance and pick_task
include/linux/sched.h | 2 +
kernel/sched/core.c | 17 +-
kernel/sched/deadline.c | 152 +++++++++---
kernel/sched/debug.c | 161 ++++++++++---
kernel/sched/ext.c | 175 ++++++++++++--
kernel/sched/fair.c | 4 +-
kernel/sched/idle.c | 2 +-
kernel/sched/sched.h | 15 +-
kernel/sched/topology.c | 5 +
tools/testing/selftests/sched_ext/Makefile | 2 +
tools/testing/selftests/sched_ext/rt_stall.bpf.c | 23 ++
tools/testing/selftests/sched_ext/rt_stall.c | 214 +++++++++++++++++
tools/testing/selftests/sched_ext/total_bw.c | 281 +++++++++++++++++++++++
13 files changed, 968 insertions(+), 85 deletions(-)
create mode 100644 tools/testing/selftests/sched_ext/rt_stall.bpf.c
create mode 100644 tools/testing/selftests/sched_ext/rt_stall.c
create mode 100644 tools/testing/selftests/sched_ext/total_bw.c
Powered by blists - more mailing lists