Message-ID: <c978e3ed-054f-4849-a4ff-d0fba07e3c19@arm.com>
Date: Thu, 30 Oct 2025 17:00:24 +0000
From: Christian Loehle <christian.loehle@....com>
To: Andrea Righi <arighi@...dia.com>, Ingo Molnar <mingo@...hat.com>,
 Peter Zijlstra <peterz@...radead.org>, Juri Lelli <juri.lelli@...hat.com>,
 Vincent Guittot <vincent.guittot@...aro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@....com>,
 Steven Rostedt <rostedt@...dmis.org>, Ben Segall <bsegall@...gle.com>,
 Mel Gorman <mgorman@...e.de>, Valentin Schneider <vschneid@...hat.com>,
 Tejun Heo <tj@...nel.org>, David Vernet <void@...ifault.com>,
 Changwoo Min <changwoo@...lia.com>, Shuah Khan <shuah@...nel.org>,
 Joel Fernandes <joelagnelf@...dia.com>,
 Emil Tsalapatis <emil@...alapatis.com>,
 Luigi De Matteis <ldematteis123@...il.com>, sched-ext@...ts.linux.dev,
 bpf@...r.kernel.org, linux-kselftest@...r.kernel.org,
 linux-kernel@...r.kernel.org
Subject: Re: [PATCHSET v10 sched_ext/for-6.19] Add a deadline server for
 sched_ext tasks
On 10/29/25 19:08, Andrea Righi wrote:
> sched_ext tasks can be starved by long-running RT tasks, especially since
> RT throttling was replaced by deadline servers to boost only SCHED_NORMAL
> tasks.
> 
> Several users in the community have reported issues with RT stalling
> sched_ext tasks. This is fairly common on distributions or environments
> where applications like video compositors, audio services, etc. run as RT
> tasks by default.
> 
> Example trace (showing a per-CPU kthread stalled due to the sway Wayland
> compositor running as an RT task):
> 
>  runnable task stall (kworker/0:0[106377] failed to run for 5.043s)
>  ...
>  CPU 0   : nr_run=3 flags=0xd cpu_rel=0 ops_qseq=20646200 pnt_seq=45388738
>            curr=sway[994] class=rt_sched_class
>    R kworker/0:0[106377] -5043ms
>        scx_state/flags=3/0x1 dsq_flags=0x0 ops_state/qseq=0/0
>        sticky/holding_cpu=-1/-1 dsq_id=0x8000000000000002 dsq_vtime=0 slice=20000000
>        cpus=01
> 
> This is often perceived as a bug in the BPF schedulers, but in reality
> schedulers can't do much: RT tasks run outside their control and can
> potentially consume 100% of the CPU bandwidth.
> 
> Fix this by adding a sched_ext deadline server, so that sched_ext tasks are
> also boosted and do not suffer starvation.
> 
> Two kselftests are also provided to verify that the starvation fix works
> and that the bandwidth allocation is correct.
> 
> == Highlights in this version ==
> 
>  - wait for inactive_task_timer() to fire before removing the bandwidth
>    reservation (Juri/Peter: please check if this new
>    dl_server_remove_params() implementation makes sense to you)
>  - removed the explicit dl_server_stop() from dequeue_task_scx() and rely
>    on the delayed stop behavior (Juri/Peter: ditto)
> 
> This patchset is also available in the following git branch:
> 
>  git://git.kernel.org/pub/scm/linux/kernel/git/arighi/linux.git scx-dl-server
> 
> Changes in v10:
>  - reordered patches to better isolate sched_ext changes vs sched/deadline
>    changes (Andrea Righi)
>  - define ext_server only with CONFIG_SCHED_CLASS_EXT=y (Andrea Righi)
>  - add WARN_ON_ONCE(!cpus) check in dl_server_apply_params() (Andrea Righi)
>  - wait for inactive_task_timer to fire before removing the bandwidth
>    reservation (Juri Lelli)
>  - remove explicit dl_server_stop() in dequeue_task_scx() to reduce timer
>    reprogramming overhead (Juri Lelli)
>  - do not restart pick_task() when invoked by the dl_server (Tejun Heo)
>  - rename rq_dl_server to dl_server (Peter Zijlstra)
>  - fixed a missing dl_server start in dl_server_on() (Christian Loehle)
>  - add a comment to the rt_stall selftest to better explain the 4%
>    threshold (Emil Tsalapatis)
> 
> Changes in v9:
>  - Drop the ->balance() logic as its functionality is now integrated into
>    ->pick_task(), allowing dl_server to call pick_task_scx() directly
>  - Link to v8: https://lore.kernel.org/all/20250903095008.162049-1-arighi@nvidia.com/
> 
> Changes in v8:
>  - Add tj's patch to de-couple balance and pick_task and avoid changing
>    sched/core callbacks to propagate @rf
>  - Simplify dl_se->dl_server check (suggested by PeterZ)
>  - Small coding style fixes in the kselftests
>  - Link to v7: https://lore.kernel.org/all/20250809184800.129831-1-joelagnelf@nvidia.com/
> 
> Changes in v7:
>  - Rebased to Linus master
>  - Link to v6: https://lore.kernel.org/all/20250702232944.3221001-1-joelagnelf@nvidia.com/
> 
> Changes in v6:
>  - Added Acks to few patches
>  - Fixes to few nits suggested by Tejun
>  - Link to v5: https://lore.kernel.org/all/20250620203234.3349930-1-joelagnelf@nvidia.com/
> 
> Changes in v5:
>  - Added a kselftest (total_bw) to sched_ext to verify bandwidth values
>    from debugfs
>  - Address comment from Andrea about redundant rq clock invalidation
>  - Link to v4: https://lore.kernel.org/all/20250617200523.1261231-1-joelagnelf@nvidia.com/
> 
> Changes in v4:
>  - Fixed issues with hotplugged CPUs having their DL server bandwidth
>    altered due to loading SCX
>  - Fixed other issues
>  - Rebased on Linus master
>  - All sched_ext kselftests reliably pass now, also verified that the
>    total_bw in debugfs (CONFIG_SCHED_DEBUG) is conserved with these patches
>  - Link to v3: https://lore.kernel.org/all/20250613051734.4023260-1-joelagnelf@nvidia.com/
> 
> Changes in v3:
>  - Removed code duplication in debugfs. Made ext interface separate
>  - Fixed issue where rq_lock_irqsave was not used in the relinquish patch
>  - Fixed running bw accounting issue in dl_server_remove_params
>  - Link to v2: https://lore.kernel.org/all/20250602180110.816225-1-joelagnelf@nvidia.com/
> 
> Changes in v2:
>  - Fixed a hang related to using rq_lock instead of rq_lock_irqsave
>  - Added support to remove BW of DL servers when they are switched to/from EXT
>  - Link to v1: https://lore.kernel.org/all/20250315022158.2354454-1-joelagnelf@nvidia.com/
> 
> Andrea Righi (5):
>       sched/deadline: Add support to initialize and remove dl_server bandwidth
>       sched_ext: Add a DL server for sched_ext tasks
>       sched/deadline: Account ext server bandwidth
>       sched_ext: Selectively enable ext and fair DL servers
>       selftests/sched_ext: Add test for sched_ext dl_server
> 
> Joel Fernandes (6):
>       sched/debug: Fix updating of ppos on server write ops
>       sched/debug: Stop and start server based on if it was active
>       sched/deadline: Clear the defer params
>       sched/deadline: Add a server arg to dl_server_update_idle_time()
>       sched/debug: Add support to change sched_ext server params
>       selftests/sched_ext: Add test for DL server total_bw consistency
> 
>  kernel/sched/core.c                              |   3 +
>  kernel/sched/deadline.c                          | 169 +++++++++++---
>  kernel/sched/debug.c                             | 171 +++++++++++---
>  kernel/sched/ext.c                               | 144 +++++++++++-
>  kernel/sched/fair.c                              |   2 +-
>  kernel/sched/idle.c                              |   2 +-
>  kernel/sched/sched.h                             |   8 +-
>  kernel/sched/topology.c                          |   5 +
>  tools/testing/selftests/sched_ext/Makefile       |   2 +
>  tools/testing/selftests/sched_ext/rt_stall.bpf.c |  23 ++
>  tools/testing/selftests/sched_ext/rt_stall.c     | 222 ++++++++++++++++++
>  tools/testing/selftests/sched_ext/total_bw.c     | 281 +++++++++++++++++++++++
>  12 files changed, 955 insertions(+), 77 deletions(-)
>  create mode 100644 tools/testing/selftests/sched_ext/rt_stall.bpf.c
>  create mode 100644 tools/testing/selftests/sched_ext/rt_stall.c
>  create mode 100644 tools/testing/selftests/sched_ext/total_bw.c
Thanks Andrea, I've tested a few things I had in mind with no complaints.
Most importantly, it a) doesn't break the existing fair_server and b)
ensures BPF schedulers don't stall even with something like:

  sudo chrt -r 95 stress-ng --cpu 0 --taskset 0-$(($(nproc)-1)) -t 30m

For patches 0 to 9:
Tested-by: Christian Loehle <christian.loehle@....com>