Message-ID: <20260126100050.3854740-1-arighi@nvidia.com>
Date: Mon, 26 Jan 2026 10:58:58 +0100
From: Andrea Righi <arighi@...dia.com>
To: Ingo Molnar <mingo@...hat.com>,
	Peter Zijlstra <peterz@...radead.org>,
	Juri Lelli <juri.lelli@...hat.com>,
	Vincent Guittot <vincent.guittot@...aro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@....com>,
	Steven Rostedt <rostedt@...dmis.org>,
	Ben Segall <bsegall@...gle.com>,
	Mel Gorman <mgorman@...e.de>,
	Valentin Schneider <vschneid@...hat.com>,
	Tejun Heo <tj@...nel.org>,
	Joel Fernandes <joelagnelf@...dia.com>,
	David Vernet <void@...ifault.com>,
	Changwoo Min <changwoo@...lia.com>,
	Daniel Hodges <hodgesd@...a.com>,
	Christian Loehle <christian.loehle@....com>,
	Emil Tsalapatis <emil@...alapatis.com>,
	sched-ext@...ts.linux.dev,
	linux-kernel@...r.kernel.org
Subject: [PATCHSET v12 sched_ext/for-6.20] Add a deadline server for sched_ext tasks

sched_ext tasks can be starved by long-running RT tasks, especially since
RT throttling was replaced by deadline servers, which boost only
SCHED_NORMAL tasks.

Several users in the community have reported issues with RT stalling
sched_ext tasks. This is fairly common on distributions or environments
where applications like video compositors, audio services, etc. run as RT
tasks by default.

Example trace (showing a per-CPU kthread stalled due to the sway Wayland
compositor running as an RT task):

 runnable task stall (kworker/0:0[106377] failed to run for 5.043s)
 ...
 CPU 0   : nr_run=3 flags=0xd cpu_rel=0 ops_qseq=20646200 pnt_seq=45388738
           curr=sway[994] class=rt_sched_class
   R kworker/0:0[106377] -5043ms
       scx_state/flags=3/0x1 dsq_flags=0x0 ops_state/qseq=0/0
       sticky/holding_cpu=-1/-1 dsq_id=0x8000000000000002 dsq_vtime=0 slice=20000000
       cpus=01

This is often perceived as a bug in the BPF schedulers, but in reality they
can't do much: RT tasks run outside their control and can potentially
consume 100% of the CPU bandwidth.

Fix this by adding a sched_ext deadline server, so that sched_ext tasks are
also boosted and do not suffer starvation.

Two kselftests are also provided to verify that the starvation is fixed and
that the bandwidth allocation is correct.

== Design ==

 - The EXT server is initialized at boot time and remains configured
   throughout the system's lifetime
 - It starts automatically when the first sched_ext task is enqueued
   (rq->scx.nr_running == 1)
 - The server's pick function (ext_server_pick_task) always selects
   sched_ext tasks when active
 - Runtime accounting happens in update_curr_scx() during task execution
   and update_curr_idle() when idle
 - Bandwidth accounting includes both fair and ext servers in root domain
   calculations
 - A debugfs interface (/sys/kernel/debug/sched/ext_server/) allows runtime
   tuning of server parameters (see notes below)
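
To illustrate the bandwidth accounting point above, here is a minimal
userspace sketch (not code from this patch set) of how a per-CPU
reservation is expressed as a fixed-point ratio and summed over a root
domain. BW_SHIFT and the runtime/period ratio follow the scheduler's
existing to_ratio() helper, simplified here. The 50ms/1s values are the
fair server defaults in mainline; using the same values for the ext server
is an assumption made for illustration, and servers_total_bw() and
bw_to_percent() are hypothetical helper names.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define BW_SHIFT 20                 /* fixed-point shift used by kernel/sched */
#define BW_UNIT  (1ULL << BW_SHIFT) /* 1.0, i.e. 100% of one CPU */

/* Simplified userspace model of to_ratio(): runtime/period in fixed point. */
static uint64_t to_ratio(uint64_t period_ns, uint64_t runtime_ns)
{
	assert(period_ns != 0);
	return (runtime_ns << BW_SHIFT) / period_ns;
}

/*
 * Hypothetical helper: total bandwidth reserved in a root domain when every
 * CPU carries both a fair server and an ext server.
 */
static uint64_t servers_total_bw(unsigned int nr_cpus,
				 uint64_t fair_bw, uint64_t ext_bw)
{
	return (uint64_t)nr_cpus * (fair_bw + ext_bw);
}

/* Hypothetical helper: fixed-point bandwidth as a percentage of one CPU. */
static double bw_to_percent(uint64_t bw)
{
	return 100.0 * (double)bw / (double)BW_UNIT;
}
```

With 50ms of runtime every 1s, each server reserves about 5% of a CPU, so
under the assumption above the two servers together reserve about 10% per
CPU, and servers_total_bw() scales that by the number of CPUs in the root
domain.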

== Notes ==

1) As discussed during the sched_ext microconference at LPC Tokyo, the plan
is to start with a simple approach, avoiding automatically creating or
tearing down the EXT server bandwidth reservation when a BPF scheduler is
loaded or unloaded. Instead, the reservation is kept permanently active.
This significantly simplifies the logic while still addressing the
starvation issue.

Any fine-tuning of the bandwidth reservation is delegated to the system
administrator, who can adjust it via the debugfs interface. In the future,
a more suitable interface can be introduced and automatic removal of the
reservation when the BPF scheduler is unloaded can be revisited.

A better interface to adjust the dl_server bandwidth reservation can be
discussed at the upcoming OSPM
(https://lore.kernel.org/lkml/aULDwbALUj0V7cVk@jlelli-thinkpadt14gen4.remote.csb/).
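
As a rough sketch of what inspecting the reservation from userspace could
look like, the helpers below build a parameter path under the debugfs
directory named above and parse its value. The per-CPU cpu<N>/runtime and
cpu<N>/period layout is an assumption, mirrored from the existing
fair_server debugfs directory; all function names are hypothetical, and
reading the files requires root and a mounted debugfs.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define EXT_SERVER_DIR "/sys/kernel/debug/sched/ext_server"

/* Build "<dir>/cpu<N>/<param>"; the cpu<N>/<param> layout is an assumption. */
static int ext_server_param_path(char *buf, size_t len, int cpu,
				 const char *param)
{
	int n;

	assert(buf != NULL && param != NULL);
	n = snprintf(buf, len, "%s/cpu%d/%s", EXT_SERVER_DIR, cpu, param);
	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Parse a decimal nanosecond value such as "50000000\n". */
static int parse_ns(const char *text, uint64_t *out)
{
	char *end;
	unsigned long long v;

	errno = 0;
	v = strtoull(text, &end, 10);
	if (errno != 0 || end == text)
		return -1;
	*out = (uint64_t)v;
	return 0;
}

/* Read one parameter for one CPU; returns 0 on success, -1 on any error. */
static int ext_server_read_param(int cpu, const char *param, uint64_t *out)
{
	char path[128], line[64];
	FILE *f;
	int ret = -1;

	if (ext_server_param_path(path, sizeof(path), cpu, param) != 0)
		return -1;
	f = fopen(path, "r");
	if (!f)
		return -1;
	if (fgets(line, sizeof(line), f))
		ret = parse_ns(line, out);
	fclose(f);
	return ret;
}
```

For example, ext_server_read_param(0, "runtime", &ns) would return the
runtime configured for CPU 0 if the assumed layout holds, and -1 otherwise.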

2) IMPORTANT: this patchset requires [1] to function properly (sent
separately, not included in this patchset).

[1] https://lore.kernel.org/all/20260123161645.2181752-1-arighi@nvidia.com/

This patchset is also available in the following git branch:

 git://git.kernel.org/pub/scm/linux/kernel/git/arighi/linux.git scx-dl-server

Changes in v12:
 - Move dl_server execution state reset on stop fix to a separate patch
   (https://lore.kernel.org/all/20260123161645.2181752-1-arighi@nvidia.com/)
 - Removed per-patch changelog (keeping a global changelog here)
 - Link to v11: https://lore.kernel.org/all/20260120215808.188032-1-arighi@nvidia.com/

Changes in v11:
 - do not create/remove the bandwidth reservation for the ext server when a
   BPF scheduler is loaded/unloaded, but keep the bandwidth reservation
   always active
 - change rt_stall kselftest to validate both FAIR and EXT DL servers
 - Link to v10: https://lore.kernel.org/all/20250903095008.162049-1-arighi@nvidia.com/

Changes in v10:
 - reordered patches to better isolate sched_ext changes vs sched/deadline
   changes (Andrea Righi)
 - define ext_server only with CONFIG_SCHED_CLASS_EXT=y (Andrea Righi)
 - add WARN_ON_ONCE(!cpus) check in dl_server_apply_params() (Andrea Righi)
 - wait for inactive_task_timer to fire before removing the bandwidth
   reservation (Juri Lelli)
 - remove explicit dl_server_stop() in dequeue_task_scx() to reduce timer
   reprogramming overhead (Juri Lelli)
 - do not restart pick_task() when invoked by the dl_server (Tejun Heo)
 - rename rq_dl_server to dl_server (Peter Zijlstra)
 - fixed a missing dl_server start in dl_server_on() (Christian Loehle)
 - add a comment to the rt_stall selftest to better explain the 4%
   threshold (Emil Tsalapatis)
 - Link to v9: https://lore.kernel.org/all/20251017093214.70029-1-arighi@nvidia.com/

Changes in v9:
 - Drop the ->balance() logic as its functionality is now integrated into
   ->pick_task(), allowing dl_server to call pick_task_scx() directly
 - Link to v8: https://lore.kernel.org/all/20250903095008.162049-1-arighi@nvidia.com/

Changes in v8:
 - Add tj's patch to de-couple balance and pick_task and avoid changing
   sched/core callbacks to propagate @rf
 - Simplify dl_se->dl_server check (suggested by PeterZ)
 - Small coding style fixes in the kselftests
 - Link to v7: https://lore.kernel.org/all/20250809184800.129831-1-joelagnelf@nvidia.com/

Changes in v7:
 - Rebased to Linus master
 - Link to v6: https://lore.kernel.org/all/20250702232944.3221001-1-joelagnelf@nvidia.com/

Changes in v6:
 - Added Acks to a few patches
 - Fixed a few nits suggested by Tejun
 - Link to v5: https://lore.kernel.org/all/20250620203234.3349930-1-joelagnelf@nvidia.com/

Changes in v5:
 - Added a kselftest (total_bw) to sched_ext to verify bandwidth values
   from debugfs
 - Address comment from Andrea about redundant rq clock invalidation
 - Link to v4: https://lore.kernel.org/all/20250617200523.1261231-1-joelagnelf@nvidia.com/

Changes in v4:
 - Fixed issues with hotplugged CPUs having their DL server bandwidth
   altered due to loading SCX
 - Fixed other issues
 - Rebased on Linus master
 - All sched_ext kselftests reliably pass now, also verified that the
   total_bw in debugfs (CONFIG_SCHED_DEBUG) is conserved with these patches
 - Link to v3: https://lore.kernel.org/all/20250613051734.4023260-1-joelagnelf@nvidia.com/

Changes in v3:
 - Removed code duplication in debugfs. Made ext interface separate
 - Fixed issue where rq_lock_irqsave was not used in the relinquish patch
 - Fixed running bw accounting issue in dl_server_remove_params
 - Link to v2: https://lore.kernel.org/all/20250602180110.816225-1-joelagnelf@nvidia.com/

Changes in v2:
 - Fixed a hang related to using rq_lock instead of rq_lock_irqsave
 - Added support to remove BW of DL servers when they are switched to/from EXT
 - Link to v1: https://lore.kernel.org/all/20250315022158.2354454-1-joelagnelf@nvidia.com/

Andrea Righi (2):
      sched_ext: Add a DL server for sched_ext tasks
      selftests/sched_ext: Add test for sched_ext dl_server

Joel Fernandes (5):
      sched/deadline: Clear the defer params
      sched/debug: Fix updating of ppos on server write ops
      sched/debug: Stop and start server based on if it was active
      sched/debug: Add support to change sched_ext server params
      selftests/sched_ext: Add test for DL server total_bw consistency

 kernel/sched/core.c                              |   6 +
 kernel/sched/deadline.c                          |  86 +++++--
 kernel/sched/debug.c                             | 171 +++++++++++---
 kernel/sched/ext.c                               |  33 +++
 kernel/sched/idle.c                              |   3 +
 kernel/sched/sched.h                             |   2 +
 kernel/sched/topology.c                          |   5 +
 tools/testing/selftests/sched_ext/Makefile       |   2 +
 tools/testing/selftests/sched_ext/rt_stall.bpf.c |  23 ++
 tools/testing/selftests/sched_ext/rt_stall.c     | 240 +++++++++++++++++++
 tools/testing/selftests/sched_ext/total_bw.c     | 281 +++++++++++++++++++++++
 11 files changed, 801 insertions(+), 51 deletions(-)
 create mode 100644 tools/testing/selftests/sched_ext/rt_stall.bpf.c
 create mode 100644 tools/testing/selftests/sched_ext/rt_stall.c
 create mode 100644 tools/testing/selftests/sched_ext/total_bw.c
