[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-ID: <20250929092221.10947-1-yurand2000@gmail.com>
Date: Mon, 29 Sep 2025 11:21:57 +0200
From: Yuri Andriaccio <yurand2000@...il.com>
To: Ingo Molnar <mingo@...hat.com>,
Peter Zijlstra <peterz@...radead.org>,
Juri Lelli <juri.lelli@...hat.com>,
Vincent Guittot <vincent.guittot@...aro.org>,
Dietmar Eggemann <dietmar.eggemann@....com>,
Steven Rostedt <rostedt@...dmis.org>,
Ben Segall <bsegall@...gle.com>,
Mel Gorman <mgorman@...e.de>,
Valentin Schneider <vschneid@...hat.com>
Cc: linux-kernel@...r.kernel.org,
Luca Abeni <luca.abeni@...tannapisa.it>,
Yuri Andriaccio <yuri.andriaccio@...tannapisa.it>
Subject: [RFC PATCH v3 00/24] Hierarchical Constant Bandwidth Server
Hello,
This is the v3 for Hierarchical Constant Bandwidth Server, aiming at replacing
the current RT_GROUP_SCHED mechanism with something more robust and
theoretically sound. The patchset has been presented at OSPM25
(https://retis.sssup.it/ospm-summit/), and a summary of its inner workings can
be found at https://lwn.net/Articles/1021332/ . You can find the previous
versions of this patchset at the bottom of the page, in particular version 1
which talks in more detail what this patchset is all about and how it is
implemented.
This v3 version further reworks some of the patches as suggested by Juri Lelli.
While most of the work is refactorings, the following were also changed:
- The first patch which removed fair-servers' bandwidth accounting has been
removed, as it was deemed wrong. You can find the last version of this removed
patch, just for history reasons, here:
https://lore.kernel.org/all/20250903114448.664452-1-yurand2000@gmail.com/
- A left-over check which prevented execution of some of wakeup_preempt code has
been removed.
- Cgroup pull code was erroneusly comparing cgroup with non-cgroup tasks, now it
has been fixed.
- The allocation/deallocation code for rt cgroups has been checked and reworked
to make sure that resources are managed correctly in all the code paths.
- Some signatures of cgroup migration related functions where changed to match
more closely to their non-group counterparts.
- Descriptions and documentation were added where necessary, in particular for
preemption rules in wakeup_preempt.
For this v3 version we've also polished the testing system we are using and made
it public for testers to run on their own machines. The source code can be found
at https://github.com/Yurand2000/HCBS-Test-Suite , along with a README that
explains how to use it. Nonetheless I've reported a description of the tools and
instruction later in the page.
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Summary of the patches:
1-4) Preparation patches, so that the RT classes' code can be used both
for normal and cgroup scheduling.
5-15) Implementation of HCBS, no migration and only one level hierarchy.
The old RT_GROUP_SCHED code is removed.
16-17) Remove cgroups v1 in favour of v2.
18) Add support for deeper hierarchies.
19-24) Add support for tasks migration.
Updates from v2:
- Rebase to latest tip/master.
- Remove fair-servers' bw reclaiming.
- Fix a check which prevented execution of wakeup_preempt code.
- Fix a priority check in group_pull_rt_task between tasks of different groups.
- Rework allocation/deallocation code for rt-cgroups.
- Update signatures for some group related migration functions.
- Add documentation for wakeup_preempt preemption rules.
Updates from v1:
- Rebase to latest tip/master.
- Add migration code.
- Split big patches for more readability.
- Refactor code to use guarded locks where applicable.
- Remove unnecessary patches from v1 which have been addressed differently by
mainline updates.
- Remove unnecessary checks and general code cleanup.
Notes:
Task migration support needs some extra work to reduce its invasiveness,
especially patches 21-22.
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Testing v3:
The HCBS mechanism has been evaluated on several syntetic tests which are
designed to stress the HCBS scheduler and verify that non-interference and
mathematical schedulability guarantees are really enforced by the scheduling
algorithm.
The test suite currently runs different categories of tests:
- Constraints, which are tasked to assert that hard constraints, such as
schedulability conditions, are respected.
- Regression, to check that HCBS does not break anything that already exists.
- Stress, to repeatedly invoke the scheduler in all the exposed interfaces,
with the goal to detect bugs and more importantly race conditions.
- Time, simple benchmarks to assert that the dl_servers work correctly, i.e.
they allocate the correct amount of bandwidth, and that migration code allows
to fully utilize the cgroup's allocated bw.
- Taskset: given a set of (generated) periodic tasks and their bandwidth
requirements, schedulability analyses are performed to decide whether or not a
given hardware configuration can run the taskset. In particular, for each
taskset, a HCBS's cgroup configuration along with the number of necessary CPUs
is generated. These are mathematically guaranteed to be schedulable.
The next step of this test suite is to configure cgroups as computed and to
run the taskset, to verify that the HCBS implementation works as intended and
that the scheduling overheads are within reasonable bounds.
The source code can be found at https://github.com/Yurand2000/HCBS-Test-Suite .
The README file should explain most if not all questions, but I'm writing
briefly the pipeline to run these tests here:
- Get the HCBS patch up and running. Any kernel/disto should work effortlessly.
- Get, compile and _install_ the tests.
- Download the additional taskset files and extract them in the _install_
folder. You can find them here:
https://github.com/Yurand2000/HCBS-Test-Suite/releases/tag/250926
- Run the `run_tests.sh full` script, to run the whole test suite.
Expect a total runtime of ~3 hours. The script will automatically mount the
cgroup and debug filesystems (if not already mounted) and will move all the
already running SCHED_FIFO/SCHED_RR tasks in the root cgroup, so that the
cgroups' CPU controller can be mounted. It will additionally try to reserve all
the possible rt-bandwidth for cgroups (i.e. 90%) to run all the later tests, so
make sure that there are no running SCHED_DEADLINE tasks if the script fails to
setup.
Some tests specifically need a minimum amount of CPU cores, up to a maximum of
eight. If your machine has less CPUs then the tests will simply be skipped.
Notes:
The tasksets minimal requirements were computed using a closed-source software,
explaining why the tasksets are supplied separately. A open-source analyser is
being written to update this step in the future and also allow for more
customization for the testers.
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Future Work:
While we wait for more comments, and expect stuff to break, we will work on
completing the currently partial/untested, implementation of HCBS with different
runtimes per CPU, instead of having the same runtime allocated on all CPUs, to
include it in a future RCF.
Future patches:
- HCBS with different runtimes per CPU.
- capacity aware bandwidth reservation.
- enable/disable dl_servers when a CPU goes online/offline.
Have a nice day,
Yuri
v1: https://lore.kernel.org/all/20250605071412.139240-1-yurand2000@gmail.com/
v2: https://lore.kernel.org/all/20250731105543.40832-1-yurand2000@gmail.com/
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Yuri Andriaccio (6):
sched/rt: Disable RT_GROUP_SCHED
sched/rt: Add rt-cgroups' dl-servers operations.
sched/rt: Update task event callbacks for HCBS scheduling
sched/rt: Allow zeroing the runtime of the root control group
sched/rt: Remove support for cgroups-v1
sched/core: Execute enqueued balance callbacks when migrating task
betweeen cgroups
luca abeni (18):
sched/deadline: Do not access dl_se->rq directly
sched/deadline: Distinct between dl_rq and my_q
sched/rt: Pass an rt_rq instead of an rq where needed
sched/rt: Move some functions from rt.c to sched.h
sched/rt: Introduce HCBS specific structs in task_group
sched/core: Initialize root_task_group
sched/deadline: Add dl_init_tg
sched/rt: Add {alloc/free}_rt_sched_group
sched/deadline: Account rt-cgroups bandwidth in deadline tasks
schedulability tests.
sched/rt: Update rt-cgroup schedulability checks
sched/rt: Remove old RT_GROUP_SCHED data structures
sched/core: Cgroup v2 support
sched/deadline: Allow deeper hierarchies of RT cgroups
sched/rt: Add rt-cgroup migration
sched/rt: Add HCBS migration related checks and function calls
sched/deadline: Make rt-cgroup's servers pull tasks on timer
replenishment
sched/deadline: Fix HCBS migrations on server stop
sched/core: Execute enqueued balance callbacks when changing allowed
CPUs
include/linux/sched.h | 10 +-
kernel/sched/autogroup.c | 4 +-
kernel/sched/core.c | 65 +-
kernel/sched/deadline.c | 251 +++-
kernel/sched/debug.c | 6 -
kernel/sched/fair.c | 6 +-
kernel/sched/rt.c | 3069 +++++++++++++++++++-------------------
kernel/sched/sched.h | 150 +-
kernel/sched/syscalls.c | 6 +-
9 files changed, 1850 insertions(+), 1717 deletions(-)
base-commit: cec1e6e5d1ab33403b809f79cd20d6aff124ccfe
--
2.51.0
Powered by blists - more mailing lists