Message-ID: <aFpYl53ZMThWjQai@jlelli-thinkpadt14gen4.remote.csb>
Date: Tue, 24 Jun 2025 09:49:43 +0200
From: Juri Lelli <juri.lelli@...hat.com>
To: luca abeni <luca.abeni@...tannapisa.it>
Cc: Marcel Ziswiler <marcel.ziswiler@...ethink.co.uk>,
	linux-kernel@...r.kernel.org, Ingo Molnar <mingo@...hat.com>,
	Peter Zijlstra <peterz@...radead.org>,
	Vineeth Pillai <vineeth@...byteword.org>
Subject: Re: SCHED_DEADLINE tasks missing their deadline with
 SCHED_FLAG_RECLAIM jobs in the mix (using GRUB)

On 20/06/25 18:52, luca abeni wrote:
> On Fri, 20 Jun 2025 17:28:28 +0200
> Juri Lelli <juri.lelli@...hat.com> wrote:
> 
> > On 20/06/25 16:16, luca abeni wrote:
> [...]
> > > So, I had a look trying to remember the situation... This is my
> > > current understanding:
> > > - the max_bw field should be just the maximum amount of CPU
> > >   bandwidth we want to use with reclaiming... It is
> > >   rt_runtime_us / rt_period_us; I guess it is cached in this field
> > >   just to avoid computing it every time.
> > >   So, max_bw should be updated only when
> > >   /proc/sys/kernel/sched_rt_{runtime,period}_us are written
> > > - the extra_bw field represents an additional amount of CPU
> > >   bandwidth we can reclaim on each core (the original m-GRUB
> > >   algorithm just reclaimed Uinact, the utilization of inactive tasks).
> > >   It is initialized to Umax when no SCHED_DEADLINE tasks exist and
> > 
> > Is Umax == max_bw from above?
> 
> Yes; sorry about the confusion
> 
> 
> > >   should be decreased by Ui when a task with utilization Ui becomes
> > >   SCHED_DEADLINE (and increased by Ui when the SCHED_DEADLINE task
> > >   terminates or changes scheduling policy). Since this value is
> > >   per-core, Ui is divided by the number of cores in the root
> > >   domain... From what you write, I guess extra_bw is not correctly
> > >   initialized/updated when a new root domain is created?
> > 
> > It looks like so, yeah. After boot and when domains are dynamically
> > created. But I am still not 100% sure, I only see weird numbers that I
> > struggle to relate to what you say above. :)
> 
> BTW, when running some tests on different machines I think I found out
> that 6.11 does not exhibit this issue (this needs to be confirmed, I am
> working on reproducing the test with different kernels on the same
> machine)
> 
> If I manage to reproduce this result, I think I can run a bisect to find
> the commit introducing the issue (git is telling me that I'll need about
> 15 tests :)
> So, stay tuned...
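
Just to double check that I read the accounting you describe above the
same way, here is a tiny standalone sketch of it (illustrative only, not
the kernel code; the <<20 fixed point and all the numbers are just for
the example):

/*
 * Standalone illustration of the max_bw/extra_bw accounting described
 * above; same arithmetic, made-up numbers, not kernel code.
 */
#include <stdio.h>
#include <stdint.h>

#define BW_SHIFT 20
#define BW_UNIT  (1ULL << BW_SHIFT)

/* runtime/period as a <<20 fixed-point utilization */
static uint64_t to_ratio(uint64_t period, uint64_t runtime)
{
	return (runtime << BW_SHIFT) / period;
}

int main(void)
{
	int cpus = 4;				/* CPUs in the root domain      */

	/* max_bw: cached rt_runtime_us / rt_period_us (default 95%)          */
	uint64_t max_bw = to_ratio(1000000, 950000);

	/* extra_bw starts at Umax == max_bw on each core                     */
	uint64_t extra_bw = max_bw;

	/* a task with 10ms runtime every 100ms is admitted: Ui = 0.1         */
	uint64_t u_i = to_ratio(100000, 10000);

	/* per-core share: each core's extra_bw loses Ui / cpus on admission  */
	extra_bw -= u_i / cpus;

	printf("max_bw   = %.3f\n", (double)max_bw / BW_UNIT);
	printf("Ui       = %.3f\n", (double)u_i / BW_UNIT);
	printf("extra_bw = %.3f per core after admission\n",
	       (double)extra_bw / BW_UNIT);
	return 0;
}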

The following seems to at least cure the problem after boot. Things are
still broken after cpusets creation. Moving on to look into that, but I
wanted to share where I am so that we don't duplicate work.

Rationale for the below is that we currently end up calling
__dl_update() with a 'cpus' count that is not stable yet. So, I tried to
move the initialization to after SMP is up (all CPUs have been onlined).
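
Roughly what I mean, as a toy model only (not the kernel code; the 4-CPU
count and the 95% global limit are made-up example numbers, the 50ms/1s
pair is the fair server default from the patch below):

/* Toy model of the ordering problem; not kernel code. */
#include <stdio.h>

int main(void)
{
	double max_bw = 950000.0 / 1000000.0;	/* rt_runtime_us / rt_period_us */
	double srv_bw = 50.0 / 1000.0;		/* fair server: 50ms / 1000ms   */

	/* per-core share applied while the root domain only sees the boot CPU */
	int cpus_at_init = 1;
	double extra_bw = max_bw - srv_bw / cpus_at_init;

	/* the per-core share it should get once all CPUs are online           */
	int cpus_final = 4;
	double expected = max_bw - srv_bw / cpus_final;

	printf("extra_bw if applied too early: %.4f\n", extra_bw);
	printf("extra_bw expected (%d CPUs):    %.4f\n", cpus_final, expected);
	return 0;
}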

---
 kernel/sched/core.c     |  3 +++
 kernel/sched/deadline.c | 39 +++++++++++++++++++++++----------------
 kernel/sched/sched.h    |  1 +
 3 files changed, 27 insertions(+), 16 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 8988d38d46a38..d152f8a84818b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -8470,6 +8470,8 @@ void __init sched_init_smp(void)
 	init_sched_rt_class();
 	init_sched_dl_class();
 
+	sched_init_dl_servers();
+
 	sched_smp_initialized = true;
 }
 
@@ -8484,6 +8486,7 @@ early_initcall(migration_init);
 void __init sched_init_smp(void)
 {
 	sched_init_granularity();
+	sched_init_dl_servers();
 }
 #endif /* CONFIG_SMP */
 
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index ad45a8fea245e..9f3b3f3592a58 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1647,22 +1647,6 @@ void dl_server_start(struct sched_dl_entity *dl_se)
 {
 	struct rq *rq = dl_se->rq;
 
-	/*
-	 * XXX: the apply do not work fine at the init phase for the
-	 * fair server because things are not yet set. We need to improve
-	 * this before getting generic.
-	 */
-	if (!dl_server(dl_se)) {
-		u64 runtime =  50 * NSEC_PER_MSEC;
-		u64 period = 1000 * NSEC_PER_MSEC;
-
-		dl_server_apply_params(dl_se, runtime, period, 1);
-
-		dl_se->dl_server = 1;
-		dl_se->dl_defer = 1;
-		setup_new_dl_entity(dl_se);
-	}
-
 	if (!dl_se->dl_runtime)
 		return;
 
@@ -1693,6 +1677,29 @@ void dl_server_init(struct sched_dl_entity *dl_se, struct rq *rq,
 	dl_se->server_pick_task = pick_task;
 }
 
+void sched_init_dl_servers(void)
+{
+	int cpu;
+	struct rq *rq;
+	struct sched_dl_entity *dl_se;
+
+	for_each_online_cpu(cpu) {
+		u64 runtime =  50 * NSEC_PER_MSEC;
+		u64 period = 1000 * NSEC_PER_MSEC;
+
+		rq = cpu_rq(cpu);
+		dl_se = &rq->fair_server;
+
+		WARN_ON(dl_server(dl_se));
+
+		dl_server_apply_params(dl_se, runtime, period, 1);
+
+		dl_se->dl_server = 1;
+		dl_se->dl_defer = 1;
+		setup_new_dl_entity(dl_se);
+	}
+}
+
 void __dl_server_attach_root(struct sched_dl_entity *dl_se, struct rq *rq)
 {
 	u64 new_bw = dl_se->dl_bw;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 475bb5998295e..22301c28a5d2d 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -384,6 +384,7 @@ extern void dl_server_stop(struct sched_dl_entity *dl_se);
 extern void dl_server_init(struct sched_dl_entity *dl_se, struct rq *rq,
 		    dl_server_has_tasks_f has_tasks,
 		    dl_server_pick_f pick_task);
+extern void sched_init_dl_servers(void);
 
 extern void dl_server_update_idle_time(struct rq *rq,
 		    struct task_struct *p);
-- 
2.49.0

