linux-kernel - Re: [PATCH v2 15/15] sched/deadline: Always start a new period if CFS exceeded DL runtime


Message-ID: <fda31cb5-2297-4559-a462-124fb459e263@redhat.com>
Date: Fri, 5 Apr 2024 11:19:29 +0200
From: Daniel Bristot de Oliveira <bristot@...hat.com>
To: "Joel Fernandes (Google)" <joel@...lfernandes.org>,
 linux-kernel@...r.kernel.org, Ingo Molnar <mingo@...hat.com>,
 Peter Zijlstra <peterz@...radead.org>, Juri Lelli <juri.lelli@...hat.com>,
 Vincent Guittot <vincent.guittot@...aro.org>,
 Dietmar Eggemann <dietmar.eggemann@....com>,
 Steven Rostedt <rostedt@...dmis.org>, Ben Segall <bsegall@...gle.com>,
 Mel Gorman <mgorman@...e.de>, Valentin Schneider <vschneid@...hat.com>
Cc: Suleiman Souhlal <suleiman@...gle.com>,
 Youssef Esmat <youssefesmat@...gle.com>, David Vernet <void@...ifault.com>,
 Thomas Gleixner <tglx@...utronix.de>, "Paul E . McKenney"
 <paulmck@...nel.org>, joseph.salisbury@...onical.com,
 Luca Abeni <luca.abeni@...tannapisa.it>,
 Tommaso Cucinotta <tommaso.cucinotta@...tannapisa.it>,
 Vineeth Pillai <vineeth@...byteword.org>,
 Shuah Khan <skhan@...uxfoundation.org>, Phil Auld <pauld@...hat.com>
Subject: Re: [PATCH v2 15/15] sched/deadline: Always start a new period if CFS
 exceeded DL runtime

On 3/13/24 02:24, Joel Fernandes (Google) wrote:
> We believe that this is the right thing to do. The unit test
> (cs_dlserver_test) also agrees. If we let the CFS run without starting a
> new period, while the server is regularly throttled, then the test fails
> because CFS does not appear to get enough bandwidth.
> 
> Intuitively, this makes sense to do as well. If CFS used up all the CFS
> bandwidth, while the DL server was in a throttled state, it got the
> bandwidth it wanted and some. Now, we can start all over from scratch to
> guarantee it a minimum bandwidth.

So, this part of the code is not actually fundamental to the defer logic; it was
added as an optimization... but it has a problem...

> Signed-off-by: Joel Fernandes (Google) <joel@...lfernandes.org>
> ---
>  kernel/sched/deadline.c | 17 -----------------
>  1 file changed, 17 deletions(-)
> 
> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index 179369d27f66..a0ea668ac1bf 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -1454,23 +1454,6 @@ static void update_curr_dl_se(struct rq *rq, struct sched_dl_entity *dl_se, s64
>  	 * starting a new period, pushing the activation to the zero-lax time.
>  	 */
>  	if (dl_se->dl_defer && dl_se->dl_throttled && dl_runtime_exceeded(dl_se)) {
> -		s64 runtime_diff = dl_se->runtime + dl_se->dl_runtime;
> -
> -		/*
> -		 * If this is a regular throttling case, let it run negative until
> -		 * the dl_runtime - runtime > 0. The reason being is that the next
> -		 * replenishment will result in a positive runtime one period ahead.
> -		 *
> -		 * Otherwise, the deadline will be pushed more than one period, not
> -		 * providing runtime/period anymore.
> -		 *
> -		 * If the dl_runtime - runtime < 0, then the server was able to get
> -		 * the runtime/period before the replenishment. So it is safe
> -		 * to start a new deffered period.
> -		 */
> -		if (!dl_se->dl_defer_armed && runtime_diff > 0)
> -			return;

The idea was to reduce how often the timer is reset, aiming to avoid regressions
in the regular case in which the dl server never actually fires. It works fine
*if* the runtime is low relative to the period... like 5%... As it gets bigger,
it starts breaking things, and with a runtime > 50% of the period, it breaks.
That seems to be the case you guys have.
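
To put rough numbers on that, here is a toy userspace model of the removed
check. It is only a sketch and not kernel code: struct toy_dl_se and the toy_*
helpers are hypothetical names, time is in arbitrary ticks, and the
dl_defer/dl_throttled conditions from the real code are assumed to hold. It
counts how much time is consumed after a full replenishment before the check
stops postponing the new period.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the fields the removed check reads. */
struct toy_dl_se {
	int64_t runtime;	/* remaining runtime, allowed to go negative */
	int64_t dl_runtime;	/* configured runtime per period */
	bool dl_defer_armed;
};

/* Mirrors the removed condition: true means "return early, no new period yet". */
static bool toy_postpones_new_period(const struct toy_dl_se *se)
{
	int64_t runtime_diff = se->runtime + se->dl_runtime;

	return !se->dl_defer_armed && runtime_diff > 0;
}

/*
 * Ticks consumed, starting from a full replenishment, until the runtime is
 * exceeded (<= 0) and the check above no longer postpones the new period.
 */
static int64_t toy_consumed_before_reset(int64_t dl_runtime)
{
	struct toy_dl_se se = {
		.runtime = dl_runtime,
		.dl_runtime = dl_runtime,
		.dl_defer_armed = false,
	};
	int64_t consumed = 0;

	while (se.runtime > 0 || toy_postpones_new_period(&se)) {
		se.runtime--;	/* one tick of runtime consumed */
		consumed++;
	}
	return consumed;
}
```

In this model the reset is postponed until 2 * dl_runtime has been consumed.
With a period of 100 ticks, a runtime of 5 gives 10 ticks, well inside one
period, while a runtime of 60 gives 120 ticks, which is more than the period
itself. That is at least consistent with things going wrong above 50%, though
the model ignores everything else the scheduler does.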

At LPC, I actually expressed the concern to Vincent about resetting this timer. But
he mentioned that it was not that big of a deal because it does not happen often
enough to cause problems.

So, yeah, it is better to remove it... one can always come back to this and think
of logic that postpones the reset depending on the dl server's runtime as a % of
the period.

Removed in v6.

-- Daniel

>  		hrtimer_try_to_cancel(&dl_se->dl_timer);
>  
>  		replenish_dl_new_period(dl_se, dl_se->rq);