Message-ID: <f5801f5b-6ee8-6b84-b6bb-46e89b165091@arm.com>
Date:   Fri, 19 May 2023 19:56:38 +0200
From:   Dietmar Eggemann <dietmar.eggemann@....com>
To:     Vineeth Pillai <vineeth@...byteword.org>,
        luca.abeni@...tannapisa.it, Juri Lelli <juri.lelli@...hat.com>,
        Daniel Bristot de Oliveira <bristot@...hat.com>,
        Peter Zijlstra <peterz@...radead.org>,
        Ingo Molnar <mingo@...hat.com>,
        Vincent Guittot <vincent.guittot@...aro.org>,
        Steven Rostedt <rostedt@...dmis.org>,
        Joel Fernandes <joel@...lfernandes.org>,
        Ben Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>,
        Valentin Schneider <vschneid@...hat.com>
Cc:     Jonathan Corbet <corbet@....net>, linux-kernel@...r.kernel.org,
        linux-doc@...r.kernel.org
Subject: Re: [PATCH v3 2/5] sched/deadline: Fix reclaim inaccuracy with SMP

Hi Vineeth,

On 15/05/2023 04:57, Vineeth Pillai wrote:
> In a multi-processor system, bandwidth usage is divided equally among
> all cpus. This causes issues with reclaiming free bandwidth on a cpu:
> "Uextra" is the same on all cpus in a root domain, while running_bw
> differs based on the reserved bandwidth of the tasks running on each
> cpu. This causes disproportionate reclaiming - a task with a smaller
> reservation reclaims less even if it's the only task running on its
> cpu.
> 
> Following is a small test with three tasks with reservations (8,10),
> (1,10) and (1,100). These three tasks run on different cpus. But
> since the reclamation logic calculates the available bandwidth as a
> factor of the globally available bandwidth, a task with a smaller
> reservation reclaims only a little compared to one with a larger
> reservation, even if its cpu has free bandwidth available to be
> reclaimed.
> 
> TID[730]: RECLAIM=1, (r=8ms, d=10ms, p=10ms), Util: 95.05
> TID[731]: RECLAIM=1, (r=1ms, d=10ms, p=10ms), Util: 31.34
> TID[732]: RECLAIM=1, (r=1ms, d=100ms, p=100ms), Util: 3.16

What does this 'Util: X' value stand for? I assume it's the utilization
of the task? How do you obtain it?
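
FWIW, assuming 'Util' is the share of wall-clock time the task actually
consumed, I can roughly reproduce your numbers from the old per-rq
u_act if I further assume a 3-CPU root domain, so that on every cpu
Uextra = Umax - (0.8 + 0.1 + 0.01)/3 ~= 0.647. Back-of-envelope
userspace sketch (my reconstruction, not kernel code):

    #include <stdio.h>

    int main(void)
    {
            double Umax = 0.95;                   /* max_bw */
            double u[3] = { 0.8, 0.1, 0.01 };     /* (8,10) (1,10) (1,100) */
            double extra = Umax - 0.91 / 3;       /* same on all 3 cpus */

            for (int i = 0; i < 3; i++) {
                    /* each task runs alone on its cpu -> u_inact = 0 */
                    double u_act = Umax - extra;
                    if (u_act < u[i])
                            u_act = u[i];         /* max{u, ...} */
                    /* dq = -(u_act/Umax) dt => Util = u * Umax / u_act */
                    printf("u=%4.2f Util=%5.2f%%\n",
                           u[i], 100.0 * u[i] * Umax / u_act);
            }
            return 0;
    }

This prints 95.00, 31.32 and 3.13, close to your 95.05, 31.34 and 3.16.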

I see that e.g. TID[731] should run 1ms every 10ms w/o GRUB, and with
GRUB the runtime could potentially be longer since 'scaled_delta_exec <
delta'.
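
The path I'm looking at, heavily simplified from update_curr_dl() (the
real code also scales by frequency/capacity in the non-reclaim branch):

    if (dl_se->flags & SCHED_FLAG_RECLAIM)
            /* GRUB: deplete only the reclaim-scaled delta */
            scaled_delta_exec = grub_reclaim(delta_exec, rq, dl_se);
    else
            scaled_delta_exec = delta_exec;

    dl_se->runtime -= scaled_delta_exec;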

I don't get this comment in update_curr_dl():

1325    /*
1326     * For tasks that participate in GRUB, we implement GRUB-PA: the
1327     * spare reclaimed bandwidth is used to clock down frequency.
1328     *

It looks like dl_se->runtime is affected and with 'scaled_delta_exec <
delta' the task runs longer than dl_se->dl_runtime?
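
I.e. if the budget depletes as dq = -factor * dt with
factor = scaled_delta_exec/delta < 1, it only runs out after

    dl_runtime / factor > dl_runtime

of wall-clock time per period, which is presumably what the 'Util'
numbers above reflect.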

> Fix: use the available bandwidth on each cpu to calculate reclaimable
> bandwidth. Admission control takes care of total bandwidth and hence
> using the available bandwidth on a specific cpu would not break the
> deadline guarantees.
> 
> With this fix, the above test behaves as follows:
> TID[586]: RECLAIM=1, (r=1ms, d=100ms, p=100ms), Util: 95.24
> TID[585]: RECLAIM=1, (r=1ms, d=10ms, p=10ms), Util: 95.01
> TID[584]: RECLAIM=1, (r=8ms, d=10ms, p=10ms), Util: 95.01
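
With the per-CPU formula each of these tasks runs alone on its cpu, so
running_bw = u and dq = -(u/max_bw) dt. The r=1ms budget of the (1,100)
task then stretches to

    1ms * max_bw/u = 1ms * 0.95/0.01 = 95ms per 100ms period

i.e. ~95% for all three tasks, matching the numbers above (my
arithmetic, same 'Util' assumption as before).
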
> 
> Signed-off-by: Vineeth Pillai (Google) <vineeth@...byteword.org>
> ---
>  kernel/sched/deadline.c | 22 +++++++---------------
>  1 file changed, 7 insertions(+), 15 deletions(-)
> 
> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index 91451c1c7e52..85902c4c484b 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -1272,7 +1272,7 @@ int dl_runtime_exceeded(struct sched_dl_entity *dl_se)
>   *	Umax:		Max usable bandwidth for DL. Currently
>   *			= sched_rt_runtime_us / sched_rt_period_us
>   *	Uextra:		Extra bandwidth not reserved:
> - *			= Umax - \Sum(u_i / #cpus in the root domain)
> + *			= Umax - this_bw
>   *	u_i:		Bandwidth of an admitted dl task in the
>   *			root domain.
>   *
> @@ -1286,22 +1286,14 @@ int dl_runtime_exceeded(struct sched_dl_entity *dl_se)
>   */
>  static u64 grub_reclaim(u64 delta, struct rq *rq, struct sched_dl_entity *dl_se)
>  {
> -	u64 u_act;
> -	u64 u_inact = rq->dl.this_bw - rq->dl.running_bw; /* Utot - Uact */
> -
>  	/*
> -	 * Instead of computing max{u, (rq->dl.max_bw - u_inact - u_extra)},
> -	 * we compare u_inact + rq->dl.extra_bw with
> -	 * rq->dl.max_bw - u, because u_inact + rq->dl.extra_bw can be larger
> -	 * than rq->dl.max_bw (so, rq->dl.max_bw - u_inact - rq->dl.extra_bw
> -	 * would be negative leading to wrong results)
> +	 * max{u, Umax - Uinact - Uextra}
> +	 * = max{u, max_bw - (this_bw - running_bw) - (max_bw - this_bw)}
> +	 * = max{u, running_bw} = running_bw
> +	 * So dq = -(max{u, Umax - Uinact - Uextra} / Umax) dt
> +	 *       = -(running_bw / max_bw) dt
>  	 */
> -	if (u_inact + rq->dl.extra_bw > rq->dl.max_bw - dl_se->dl_bw)
> -		u_act = dl_se->dl_bw;
> -	else
> -		u_act = rq->dl.max_bw - u_inact - rq->dl.extra_bw;
> -
> -	return div64_u64(delta * u_act, rq->dl.max_bw);
> +	return div64_u64(delta * rq->dl.running_bw, rq->dl.max_bw);

I did the test discussed later in this thread with:

3 [3/100] tasks (dl_se->dl_bw = (3 << 20)/100 = 31457) on 3 CPUs

factor = scaled_delta_exec/delta

- existing grub

rq->dl.bw_ratio = ( 100 << 8 ) / 95 = 269
rq->dl.extra_bw = ( 95 << 20 ) / 100 = 996147

cpu=2 curr->[thread0-2 1715] delta=2140100 this_bw=31457
running_bw=31457 extra_bw=894788 u_inact=0 u_act_min=33054 u_act=153788
scaled_delta_exec=313874 factor=0.14

- your solution (patches 1-2 applied)

cpu=2 curr->[thread0-0 1676] delta=157020 running_bw=31457 max_bw=996147
res=4958 factor=0.03
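
Decoding these numbers (my arithmetic, assuming BW_SHIFT = 20): with
existing grub

    u_act = BW_UNIT - u_inact - extra_bw = 1048576 - 0 - 894788 = 153788
    scaled_delta_exec = 2140100 * 153788 >> 20 ~= 313874 -> factor ~= 0.147

and with your patch

    res = 157020 * 31457 / 996147 ~= 4958 -> factor ~= 0.032

so the same 3/100 task depletes its budget ~4.6 times slower with the
patch applied.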

You say that the GRUB calculation is inaccurate and that this
inaccuracy gets larger as the bandwidth of the tasks becomes smaller.

Could you explain this inaccuracy using the example above?
