linux-kernel - Re: [PATCH v2 1/2] sched/deadline: Improve reclaim bandwidth accuracy for GRUB

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAEXW_YRjNAUXFD8rPQQ53Vu8j8ka=Z-WEZS54JXEroVH7TPVjg@mail.gmail.com>
Date:   Wed, 10 May 2023 23:19:27 -0700
From:   Joel Fernandes <joel@...lfernandes.org>
To:     Vineeth Pillai <vineeth@...byteword.org>
Cc:     luca.abeni@...tannapisa.it, Juri Lelli <juri.lelli@...hat.com>,
        Daniel Bristot de Oliveira <bristot@...hat.com>,
        Peter Zijlstra <peterz@...radead.org>,
        Ingo Molnar <mingo@...hat.com>,
        Vincent Guittot <vincent.guittot@...aro.org>,
        Steven Rostedt <rostedt@...dmis.org>,
        Dietmar Eggemann <dietmar.eggemann@....com>,
        Ben Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>,
        Valentin Schneider <vschneid@...hat.com>,
        Jonathan Corbet <corbet@....net>, linux-kernel@...r.kernel.org,
        linux-doc@...r.kernel.org
Subject: Re: [PATCH v2 1/2] sched/deadline: Improve reclaim bandwidth accuracy
 for GRUB

Hi Vineeth,
Nice work. ;-) Few quick comments below:

On Wed, May 10, 2023 at 6:46 PM Vineeth Pillai <vineeth@...byteword.org> wrote:
>
> Current reclaim calculation for GRUB is a bit inaccurate and the
> inaccuracy gets larger as the bandwidth of tasks becomes smaller.
> I have a test program to show the issue - it runs one or more
> deadline threads and observes the utilization. Following tests
> are run on an isolated cpu(isolcpus=3) in a 4 cpu system and the
> results as shown below:
>
> RUN 1: runtime=7ms, deadline=period=10ms, RT capacity = 95%
> TID[693]: RECLAIM=1, (r=7ms, d=10ms, p=10ms), Util: 93.33
> TID[693]: RECLAIM=1, (r=7ms, d=10ms, p=10ms), Util: 93.35
> TID[693]: RECLAIM=1, (r=7ms, d=10ms, p=10ms), Util: 93.35
> TID[693]: RECLAIM=1, (r=7ms, d=10ms, p=10ms), Util: 93.29
>
> RUN 2: runtime=2ms, deadline=period=10ms, RT capacity = 95%
> TID[704]: RECLAIM=1, (r=2ms, d=10ms, p=10ms), Util: 79.96
> TID[704]: RECLAIM=1, (r=2ms, d=10ms, p=10ms), Util: 80.06
> TID[704]: RECLAIM=1, (r=2ms, d=10ms, p=10ms), Util: 80.00
>
> RUN 3: runtime=1ms, deadline=period=100ms, RT capacity = 95%
> TID[708]: RECLAIM=1, (r=1ms, d=100ms, p=100ms), Util: 16.69
> TID[708]: RECLAIM=1, (r=1ms, d=100ms, p=100ms), Util: 16.69
> TID[708]: RECLAIM=1, (r=1ms, d=100ms, p=100ms), Util: 16.70
>
> When running multiple tasks, the reclaimed bandwidth is divided
> proportionately, but is not reclaimed to the max allowable limit:
>
> RUN 4: 2 SCHED_FLAG_RECLAIM tasks, 1 normal task
>         Task 1: runtime=1ms, deadline=period=10ms
>         Task 2: runtime=1ms, deadline=period=10ms
>         Task 3: runtime=5ms, deadline=period=20ms(normal)
> TID[624]: RECLAIM=1, (r=1ms, d=10ms, p=10ms), Util: 20.10
> TID[625]: RECLAIM=1, (r=1ms, d=10ms, p=10ms), Util: 20.10
> TID[626]: RECLAIM=0, (r=5ms, d=20ms, p=20ms), Util: 25.07
> TID[624]: RECLAIM=1, (r=1ms, d=10ms, p=10ms), Util: 20.06
> TID[625]: RECLAIM=1, (r=1ms, d=10ms, p=10ms), Util: 20.13
> TID[626]: RECLAIM=0, (r=5ms, d=20ms, p=20ms), Util: 25.12
> TID[624]: RECLAIM=1, (r=1ms, d=10ms, p=10ms), Util: 19.95
> TID[625]: RECLAIM=1, (r=1ms, d=10ms, p=10ms), Util: 19.93
> TID[626]: RECLAIM=0, (r=5ms, d=20ms, p=20ms), Util: 25.04
>
> I have also tested multiple tasks on all cpus allowing for tasks to
> migrate and see the same issue there as well. Running 10 tasks on 3
> cpus with 6 SCHED_FLAG_RECLAIM and 4 normal tasks, top shows:
> %Cpu0  : 70.1 us,  0.3 sy,  0.0 ni, 29.3 id,  0.0 wa
> %Cpu1  : 69.1 us,  0.3 sy,  0.0 ni, 30.3 id,  0.3 wa
> %Cpu2  : 70.5 us,  0.3 sy,  0.0 ni, 29.2 id,  0.0 wa
>
> The max{} logic in the existing implementation seems to not fully
> capture the GRUB algorithm.
>
> This patch fixes the issue by appropriatley caping the max allowed
> utilization and also slightly adjusting GRUB algorithm to account
> for a mix of normal deadline and SCHED_FLAG_RECLAIM tasks.

Looked at the patch quickly as I am due for bedtime ;-). I feel like
this does more than should be done in one patch. So you should
consider splitting it IMHO.

> According to the GRUB rule, the runtime is depreciated as a factor
> of active bandwidth of the runqueue: "dq = -dt", where U is the

Saying "dq = -dt" where U does not make sense because U is not in the
equation ;-). Suggest rephrase.

> active bandwidth. Also, we do not allocate the full bandwidth of a
> cpu to deadline task, but only a portion(Umax) to it, so as to avoid
> deadline tasks starving lower class tasks. The equation could be
> re-written as "dq = -(U / Umax) * dt"

Isn't the equation in the code right now as:
dq = -max{ Ui / Umax, (1 - Uinact - Uextra) } dt

?

That's what the kernel docs say [1].

So what do you mean by "could be re-written" ?

[1] https://docs.kernel.org/scheduler/sched-deadline.html

>
> Since both normal deadline and SCHED_FLAG_RECLAIM tasks can share
> cpu, we need to consider bandwidth of only SCHED_FLAG_RECLAIM tasks
> in the equation:
>         "dq = -(Ureclaim / Umax_reclaim) * dt"

This makes sense, we deplete only as a percent of what's reclaimable
and not the whole damn thing. You did test a mix of reclaim and
non-reclaim tasks so that's good, also since Luca approves that's
great. ;-)

>
> Following are the results with this patch:
>
> RUN 1: runtime=7ms, deadline=period=10ms, RT capacity = 95%
> TID[616]: RECLAIM=1, (r=7ms, d=10ms, p=10ms), Util: 94.98
> TID[616]: RECLAIM=1, (r=7ms, d=10ms, p=10ms), Util: 95.04
> TID[616]: RECLAIM=1, (r=7ms, d=10ms, p=10ms), Util: 95.01

All these look 100% correct to me. Do these tasks start at the same
time or are they shifted in their respective activations? Just wanted
to be sure it behaves the same either way...

[...]
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 3e8df6d31c1e..13d85af0f42b 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -257,6 +257,11 @@ static inline bool dl_entity_is_special(const struct sched_dl_entity *dl_se)
>  #endif
>  }
>
> +static inline bool dl_entity_is_reclaim(const struct sched_dl_entity *dl_se)
> +{
> +       return dl_se->flags & SCHED_FLAG_RECLAIM;
> +}
> +

Can this helper addition be split out to a different patch?

>  /*
>   * Tells if entity @a should preempt entity @b.
>   */
> @@ -754,10 +759,20 @@ struct dl_rq {
>         u64                     extra_bw;
>
>         /*
> -        * Inverse of the fraction of CPU utilization that can be reclaimed
> -        * by the GRUB algorithm.
> +        * Maximum available bandwidth for this runqueue. This is used to
> +        * calculate reclaimable bandwidth for SCHED_FLAG_RECLAIM tasks.
> +        * By restricting maximum usable bandwidth, we aim to give other
> +        * tasks on lower classes a chance to run, when competing with
> +        * SCHED_FLAG_RECLAIM tasks.
>          */
> -       u64                     bw_ratio;
> +       u64                     max_bw;
> +
> +       /*
> +        * Active bandwidth of SCHED_FLAG_RECLAIM tasks on this rq.
> +        * This will be a subset of running_bw.
> +        */
> +       u64                     reclaim_bw;
> +

And perhaps addition and use of these new fields if it makes sense.

I will take a closer look at your patches later or after v2..

 - Joel


>  };
>
>  #ifdef CONFIG_FAIR_GROUP_SCHED
> --
> 2.40.1
>