linux-kernel - Re: [PATCH v3 2/5] sched/deadline: Fix reclaim inaccuracy with SMP

Message-ID: <01579005-e5d3-4c53-9331-45324542ca34@arm.com>
Date:   Thu, 25 May 2023 13:55:37 +0200
From:   Dietmar Eggemann <dietmar.eggemann@....com>
To:     Vineeth Remanan Pillai <vineeth@...byteword.org>
Cc:     luca.abeni@...tannapisa.it, Juri Lelli <juri.lelli@...hat.com>,
        Daniel Bristot de Oliveira <bristot@...hat.com>,
        Peter Zijlstra <peterz@...radead.org>,
        Ingo Molnar <mingo@...hat.com>,
        Vincent Guittot <vincent.guittot@...aro.org>,
        Steven Rostedt <rostedt@...dmis.org>,
        Joel Fernandes <joel@...lfernandes.org>,
        Ben Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>,
        Valentin Schneider <vschneid@...hat.com>,
        Jonathan Corbet <corbet@....net>, linux-kernel@...r.kernel.org,
        linux-doc@...r.kernel.org, youssefesmat@...gle.com
Subject: Re: [PATCH v3 2/5] sched/deadline: Fix reclaim inaccuracy with SMP

Hi Vineeth,

On 20/05/2023 04:15, Vineeth Remanan Pillai wrote:
> Hi Dietmar,
> 
> On Fri, May 19, 2023 at 1:56 PM Dietmar Eggemann
> <dietmar.eggemann@....com> wrote:
> 
>>> TID[730]: RECLAIM=1, (r=8ms, d=10ms, p=10ms), Util: 95.05
>>> TID[731]: RECLAIM=1, (r=1ms, d=10ms, p=10ms), Util: 31.34
>>> TID[732]: RECLAIM=1, (r=1ms, d=100ms, p=100ms), Util: 3.16
>>
>> What does this 'Util: X' value stand for? I assume it's the utilization
>> of the task? How do you obtain it?
>>
> Yes, it is the utilization of the task. I calculate it by dividing the
> cputime by the elapsed time (using clock_gettime(2)).

Makes sense. I guess what I missed in the first place is the fact that
those DL tasks want to run 100% of the time.
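
For reference, the measurement described above (CPU time divided by
elapsed time) could be sketched in userspace roughly as follows. This is
only an illustration, not the actual test program: the helper names are
made up here, and the choice of CLOCK_THREAD_CPUTIME_ID and
CLOCK_MONOTONIC is an assumption, since the thread only mentions
clock_gettime(2).

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L	/* for clock_gettime() and the clock IDs */
#endif
#include <time.h>

/* Hypothetical helper: difference b - a in seconds. */
double ts_diff_sec(const struct timespec *a, const struct timespec *b)
{
	return (double)(b->tv_sec - a->tv_sec) +
	       (double)(b->tv_nsec - a->tv_nsec) / 1e9;
}

/* Utilization in percent: CPU time consumed over wall-clock time elapsed. */
double util_percent(const struct timespec *cpu0, const struct timespec *cpu1,
		    const struct timespec *wall0, const struct timespec *wall1)
{
	double elapsed = ts_diff_sec(wall0, wall1);

	if (elapsed <= 0.0)
		return 0.0;
	return 100.0 * ts_diff_sec(cpu0, cpu1) / elapsed;
}

/* Sample both clocks for the calling thread (clock IDs are assumed). */
int sample_clocks(struct timespec *cpu, struct timespec *wall)
{
	if (clock_gettime(CLOCK_THREAD_CPUTIME_ID, cpu))
		return -1;
	return clock_gettime(CLOCK_MONOTONIC, wall);
}
```

Sampling once before and once after the measurement interval and passing
the four values to util_percent() gives a number comparable to the
'Util:' column; a task that ran 9.5ms out of every 10ms would report
about 95.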

>> I see that e.g. TID[731] should run 1ms each 10ms w/o grub and with grub
>> the runtime could be potentially longer since 'scaled_delta_exec < delta'.
>>
> Yes, correct. The GRUB (Greedy Reclamation of Unused Bandwidth) algorithm
> is used here for deadline tasks that need to run longer than their
> runtime when needed. sched_setattr allows a flag SCHED_FLAG_RECLAIM
> to indicate that the task would like to reclaim unused bandwidth of a
> cpu if available. For those tasks, 'runtime' is depreciated using the
> GRUB formula, which allows them to run for longer and reclaim the free
> bandwidth of the cpu. The GRUB implementation in Linux allows a task
> to reclaim up to RT capacity (95%) and depends on the free bandwidth
> of the cpu. So TID[731] theoretically should run for 9.5ms in each
> 10ms period (95%), as it is the only task on the cpu, but it doesn't
> get to run that long.

Correct.
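
As a rough sketch of how a task opts in to reclaiming, assuming a Linux
system: the struct below is a local copy of the first-version sched_attr
layout, and the names dl_attr, dl_attr_fill() and dl_set_reclaim() are
made up for this example. The raw syscall is used because a libc wrapper
for sched_setattr may not be available.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE		/* for syscall() */
#endif
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>

/* Values taken from the Linux UAPI headers. */
#define DL_POLICY_DEADLINE	6	/* SCHED_DEADLINE */
#define DL_FLAG_RECLAIM		0x02	/* SCHED_FLAG_RECLAIM */

/* Local copy of the first-version sched_attr layout (48 bytes). */
struct dl_attr {
	uint32_t size;
	uint32_t sched_policy;
	uint64_t sched_flags;
	int32_t  sched_nice;
	uint32_t sched_priority;
	uint64_t sched_runtime;		/* ns */
	uint64_t sched_deadline;	/* ns */
	uint64_t sched_period;		/* ns */
};

/* Fill in a SCHED_DEADLINE request with bandwidth reclaiming enabled. */
void dl_attr_fill(struct dl_attr *attr, uint64_t runtime_ns,
		  uint64_t deadline_ns, uint64_t period_ns)
{
	memset(attr, 0, sizeof(*attr));
	attr->size = sizeof(*attr);
	attr->sched_policy = DL_POLICY_DEADLINE;
	attr->sched_flags = DL_FLAG_RECLAIM;
	attr->sched_runtime = runtime_ns;
	attr->sched_deadline = deadline_ns;
	attr->sched_period = period_ns;
}

/* Apply to the calling thread; returns 0 on success, -1 with errno set.
 * Typically needs root or CAP_SYS_NICE. */
int dl_set_reclaim(const struct dl_attr *attr)
{
	return (int)syscall(SYS_sched_setattr, 0, attr, 0);
}
```

For the (r=1ms, d=10ms, p=10ms) case above this would be
dl_attr_fill(&attr, 1000000, 10000000, 10000000) followed by
dl_set_reclaim(&attr).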

>> I don't get this comment in update_curr_dl():
>>
>>     /*
>>      * For tasks that participate in GRUB, we implement GRUB-PA: the
>>      * spare reclaimed bandwidth is used to clock down frequency.
>>      *
>>
>> It looks like dl_se->runtime is affected and with 'scaled_delta_exec <
>> delta' the task runs longer than dl_se->dl_runtime?
>>
> Yes. As mentioned above, GRUB allows the task to run longer by slowing
> down the depreciation of "dl_se->runtime". scaled_delta_exec is
> calculated by the GRUB formula explained in the paper [1] & [2].

What I didn't understand was this `GRUB-PA` and `the spare reclaimed
bandwidth is used to clock down frequency` in relation to GRUB task
runtime depreciation.

But now I think I get it. `GRUB-PA` means that in case we run with the
schedutil CPUfreq governor, the CPU frequency is influenced by Uact
(rq->dl.running_bw) via:

sugov_get_util() -> effective_cpu_util() -> cpu_bw_dl() ->

      return (rq->dl.running_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;

and on top of this we do GRUB reclaiming for those SCHED_FLAG_RECLAIM
tasks, i.e. task runtime depreciation.
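
To put a number on that conversion, here is a small standalone sketch of
the same arithmetic. The function name is made up for this example; the
constants are BW_SHIFT = 20 and SCHED_CAPACITY_SCALE = 1024, and
running_bw is the 2^20-scaled bandwidth of the active DL tasks.

```c
#include <stdint.h>

#define BW_SHIFT		20
#define SCHED_CAPACITY_SCALE	1024ULL

/* Convert a 2^20 fixed-point DL bandwidth into capacity units (0..1024). */
uint64_t dl_bw_to_capacity(uint64_t running_bw)
{
	return (running_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;
}
```

A single 3ms/100ms task (running_bw = 31457) thus contributes 30 out of
1024 to the utilization schedutil sees, while a fully loaded CPU
(running_bw = 1 << 20) maps to 1024.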

>> I did the test discussed later in this thread with:
>>
>> 3 [3/100] tasks (dl_se->dl_bw = (3 << 20)/100 = 31457) on 3 CPUs
>>
>> factor = scaled_delta_exec/delta
>>
>> - existing grub
>>
>> rq->dl.bw_ratio = ( 100 << 8 ) / 95 = 269
>> rq->dl.extra_bw = ( 95 << 20 ) / 100 = 996147
>>
>> cpu=2 curr->[thread0-2 1715] delta=2140100 this_bw=31457
>> running_bw=31457 extra_bw=894788 u_inact=0 u_act_min=33054 u_act=153788
>> scaled_delta_exec=313874 factor=0.14
>>
>> - your solution patch [1-2]
>>
>> cpu=2 curr->[thread0-0 1676] delta=157020 running_bw=31457 max_bw=996147
>> res=4958 factor=0.03
>>
>> You say that GRUB calculation is inaccurate and that this inaccuracy
>> gets larger as the bandwidth of tasks becomes smaller.
>>
>> Could you explain this inaccuracy on this example?
>>
> According to GRUB, we should be able to reclaim the unused bandwidth
> for the running task up to the RT limit (95%). In this example we have
> a task with 3ms runtime and 100ms period on a cpu. So it is supposed
> to run for 95ms before it is throttled.

Correct.
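
The numbers in the 'existing grub' trace above can be reproduced with a
small userspace model. This is a sketch based on my reading of the
pre-patch reclaim formula, not a copy of the kernel code; the function
and parameter names are chosen for this example, with BW_SHIFT = 20 and
RATIO_SHIFT = 8 as in the calculations above.

```c
#include <stdint.h>

#define BW_SHIFT	20
#define BW_UNIT		(1ULL << BW_SHIFT)
#define RATIO_SHIFT	8

/* Fixed-point bandwidth: runtime / period, scaled by 2^20. */
uint64_t to_ratio(uint64_t period, uint64_t runtime)
{
	return (runtime << BW_SHIFT) / period;
}

/* Model of the existing GRUB scaling of an execution delta. */
uint64_t grub_scaled_delta(uint64_t delta, uint64_t dl_bw, uint64_t this_bw,
			   uint64_t running_bw, uint64_t extra_bw,
			   uint64_t bw_ratio)
{
	uint64_t u_inact = this_bw - running_bw;	/* Utot - Uact */
	uint64_t u_act_min = (dl_bw * bw_ratio) >> RATIO_SHIFT;
	uint64_t u_act;

	/* Never charge less than the task's own (scaled) bandwidth. */
	if (u_inact + extra_bw > BW_UNIT - u_act_min)
		u_act = u_act_min;
	else
		u_act = BW_UNIT - u_inact - extra_bw;

	return (delta * u_act) >> BW_SHIFT;
}
```

With delta=2140100, dl_bw=this_bw=running_bw=31457, extra_bw=894788 and
bw_ratio=269 this yields u_act_min=33054, u_act=153788 and a scaled
delta of 313874, matching the trace.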

> Existing implementation's factor = 0.14 and 3ms is depreciated by
> this factor. So it gets to run for "3 / 0.14 ~= 22ms". This is the
> inaccuracy that the patch is trying to solve. With the patch, the
> factor is .03166 and runtime = "3 / 0.03166 ~= 95ms"

My tests were wrong since I was using DL tasks with dl_runtime=3ms and
dl_period=100ms and an actual runtime of 3ms, whereas your tasks
probably want to run 100% of the time.
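
The effect of the two factors on how long a task can run before being
throttled can be checked with a few lines of arithmetic. This is only a
sketch: the function name is made up, and the inputs are the delta and
scaled values logged in the traces above.

```c
/*
 * Wall-clock time (ms) a task can execute before a budget of runtime_ms
 * is used up, when each 'delta' of execution is charged as only 'scaled'.
 */
double ms_until_throttle(double runtime_ms, double delta, double scaled)
{
	return runtime_ms * delta / scaled;
}
```

With the existing-grub trace (delta=2140100, scaled_delta_exec=313874) a
3ms budget lasts about 20.5ms; the ~22ms quoted above comes from using
the rounded factor 0.14. With the values logged for the patch
(delta=157020, res=4958) it lasts about 95ms.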

> Hope this clarifies.

Yes, it did, thanks!