linux-kernel - Re: [PATCH v3 2/5] sched/deadline: Fix reclaim inaccuracy with SMP

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <CAO7JXPhZPvzVRyL87qNT5VnaVOf=0wrRftFB-Rjx-vJc3JUMog@mail.gmail.com>
Date:   Fri, 19 May 2023 12:12:50 -0400
From:   Vineeth Remanan Pillai <vineeth@...byteword.org>
To:     luca abeni <luca.abeni@...tannapisa.it>
Cc:     Juri Lelli <juri.lelli@...hat.com>,
        Daniel Bristot de Oliveira <bristot@...hat.com>,
        Peter Zijlstra <peterz@...radead.org>,
        Ingo Molnar <mingo@...hat.com>,
        Vincent Guittot <vincent.guittot@...aro.org>,
        Steven Rostedt <rostedt@...dmis.org>,
        Joel Fernandes <joel@...lfernandes.org>,
        Dietmar Eggemann <dietmar.eggemann@....com>,
        Ben Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>,
        Valentin Schneider <vschneid@...hat.com>,
        Jonathan Corbet <corbet@....net>, linux-kernel@...r.kernel.org,
        linux-doc@...r.kernel.org
Subject: Re: [PATCH v3 2/5] sched/deadline: Fix reclaim inaccuracy with SMP

Hi Luca,

On Fri, May 19, 2023 at 6:18 AM luca abeni <luca.abeni@...tannapisa.it> wrote:
>
> On Fri, 19 May 2023 11:56:21 +0200
> luca abeni <luca.abeni@...tannapisa.it> wrote:
> [...]
>
> OK, sorry again... I found my error immediately after sending the email.
> Uextra is computed as "Umax - ...", not "1 - ...".
> So, I now understand where the 35% comes from.
>
Thanks for debugging this, it makes sense now!

> I now _suspect_ the correct equation should be
>         dq = -(max{u_i / Umax, (Umax - Uinact - Uextra)}) * dt
> but I want to test it before wasting your time again; I'll write more
> after performing some more tests.
>
I tried this equation and it fixes the above issue. But a little
confused as to why we should not be limiting the second term to Umax?
In my testing, I was seeing the issue solved with this equation as
well:
 "dq = -(max{u_i, (Umax - Uinact - Uextra)} / Umax) * dt"

With both these equations, it doesn't solve couple of other issues we
had discussed before:
- tasks with different bandwidth reclaims differently even when #tasks
  is less than #cpus.
- cpu util may go to 100% when we have tasks with large bandwidth close
  to Umax

As an eg. for issue 1, three tasks - (7,10) (3,10) and (1,10):
TID[590]: RECLAIM=1, (r=7ms, d=10ms, p=10ms), Util: 95.20
TID[591]: RECLAIM=1, (r=3ms, d=10ms, p=10ms), Util: 81.94
TID[592]: RECLAIM=1, (r=1ms, d=10ms, p=10ms), Util: 27.19

re. issue 2, four tasks with same reservation (7,10), tasks tries
to reclaim leading to 100% cpu usage on all three cpus and leads to
system hang.

I was trying to understand the issue and it looks like static values
of Uextra and Umax are causing inaccuracies. Uextra is calculated
based on global bandwidth But based on the local state of the cpu, we
could reclaim more or less than (u_inact + rq->dl.extra_bw). Similarly
Umax is a local max for each cpu and we should not be reclaiming upto
Umax unconditionally. If the global load is high, reclaiming upto Umax
would cause problems as a migration can land in.

I was trying an idea to dynamically decide Uextra and Umax based on
global and local load.

The crux of the idea is as below

+ if (rq->dl.running_bw > rq->dl.max_bw)
+ return delta;
+
+ max_bw = rq->dl.this_bw + rq->dl.extra_bw;
+ extra_bw = rq->dl.extra_bw;
+ if (rq->dl.this_bw < rq->dl.extra_bw || max_bw > rq->dl.max_bw) {
+ extra_bw = rq->dl.max_bw - rq->dl.this_bw;
+ max_bw = rq->dl.max_bw;
+ }
+

And use this max_bw and extra_bw in the equation:
 "dq = -(max{u_i, (Umax - Uinact - Uextra)} / Umax) * dt"

The reasoning for above changes are:
- running_bw can be greater than max_bw in SMP and we should not be
  reclaiming if that's the case
- Initially we assume max_bw and extra_bw, based on global load.
  If this_bw < extra_bw, it means that cpus are not heavily loaded
  and incoming migrations can be satisfied with local cpu's available
  bandwidth and we could reclaim upto rq->dl.max_bw and use extra_bw
  as "rq->dl.max_bw - rq->dl.this_bw". Also we should limit max_bw for
  any cpu to a maximum of rq->dl.max_cpu.

This does not consider mix of normal and SCHED_FLAG_RECLAIM tasks and
minor changes on top of this are need to consider that scenario. This
seems to fix all the issues that I have encountered.

The above fix doesn't work on equation:
  "dq = -(max{u_i / Umax, (Umax - Uinact - Uextra)}) * dt"

cpu still spikes to 100% with 4 tasks of (7,10). I am not sure, but I
guess it might be because we are not limiting the second term to Umax.

Please let me know your thoughts on this.

Thanks again,
Vineeth