Message-ID: <CAO7JXPh5uLV4QjAEi6bJXfAGSsZ=XsnCyzrvKS8m35BGbRPYJw@mail.gmail.com>
Date:   Tue, 16 May 2023 11:08:18 -0400
From:   Vineeth Remanan Pillai <vineeth@...byteword.org>
To:     luca abeni <luca.abeni@...tannapisa.it>
Cc:     Juri Lelli <juri.lelli@...hat.com>,
        Daniel Bristot de Oliveira <bristot@...hat.com>,
        Peter Zijlstra <peterz@...radead.org>,
        Ingo Molnar <mingo@...hat.com>,
        Vincent Guittot <vincent.guittot@...aro.org>,
        Steven Rostedt <rostedt@...dmis.org>,
        Joel Fernandes <joel@...lfernandes.org>,
        Dietmar Eggemann <dietmar.eggemann@....com>,
        Ben Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>,
        Valentin Schneider <vschneid@...hat.com>,
        Jonathan Corbet <corbet@....net>, linux-kernel@...r.kernel.org,
        linux-doc@...r.kernel.org
Subject: Re: [PATCH v3 2/5] sched/deadline: Fix reclaim inaccuracy with SMP

Hi Luca,

On Tue, May 16, 2023 at 3:37 AM luca abeni <luca.abeni@...tannapisa.it> wrote:
> > I have noticed this behaviour where the reclaimed time is not equally
> > distributed when we have more tasks than available processors. But it
> > depended on where the task was scheduled. Within the same cpu, the
> > distribution seemed to be proportional.
>
> Yes, as far as I remember it is due to migrations. IIRC, the problem is
> related to the fact that, with "dq = -Uact / Umax * dt", a task running
> on a core might end up trying to reclaim some idle time from other
> cores (which is obviously not possible).
> This is why m-GRUB used "1 - Uinact" instead of "Uact".
>
This is what I was a little confused about. In "-Uact / Umax", all
the variables are per-cpu, so it should only be reclaiming what is
free on that cpu, right? And when a migration happens, Uact changes
and the reclaiming adapts itself. I was thinking it should probably
be okay for tasks to reclaim differently based on how much free
bandwidth is left on the cpu they are running on. For example: if
cpu 1 has two tasks of bandwidth .3 each, each task can reclaim
"(.95 - .6) / 2" on top of its reservation (.475 each), while another
cpu with only one task (.3 bandwidth) reclaims (.95 - .3). So both
cpus' utilization is .95, and tasks reclaim what is available on the
cpu they run on.
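Here is a quick userspace sketch of that arithmetic (just an
illustration of the per-cpu view I describe above; the helper name,
and using .95 as Umax, are my own choices, not kernel code):

  #include <stdio.h>

  #define UMAX 0.95 /* per-cpu reclaimable utilization limit */

  /*
   * Per-cpu view: each of the n active tasks on a cpu reclaims an
   * equal share of that cpu's free bandwidth (UMAX - Uact) on top
   * of its own reservation.
   */
  static double task_util(double task_bw, double cpu_uact, int n)
  {
          return task_bw + (UMAX - cpu_uact) / n;
  }

  int main(void)
  {
          /* cpu 1: two tasks of bw .3 each => Uact = .6 */
          printf("cpu1 task: %.3f\n", task_util(0.3, 0.6, 2)); /* 0.475 */
          /* cpu 2: one task of bw .3 => Uact = .3 */
          printf("cpu2 task: %.3f\n", task_util(0.3, 0.3, 1)); /* 0.950 */
          return 0;
  }

Both cpus end up at .95 utilization (2 * .475 on cpu 1, .95 on cpu 2),
which is what I would expect from a purely per-cpu reclaiming rule.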

With "1 - Uinact", where Uinact accounts for a portion of global free
bandwidth, tasks reclaim proportionately to the global free bandwidth
and this causes tasks with lesser bandwidth to reclaim lesser when
compared to higher bandwidth tasks even if they don't share the cpu.
This is what I was seeing in practice.
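A small example of what I mean (the Uinact value is made up, Uextra
is set to zero for simplicity; this is only to show the
proportionality):

  #include <math.h>
  #include <stdio.h>

  #define UMAX 0.95

  /*
   * Global view ("1 - Uinact"): every task's runtime is depleted at
   * rate = max(u_i, 1 - Uinact - Uextra) / Umax while it runs, so
   * the utilization a task can reach is u_i / rate, independent of
   * how idle its own cpu is.
   */
  static double task_util(double u_i, double uinact, double uextra)
  {
          double rate = fmax(u_i, 1.0 - uinact - uextra) / UMAX;

          return u_i / rate;
  }

  int main(void)
  {
          /* Both tasks see the same global free bandwidth... */
          double uinact = 0.4, uextra = 0.0;

          /* ...so this prints 0.158 and 0.475: utilization stays
           * proportional to u_i, and the smaller task reclaims
           * less even if it is alone on an otherwise idle cpu. */
          printf("u=0.1 -> util %.3f\n", task_util(0.1, uinact, uextra));
          printf("u=0.3 -> util %.3f\n", task_util(0.3, uinact, uextra));
          return 0;
  }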

But with "Uact / Umax", Uact can be greater than Umax, and this
caused issues like tasks not getting their reserved bandwidth. For
example: with 4 tasks of (7,10) on a 3-cpu system, one cpu can have
a Uact of 1.4, which makes scaled_delta greater than delta. This
causes the runtime to deplete faster than real time until one task
is migrated away, but after the migration the target cpu has the
same problem. So "Uact / Umax" was not working in situations close
to overcommitment.
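Numerically, for this example (plain C just to show the scaling;
this is not the patch's accounting code):

  #include <stdio.h>

  #define UMAX 0.95

  int main(void)
  {
          /* two (7,10) tasks packed on one cpu: Uact = 0.7 + 0.7 */
          double uact = 1.4;
          double delta = 1000000.0; /* 1ms of execution time, in ns */
          double scaled_delta = delta * uact / UMAX;

          /*
           * scaled_delta / delta = 1.4 / 0.95 ~= 1.47: runtime is
           * charged ~47% faster than real time, so the task runs
           * out of budget before getting its reserved 70%.
           */
          printf("delta=%.0f scaled_delta=%.0f (x%.2f)\n",
                 delta, scaled_delta, scaled_delta / delta);
          return 0;
  }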

In summary "1 - Uinact" causes reclaiming much less while "Uact / Umax"
has issues during overcommitting of tasks with high bandwidth. This is
what I understood from experiments and reading.

> [...]
> > > I need to think a little bit more about this...
> > >
> > Thanks for looking into this.. I have a basic idea why tasks with less
> > bandwidth reclaim less in SMP when number of tasks is less than number
> > of cpus, but do not yet have a verifiable fix for it.
>
> I think I can now understand at least part of the problem. In my
> understanding, the problem is due to using
>         dq = -(max{u_i, (Umax - Uinact - Uextra)} / Umax) * dt
>
> It should really be
>         dq = -(max{u_i, (1 - Uinact - Uextra)} / Umax) * dt
>
> (since we divide by Umax, using "Umax - ..." will lead to reclaiming up
> to "Umax / Umax" = 1)
>
> Did you try this equation?
>
I had tested this, and it reclaimed much less than the first
equation. The test used 3 tasks with reservation (3,100) on 3 cpus.

With dq = -(max{u_i, (Umax - Uinact - Uextra)} / Umax) * dt (1)
TID[636]: RECLAIM=1, (r=3ms, d=100ms, p=100ms), Util: 95.08
TID[635]: RECLAIM=1, (r=3ms, d=100ms, p=100ms), Util: 95.07
TID[637]: RECLAIM=1, (r=3ms, d=100ms, p=100ms), Util: 95.06

With dq = -(max{u_i, (1 - Uinact - Uextra)} / Umax) * dt (2)
TID[601]: RECLAIM=1, (r=3ms, d=100ms, p=100ms), Util: 35.65
TID[600]: RECLAIM=1, (r=3ms, d=100ms, p=100ms), Util: 35.65
TID[602]: RECLAIM=1, (r=3ms, d=100ms, p=100ms), Util: 35.65

As the task bandwidth goes higher, equation (2) reclaims more, while
equation (1) stays at a constant ~95% as long as the number of tasks
is less than the number of cpus. If the number of tasks is more than
the number of cpus, eq (2) fares better at reclaiming than eq (1):

eq (1) with 5 tasks (3,100)
TID[627]: RECLAIM=1, (r=3ms, d=100ms, p=100ms), Util: 28.64
TID[626]: RECLAIM=1, (r=3ms, d=100ms, p=100ms), Util: 28.64
TID[629]: RECLAIM=1, (r=3ms, d=100ms, p=100ms), Util: 28.62
TID[628]: RECLAIM=1, (r=3ms, d=100ms, p=100ms), Util: 29.00
TID[630]: RECLAIM=1, (r=3ms, d=100ms, p=100ms), Util: 28.99

Here, top shows the 3 cpus in the range of ~45 to 50% utilization.

eq (2) with 5 tasks (3,100)
TID[667]: RECLAIM=1, (r=3ms, d=100ms, p=100ms), Util: 57.20
TID[670]: RECLAIM=1, (r=3ms, d=100ms, p=100ms), Util: 57.79
TID[668]: RECLAIM=1, (r=3ms, d=100ms, p=100ms), Util: 57.11
TID[666]: RECLAIM=1, (r=3ms, d=100ms, p=100ms), Util: 56.34
TID[669]: RECLAIM=1, (r=3ms, d=100ms, p=100ms), Util: 55.82

And here, top shows all 3 cpus at ~95% utilization.
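For what it's worth, the 3-task numbers can be reproduced with a toy
calculation. I am assuming the per-cpu accounting sees Uinact = 0
(all three tasks are active) and Uextra = .95 - .09/3 = .92; that
Uextra value is my guess at what the extra-bandwidth bookkeeping
holds, and the rest is just the two equations:

  #include <math.h>
  #include <stdio.h>

  #define UMAX 0.95

  /* eq (1): dq = -(max{u_i, (Umax - Uinact - Uextra)} / Umax) * dt */
  static double util_eq1(double u_i, double uinact, double uextra)
  {
          return u_i / (fmax(u_i, UMAX - uinact - uextra) / UMAX);
  }

  /* eq (2): dq = -(max{u_i, (1 - Uinact - Uextra)} / Umax) * dt */
  static double util_eq2(double u_i, double uinact, double uextra)
  {
          return u_i / (fmax(u_i, 1.0 - uinact - uextra) / UMAX);
  }

  int main(void)
  {
          double u_i = 0.03; /* one (3,100) task per cpu */

          /* eq (1) bottoms out at u_i / Umax, so the utilization
           * saturates at Umax: prints 95.00% */
          printf("eq1: %.2f%%\n", 100.0 * util_eq1(u_i, 0.0, 0.92));
          /* eq (2) depletes at (1 - .92) / .95: prints 35.62% */
          printf("eq2: %.2f%%\n", 100.0 * util_eq2(u_i, 0.0, 0.92));
          return 0;
  }

That comes out at ~95% vs ~35.6%, close to the measured 95.08 and
35.65 above, so the gap between the two equations does look like the
"Umax - ..." vs "1 - ..." numerator you pointed out.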

> I'll write more about this later... And thanks for coping with all my
> comments!
>
Thanks :-)

Vineeth
