Message-ID: <CAKfTPtC+F_KqGYVtS9u-VM+9TshSRZG9kW3uL-wHyh-hkaAQbw@mail.gmail.com>
Date: Fri, 21 Jan 2022 16:27:23 +0100
From: Vincent Guittot <vincent.guittot@...aro.org>
To: Vincent Donnefort <Vincent.Donnefort@....com>
Cc: peterz@...radead.org, mingo@...hat.com,
linux-kernel@...r.kernel.org, dietmar.eggemann@....com,
Valentin.Schneider@....com, Morten.Rasmussen@....com,
Chris.Redpath@....com, qperret@...gle.com, Lukasz.Luba@....com
Subject: Re: [PATCH v2 2/7] sched/fair: Decay task PELT values during migration
On Thu, 20 Jan 2022 at 22:12, Vincent Donnefort
<vincent.donnefort@....com> wrote:
>
> [...]
>
> > > > And if the cpu is not idle, you can't apply the diff between clock_pelt and clock_task
> > > >
> > > > >
> > > > > > - (B) doesn't seem to be accurate as you skip irq and steal time
> > > > > > accounting and you don't apply any scale invariance if the cpu is not
> > > > > > idle
> > > > >
> > > > > The missing irq and paravirt time is the reason why it is called "estimator".
> > > > > But maybe there's a chance of improving this part with a lockless version of
> > > > > rq->prev_irq_time and rq->prev_steal_time_rq?
> > > > >
> > > > > > - IIUC your explanation in the commit message above, the (A) period
> > > > > > seems to be a problem only when idle but you apply it unconditionally.
> > > > >
> > > > > If the CPU is idle (and clock_pelt == clock_task), only the B part would be
> > > > > worth something:
> > > > >
> > > > > A + B = [clock_task - clock_pelt] + [sched_clock_cpu() - clock]
> > > > >                     A                            B
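> > > > >
> > > > > As a purely illustrative example with made-up numbers: if, at the
> > > > > migration, clock_task = 100us, clock_pelt = 96us, clock = 100us and
> > > > > sched_clock_cpu() = 103us, the estimator would decay the task by
> > > > > A + B = 4us + 3us = 7us on top of its last sync point.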
> > > > >
> > > > > > If the cpu is idle you can assume that clock_pelt should be equal to
> > > > > > clock_task, but you can't if the cpu is not idle, otherwise your sync
> > > > > > will be inaccurate and defeat the primary goal of this patch. If your
> > > > > > problem with clock_pelt is that the pending idle time is not accounted
> > > > > > for when entering idle but only at the next update (update of blocked
> > > > > > load or wakeup of a thread), the patch below should fix this and
> > > > > > remove your A.
> > > > >
> > > > > That would slightly help the current situation, but this part is already
> > > > > covered by the estimator.
> > > >
> > > > But the estimator, as you name it, is wrong because the A part can't be
> > > > applied unconditionally
> > >
> > > Hum, it is used only in the !active migration. So we know the task was
> > > sleeping before that migration. As a consequence, the time we need to
> > > account is the "sleeping" time from the task's point of view, which means
> > > clock_pelt == clock_task (for __update_load_avg_blocked_se()). Otherwise,
> > > we would only decay with the "wallclock" idle time instead of the "scaled"
> > > one, wouldn't we?
> >
> > clock_pelt == clock_task only when the cpu is idle and after updating
> > lost_idle_time, but you have no idea of the state of the cpu when
> > migrating the task.
>
> I was just applying the time scaling at the task level. Why should it depend
> on the CPU state?
>
> The situation would be as follows:
>
>                  <--X--> <--Y-->
>         +-------+-------+-------+
> CPUX ___| B     |   A   |   B   |___
>                                 ^
>                             migrate A
>
> In such a scenario, CPUX's PELT clock is indeed scaled. Task A's running
> time (X) has already been accounted for, so what's left is to get an accurate
> contribution for the idle time (Y). We would usually rely on the CPU being
> idle for the catch-up, and that time would be Y + (X - scaled(X)). Without
> the catch-up, we would only account for scaled(Y) at the migration for the
> sleeping time Y. Applied to the same kind of graph as update_rq_clock_pelt()'s:
>
> clock_task  | 1| 2| 3| 4| 5| 6| 7| 8|
> clock_pelt  | 1   | 2   | 3   | 4   |   (CPU's running, clock_pelt is scaled)
> expected    | 1   | 2   | 5| 6| 7| 8|
>             <---- X ---><--- Y ---->
> Task A      ************------------
>                                     ^
>                                 migrate A
>
> The contribution for Task A's idle time at the migration (as we know we won't
> have another chance to catch up with clock_task later) should be 6, not 2,
> regardless of the CPU state.
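>
> To make that arithmetic concrete, here is a tiny user-space model of the
> graph above (purely illustrative, not kernel code):
>
>   #include <stdio.h>
>
>   int main(void)
>   {
>           unsigned int clock_task = 0, clock_pelt = 0, sleep_pelt;
>
>           /* X: task A runs 4 ticks on a CPU at half capacity, so the
>            * PELT clock only advances by the scaled time (2 ticks). */
>           clock_task += 4;
>           clock_pelt += 4 / 2;
>           sleep_pelt = clock_pelt;        /* A blocks here: pelt = 2 */
>
>           /* Y: task B keeps the CPU busy for 4 more ticks, so the
>            * PELT clock keeps being scaled. */
>           clock_task += 4;
>           clock_pelt += 4 / 2;            /* pelt = 4, task = 8 */
>
>           /* Decay applied to the sleeping task A when it migrates: */
>           printf("sync with clock_pelt: %u\n", clock_pelt - sleep_pelt); /* 2 */
>           printf("catch up clock_task : %u\n", clock_task - sleep_pelt); /* 6 */
>           return 0;
>   }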
If task A wakes up on the same CPU, we sync with the scaled clock_pelt,
so why use something different here?
>
> _But_ indeed, there would be a risk of hitting the lost_idle_time threshold
> and decaying too much... (which is absolutely not handled in the current
> version). So now, if we don't want to bother too much, we could simplify the
> problem and say (which is true with NOHZ_IDLE) that if the CPU is running,
> its clock must not be that old anyway. So we should only care about the idle
> case, which is mitigated by your proposed snippet and allows me to get rid of
> the [A] part (clock_task - clock_pelt).
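>
> Purely to illustrate the shape I have in mind (hypothetical helper, the name
> is made up and this is not the actual patch):
>
>   /*
>    * Estimate the wall time elapsed on @rq since its clock was last
>    * updated, i.e. only the [B] part. rq->clock is read locklessly,
>    * so this can only ever be an estimate.
>    */
>   static u64 rq_clock_lag(struct rq *rq)
>   {
>           return sched_clock_cpu(cpu_of(rq)) - READ_ONCE(rq->clock);
>   }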
>
> As for sched_clock_cpu() usage, I haven't measured anything yet, but note that
> it's already used in the wakeup path in ttwu_queue_wakelist().
>