linux-kernel - Re: [PATCH v2 2/7] sched/fair: Decay task PELT values during migration

Message-ID: <Yef8kTnlP5h4I7/1@FVFF7649Q05P>
Date:   Wed, 19 Jan 2022 11:59:44 +0000
From:   Vincent Donnefort <vincent.donnefort@....com>
To:     Vincent Guittot <vincent.guittot@...aro.org>
Cc:     peterz@...radead.org, mingo@...hat.com,
        linux-kernel@...r.kernel.org, dietmar.eggemann@....com,
        Valentin.Schneider@....com, Morten.Rasmussen@....com,
        Chris.Redpath@....com, qperret@...gle.com, Lukasz.Luba@....com
Subject: Re: [PATCH v2 2/7] sched/fair: Decay task PELT values during
 migration

[...]

> > >
> > > This has several shortfalls:
> > > - have a look at cfs_rq_clock_pelt() and rq_clock_pelt(). What you
> > > call clock_pelt in your commit message, which is used to update PELT and
> > > is saved in se->avg.last_update_time, is: rq->clock_pelt -
> > > rq->lost_idle_time - cfs_rq->throttled_clock_task_time
> >
> > That's why the PELT "lag" is added onto se->avg.last_update_time (see the last
> > paragraph of the commit message). The estimator is just a time delta that is
> > added on top of the entity's last_update_time. I don't see any problem with the
> > lost_idle_time here.
> 
> lost_idle_time is updated before entering idle and after your
> clock_pelt_lag has been updated. This means that the delta that you
> are computing can be wrong.
> 
> I haven't looked in detail, but a similar problem probably happens for
> throttled_clock_task_time.
> 
> >
> > > - you are doing this whatever the state of the cpu : idle or not. But
> > > the clock cycles are not accounted for in the same way in both cases.
> >
> > If the CPU is idle and clock_pelt == clock_task, the component A of the
> > estimator would be 0 and we only would account for how outdated is the rq's
> > clock, i.e. component B.
> 
> And if cpu is not idle, you can't apply the diff between clock_pelt and clock_task.
> 
> >
> > > - (B) doesn't seem to be accurate as you skip irq and steal time
> > > accounting and you don't apply any scale invariance if the cpu is not
> > > idle
> >
> > The missing irq and paravirt time is the reason why it is called "estimator".
> > But maybe there's a chance of improving this part with a lockless version of
> > rq->prev_irq_time and rq->prev_steal_time_rq?
> >
> > > - IIUC your explanation in the commit message above, the (A) period
> > > seems to be a problem only when idle but you apply it unconditionally.
> >
> > If the CPU is idle (and clock_pelt == clock_task), only the B part would be
> > worth something:
> >
> >   A + B = [clock_task - clock_pelt] + [sched_clock_cpu() - clock]
> >                       A                            B
> >
> > > If cpu is idle you can assume that clock_pelt should be equal to
> > > clock_task but you can't if cpu is not idle, otherwise your sync will
> > > be inaccurate and defeat the primary goal of this patch. If your
> > > problem with clock_pelt is that the pending idle time is not accounted
> > > for when entering idle but only at the next update (update blocked
> > > load or wakeup of a thread), the patch below should fix this and
> > > remove your A.
> >
> > That would help slightly the current situation, but this part is already
> > covered by the estimator.
> 
> But the estimator, as you name it, is wrong because the A part can't be
> applied unconditionally

Hum, it is used only in the !active migration case, so we know the task was sleeping
before that migration. As a consequence, the time we need to account for is "sleeping"
time from the task's point of view, for which clock_pelt == clock_task (for
__update_load_avg_blocked_se()). Otherwise, we would only decay with the
"wallclock" idle time instead of the "scaled" one, wouldn't we?


     +-------------+--------------
     |   Task A    |    Task B    .....
              ^    ^             ^
              |    |          migrate A
              |    |             |
              |    |             |
              |    |             |
              |    |<----------->|
              |  Wallclock Task A idle time
              |<---------------->|
            "Scaled" Task A idle time

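To make the A + B arithmetic discussed above concrete, here is a minimal
userspace sketch in C. It is not the code from the patch: struct rq_clocks,
run_for(), lag_a(), lag_b() and estimate_update_time() are hypothetical names,
irq and steal time are ignored (so clock_task advances like clock), and the
scaling is a simplified model of how clock_pelt advances more slowly than
clock_task while the CPU runs below full capacity.

```c
#include <assert.h>
#include <stdint.h>

#define CAPACITY_SHIFT 10	/* assumption: 1024 means full capacity */

/* Hypothetical snapshot of one runqueue's clocks, all in nanoseconds. */
struct rq_clocks {
	uint64_t clock;		/* stands in for rq->clock at its last update */
	uint64_t clock_task;	/* stands in for rq->clock_task */
	uint64_t clock_pelt;	/* stands in for rq->clock_pelt */
};

/*
 * Simplified model of a busy CPU: clock and clock_task advance by the
 * wallclock delta, clock_pelt by the delta scaled with 'scale' (0..1024).
 * Irq and steal time are deliberately left out.
 */
static void run_for(struct rq_clocks *c, uint64_t delta_ns, uint64_t scale)
{
	c->clock += delta_ns;
	c->clock_task += delta_ns;
	c->clock_pelt += (delta_ns * scale) >> CAPACITY_SHIFT;
}

/* Component A: how far clock_pelt trails clock_task. */
static uint64_t lag_a(const struct rq_clocks *c)
{
	assert(c->clock_task >= c->clock_pelt);
	return c->clock_task - c->clock_pelt;
}

/* Component B: how stale clock is; now_ns stands in for sched_clock_cpu(). */
static uint64_t lag_b(const struct rq_clocks *c, uint64_t now_ns)
{
	assert(now_ns >= c->clock);
	return now_ns - c->clock;
}

/* The estimator is a plain time delta added on top of last_update_time. */
static uint64_t estimate_update_time(uint64_t last_update_time,
				     const struct rq_clocks *c,
				     uint64_t now_ns)
{
	return last_update_time + lag_a(c) + lag_b(c, now_ns);
}
```

In this model, running for 4 ms at scale 512 advances clock_pelt by only 2 ms,
so A is 2 ms. Once clock_pelt has been synced to clock_task on idle, A is 0 and
only B contributes, which is the idle case described in the thread. Whether A
may also be applied when the CPU is not idle is exactly the point under
discussion, and the sketch takes no position on it.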

[...]