[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAKfTPtCK3x63Gbv5nRfs_EOZeNUDSZtqkUXPBXYQ0D6Br1Rb=w@mail.gmail.com>
Date: Thu, 21 Apr 2022 09:36:08 +0200
From: Vincent Guittot <vincent.guittot@...aro.org>
To: Vincent Donnefort <Vincent.Donnefort@....com>
Cc: peterz@...radead.org, mingo@...hat.com,
linux-kernel@...r.kernel.org, dietmar.eggemann@....com,
Morten.Rasmussen@....com, Chris.Redpath@....com, qperret@...gle.com
Subject: Re: [PATCH v4 2/7] sched/fair: Decay task PELT values during wakeup migration
On Wed, 20 Apr 2022 at 11:34, Vincent Donnefort
<vincent.donnefort@....com> wrote:
>
> On 19/04/2022 17:27, Vincent Guittot wrote:
> > Le mardi 19 avril 2022 � 13:23:27 (+0100), Vincent Donnefort a �crit :
> >>
> >>
> >> On 19/04/2022 11:08, Vincent Guittot wrote:
> >>> On Tue, 12 Apr 2022 at 15:42, Vincent Donnefort
> >>> <vincent.donnefort@....com> wrote:
> >>>>
> >>>> Before being migrated to a new CPU, a task sees its PELT values
> >>>> synchronized with rq last_update_time. Once done, that same task will also
> >>>> have its sched_avg last_update_time reset. This means the time between
> >>>> the migration and the last clock update (B) will not be accounted for in
> >>>> util_avg and a discontinuity will appear. This issue is amplified by the
> >>>> PELT clock scaling. If the clock hasn't been updated while the CPU is
> >>>> idle, clock_pelt will not be aligned with clock_task and that time (A)
> >>>> will be also lost.
> >>>>
> >>>> ---------|----- A -----|-----------|------- B -----|>
> >>>> clock_pelt clock_task clock now
> >>>>
> >>>> This is especially problematic for asymmetric CPU capacity systems which
> >>>> need stable util_avg signals for task placement and energy estimation.
> >>>>
> >>>> Ideally, this problem would be solved by updating the runqueue clocks
> >>>> before the migration. But that would require taking the runqueue lock
> >>>> which is quite expensive [1]. Instead estimate the missing time and update
> >>>> the task util_avg with that value:
> >>>>
> >>>> A + B = clock_task - clock_pelt + sched_clock_cpu() - clock
> >>>>
> >>>> Neither clock_task, clock_pelt nor clock can be accessed without the
> >>>> runqueue lock. The new cfs_rq last_update_lag is therefore created and
> >>>> contains those three values when the last_update_time value for that very
> >>>> same cfs_rq is updated.
> >>>>
> >>>> last_update_lag = clock - clock_task + clock_pelt
> >>>>
> >>>> And we can then write the missing time as follow:
> >>>>
> >>>> A + B = sched_clock_cpu() - last_update_lag
> >>>>
> >>>> The B. part of the missing time is however an estimation that doesn't take
> >>>> into account IRQ and Paravirt time.
> >>>>
> >>>> Now we have an estimation for A + B, we can create an estimator for the
> >>>> PELT value at the time of the migration. We need for this purpose to
> >>>> inject last_update_time which is a combination of both clock_pelt and
> >>>> lost_idle_time. The latter is a time value which is completely lost from a
> >>>> PELT point of view and must be ignored. And finally, we can write:
> >>>>
> >>>> now = last_update_time + A + B
> >>>> = last_update_time + sched_clock_cpu() - last_update_lag
> >>>>
> >>>> This estimation has a cost, mostly due to sched_clock_cpu(). Limit the
> >>>> usage to the case where the source CPU is idle as we know this is when the
> >>>> clock is having the biggest risk of being outdated.
> >>>>
> >>>> [1] https://lore.kernel.org/all/20190709115759.10451-1-chris.redpath@arm.com/
> >>>>
> >>>> Signed-off-by: Vincent Donnefort <vincent.donnefort@....com>
> >
> > [...]
> >
> >>>
> >>> I'm worried that we will call this for each and every
> >>> update_cfs_rq_load_avg() whereas the content will be used only when
> >>> idle and not throttled. Can't we use these conditions to save values
> >>> only when needed and limit the number of useless updates ?
> >>
> >> I don't think we can use idle here as a condition, once it is idle, it is
> >> too late to get those clock values.
> >
> > As an example, the patch below should work. It doesn't handle the throttled case yet and still has to
> > make sure that rq->enter_idle and rq->clock_pelt_idle are coherent in regards to ILB that
> > update blocked load.
>
>
> I had to abandon the per-rq approach from v1 to v2. This is because of
> the following example:
>
> 1. task A sleep
> 2. rq's clock updated (e.g another task runs)
> 3. task A migrated
>
> With a per-rq lag, we would miss the time delta between 1 and 2. We know
> how old is the last clock update. But what we actually want is how old
> is the task's last_update_time.
I don't get your point here. Even after 2, won't task A be correctly
updated with the patch below ? As I mentioned, this doesn't take into
account case where cfs can be throttled (ie CFS_BANDWIDTH)
But applying similar scheme should work
>
> >
> > ---
> > kernel/sched/fair.c | 30 ++++++++++++++++++++++++++++++
> > kernel/sched/pelt.h | 21 ++++++++++++++-------
> > kernel/sched/sched.h | 3 ++-
> > 3 files changed, 46 insertions(+), 8 deletions(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index e6ecf530f19f..f00843f9dd01 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -7005,6 +7005,35 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
> >
> > static void detach_entity_cfs_rq(struct sched_entity *se);
> >
> > +#ifdef CONFIG_NO_HZ_COMMON
> > +static inline void migrate_se_pelt_lag(struct sched_entity *se)
> > +{
> > + u64 now;
> > + struct cfs_rq *cfs_rq;
> > + struct rq *rq;
> > + bool is_idle;
> > +
> > + cfs_rq = cfs_rq_of(se);
> > + rq = rq_of(cfs_rq);
> > +
> > + rcu_read_lock();
> > + is_idle = is_idle_task(rcu_dereference(rq->curr));
> > + rcu_read_unlock();
> > +
> > + if (!is_idle)
> > + return;
> > +
> > + /* TODO handle throttled cfs */
> > + /* TODO handle update ilb blocked load update */
> > + now = READ_ONCE(rq->clock_pelt_idle);
> > + now += sched_clock_cpu(cpu_of(rq)) - READ_ONCE(rq->enter_idle);
> > +
> > + __update_load_avg_blocked_se(now, se);
> > +}
> > +#else
> > +static void migrate_se_pelt_lag(struct sched_entity *se) {}
> > +#endif
> > +
> > /*
> > * Called immediately before a task is migrated to a new CPU; task_cpu(p) and
> > * cfs_rq_of(p) references at time of call are still valid and identify the
> > @@ -7056,6 +7085,7 @@ static void migrate_task_rq_fair(struct task_struct *p, int new_cpu)
> > * sounds not bad.
> > */
> > remove_entity_load_avg(&p->se);
> > + migrate_se_pelt_lag(&p->se);
> > }
> >
> > /* Tell new CPU we are migrated */
> > diff --git a/kernel/sched/pelt.h b/kernel/sched/pelt.h
> > index c336f5f481bc..ece4423026e5 100644
> > --- a/kernel/sched/pelt.h
> > +++ b/kernel/sched/pelt.h
> > @@ -61,6 +61,14 @@ static inline void cfs_se_util_change(struct sched_avg *avg)
> > WRITE_ONCE(avg->util_est.enqueued, enqueued);
> > }
> >
> > +static inline u64 rq_clock_pelt(struct rq *rq)
> > +{
> > + lockdep_assert_rq_held(rq);
> > + assert_clock_updated(rq);
> > +
> > + return rq->clock_pelt - rq->lost_idle_time;
> > +}
> > +
> > /*
> > * The clock_pelt scales the time to reflect the effective amount of
> > * computation done during the running delta time but then sync back to
> > @@ -78,6 +86,8 @@ static inline void update_rq_clock_pelt(struct rq *rq, s64 delta)
> > if (unlikely(is_idle_task(rq->curr))) {
> > /* The rq is idle, we can sync to clock_task */
> > rq->clock_pelt = rq_clock_task(rq);
> > + WRITE_ONCE(rq->enter_idle, rq_clock(rq)); /* this could be factorized with idle_stamp */
> > + WRITE_ONCE(rq->clock_pelt_idle, rq_clock_pelt(rq)); /* last pelt clock update when idle */
> > return;
> > }
> >
> > @@ -130,14 +140,11 @@ static inline void update_idle_rq_clock_pelt(struct rq *rq)
> > */
> > if (util_sum >= divider)
> > rq->lost_idle_time += rq_clock_task(rq) - rq->clock_pelt;
> > -}
> >
> > -static inline u64 rq_clock_pelt(struct rq *rq)
> > -{
> > - lockdep_assert_rq_held(rq);
> > - assert_clock_updated(rq);
> > -
> > - return rq->clock_pelt - rq->lost_idle_time;
> > + /* The rq is idle, we can sync with clock_task */
> > + rq->clock_pelt = rq_clock_task(rq);
> > + WRITE_ONCE(rq->enter_idle, rq_clock(rq)); /* this could be factorized with idle_stamp */
> > + WRITE_ONCE(rq->clock_pelt_idle, rq_clock_pelt(rq)); /* last pelt clock update when idle */
> > }
> >
> > #ifdef CONFIG_CFS_BANDWIDTH
> > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> > index 6ab77b171656..108698446762 100644
> > --- a/kernel/sched/sched.h
> > +++ b/kernel/sched/sched.h
> > @@ -1007,7 +1007,8 @@ struct rq {
> > u64 clock_task ____cacheline_aligned;
> > u64 clock_pelt;
> > unsigned long lost_idle_time;
> > -
> > + u64 clock_pelt_idle;
> > + u64 enter_idle;
> > atomic_t nr_iowait;
> >
> > #ifdef CONFIG_SCHED_DEBUG
Powered by blists - more mailing lists