[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <0d354ce1-9b43-feae-1065-5d8a78c56ccc@arm.com>
Date: Tue, 19 Apr 2022 13:23:27 +0100
From: Vincent Donnefort <vincent.donnefort@....com>
To: Vincent Guittot <vincent.guittot@...aro.org>
Cc: peterz@...radead.org, mingo@...hat.com,
linux-kernel@...r.kernel.org, dietmar.eggemann@....com,
morten.rasmussen@....com, chris.redpath@....com, qperret@...gle.com
Subject: Re: [PATCH v4 2/7] sched/fair: Decay task PELT values during wakeup
migration
On 19/04/2022 11:08, Vincent Guittot wrote:
> On Tue, 12 Apr 2022 at 15:42, Vincent Donnefort
> <vincent.donnefort@....com> wrote:
>>
>> Before being migrated to a new CPU, a task sees its PELT values
>> synchronized with rq last_update_time. Once done, that same task will also
>> have its sched_avg last_update_time reset. This means the time between
>> the migration and the last clock update (B) will not be accounted for in
>> util_avg and a discontinuity will appear. This issue is amplified by the
>> PELT clock scaling. If the clock hasn't been updated while the CPU is
>> idle, clock_pelt will not be aligned with clock_task and that time (A)
>> will be also lost.
>>
>> ---------|----- A -----|-----------|------- B -----|>
>> clock_pelt clock_task clock now
>>
>> This is especially problematic for asymmetric CPU capacity systems which
>> need stable util_avg signals for task placement and energy estimation.
>>
>> Ideally, this problem would be solved by updating the runqueue clocks
>> before the migration. But that would require taking the runqueue lock
>> which is quite expensive [1]. Instead estimate the missing time and update
>> the task util_avg with that value:
>>
>> A + B = clock_task - clock_pelt + sched_clock_cpu() - clock
>>
>> Neither clock_task, clock_pelt nor clock can be accessed without the
>> runqueue lock. The new cfs_rq last_update_lag is therefore created and
>> contains those three values when the last_update_time value for that very
>> same cfs_rq is updated.
>>
>> last_update_lag = clock - clock_task + clock_pelt
>>
>> And we can then write the missing time as follow:
>>
>> A + B = sched_clock_cpu() - last_update_lag
>>
>> The B. part of the missing time is however an estimation that doesn't take
>> into account IRQ and Paravirt time.
>>
>> Now we have an estimation for A + B, we can create an estimator for the
>> PELT value at the time of the migration. We need for this purpose to
>> inject last_update_time which is a combination of both clock_pelt and
>> lost_idle_time. The latter is a time value which is completely lost from a
>> PELT point of view and must be ignored. And finally, we can write:
>>
>> now = last_update_time + A + B
>> = last_update_time + sched_clock_cpu() - last_update_lag
>>
>> This estimation has a cost, mostly due to sched_clock_cpu(). Limit the
>> usage to the case where the source CPU is idle as we know this is when the
>> clock is having the biggest risk of being outdated.
>>
>> [1] https://lore.kernel.org/all/20190709115759.10451-1-chris.redpath@arm.com/
>>
>> Signed-off-by: Vincent Donnefort <vincent.donnefort@....com>
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 5dd38c9df0cc..e234d015657f 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -3694,6 +3694,57 @@ static inline void add_tg_cfs_propagate(struct cfs_rq *cfs_rq, long runnable_sum
>>
>> #endif /* CONFIG_FAIR_GROUP_SCHED */
>>
>> +#ifdef CONFIG_NO_HZ_COMMON
>> +static inline void update_cfs_rq_lag(struct cfs_rq *cfs_rq)
>> +{
>> + struct rq *rq = rq_of(cfs_rq);
>> +
>> + u64_u32_store(cfs_rq->last_update_lag,
>> +#ifdef CONFIG_CFS_BANDWIDTH
>> + /* Timer stopped by throttling */
>> + unlikely(cfs_rq->throttle_count) ? U64_MAX :
>> +#endif
>> + rq->clock - rq->clock_task + rq->clock_pelt);
>
> I'm worried that we will call this for each and every
> update_cfs_rq_load_avg() whereas the content will be used only when
> idle and not throttled. Can't we use these conditions to save values
> only when needed and limit the number of useless updates ?
I don't think we can use idle here as a condition, once it is idle, it
is too late to get those clock values.
>
> A quick test with hackbench on a 8 cpus system gives
> around 60k call for a duration 550 msec run a root level
> and 180k from a 3rd level cgroups
If you think this is too much to have this store on this path, we could
though either:
a. restrict the feature to when we know it is the most important: EAS?
b. store the updated value only when "decayed" is non 0.
[...]
Powered by blists - more mailing lists