Message-ID: <d510f41a-1225-46d9-a2d7-ff9e6ff599d2@arm.com>
Date: Thu, 4 Jul 2024 12:12:30 +0200
From: Dietmar Eggemann <dietmar.eggemann@....com>
To: Qais Yousef <qyousef@...alina.io>
Cc: "Rafael J. Wysocki" <rafael@...nel.org>,
Viresh Kumar <viresh.kumar@...aro.org>, Ingo Molnar <mingo@...nel.org>,
Peter Zijlstra <peterz@...radead.org>,
Vincent Guittot <vincent.guittot@...aro.org>,
Juri Lelli <juri.lelli@...hat.com>, Steven Rostedt <rostedt@...dmis.org>,
Ben Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>,
Daniel Bristot de Oliveira <bristot@...hat.com>,
Valentin Schneider <vschneid@...hat.com>,
Christian Loehle <christian.loehle@....com>,
Hongyan Xia <hongyan.xia2@....com>, John Stultz <jstultz@...gle.com>,
linux-pm@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH v6] sched: Consolidate cpufreq updates
On 28/06/2024 03:52, Qais Yousef wrote:
> On 06/25/24 14:58, Dietmar Eggemann wrote:
>
>>> @@ -4917,6 +4927,84 @@ static inline void __balance_callbacks(struct rq *rq)
>>>
>>> #endif
>>>
>>> +static __always_inline void
>>> +__update_cpufreq_ctx_switch(struct rq *rq, struct task_struct *prev)
>>> +{
>>> +#ifdef CONFIG_CPU_FREQ
>>> + if (prev && prev->dl.flags & SCHED_FLAG_SUGOV) {
>>> + /* Sugov just did an update, don't be too aggressive */
>>> + return;
>>> + }
>>> +
>>> + /*
>>> + * RT and DL should always send a freq update. But we can do some
>>> + * simple checks to avoid it when we know it's not necessary.
>>> + *
>>> + * iowait_boost will always trigger a freq update too.
>>> + *
>>> + * Fair tasks will only trigger an update if the root cfs_rq has
>>> + * decayed.
>>> + *
>>> + * Everything else should do nothing.
>>> + */
>>> + switch (current->policy) {
>>> + case SCHED_NORMAL:
>>> + case SCHED_BATCH:
>>
>> What about SCHED_IDLE tasks?
>
> I didn't think they matter from a cpufreq perspective. These tasks will just
> run at whatever frequency the idle system happens to be at and have no
> specific perf requirement, since they should only run when the system is
> idle, which is a recipe for starvation anyway?
Not sure we talk about the same thing here? idle_sched_class vs.
SCHED_IDLE policy (FAIR task with a tiny weight of WEIGHT_IDLEPRIO).
>>> + if (unlikely(current->in_iowait)) {
>>> + cpufreq_update_util(rq, SCHED_CPUFREQ_IOWAIT | SCHED_CPUFREQ_FORCE_UPDATE);
>>> + return;
>>> + }
>>> +
>>> +#ifdef CONFIG_SMP
>>> + if (unlikely(rq->cfs.decayed)) {
>>> + rq->cfs.decayed = false;
>>> + cpufreq_update_util(rq, 0);
>>> + return;
>>> + }
>>> +#else
>>> + cpufreq_update_util(rq, 0);
>>> +#endif
>>
>> We can have !CONFIG_SMP and CONFIG_FAIR_GROUP_SCHED systems. Does this
>> mean on those systems we call cpufreq_update_util() for each cfs_rq of
>> the hierarchy where on CONFIG_SMP we only do this for the root cfs_rq?
>
> No. This is called on context switch only and the hierarchy doesn't matter
> here. We just do it unconditionally for UP since we only track decayed at
> cfs_rq level and I didn't think it was worth trying to track it at rq level.
OK, I see. The calls in __update_cpufreq_ctx_switch(), task_tick_fair() and
check_preempt_wakeup_fair() are not related to a cfs_rq, but rather to the rq
and/or task directly.
Currently we have the call in update_load_avg() for !CONFIG_SMP, and there we
use cfs_rq_util_change(), which only calls cpufreq_update_util() for the root
cfs_rq but clearly has a cfs_rq context.
>> [...]
>>
>>> @@ -4744,8 +4716,8 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
>>> if (se->avg.last_update_time && !(flags & SKIP_AGE_LOAD))
>>> __update_load_avg_se(now, cfs_rq, se);
>>>
>>> - decayed = update_cfs_rq_load_avg(now, cfs_rq);
>>> - decayed |= propagate_entity_load_avg(se);
>>> + cfs_rq->decayed |= update_cfs_rq_load_avg(now, cfs_rq);
>>> + cfs_rq->decayed |= propagate_entity_load_avg(se);
>>>
>>> if (!se->avg.last_update_time && (flags & DO_ATTACH)) {
>>>
>>> @@ -4766,11 +4738,8 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
>>> */
>>> detach_entity_load_avg(cfs_rq, se);
>>> update_tg_load_avg(cfs_rq);
>>> - } else if (decayed) {
>>> - cfs_rq_util_change(cfs_rq, 0);
>>> -
>>> - if (flags & UPDATE_TG)
>>> - update_tg_load_avg(cfs_rq);
>>> + } else if (cfs_rq->decayed && (flags & UPDATE_TG)) {
>>> + update_tg_load_avg(cfs_rq);
>>> }
>>> }
>>
>> You set cfs_rq->decayed for each taskgroup level but you only reset it
>> for the root cfs_rq in __update_cpufreq_ctx_switch() and task_tick_fair()?
>
> Yes. We only care about using it for root level. Tracking the information at
> cfs_rq level is the most natural way to do it as this is what update_load_avg()
> is acting on.
But IMHO this creates an issue for the non-root cfs_rq's within
update_load_avg() itself: they stay decayed once cfs_rq->decayed has been set
to 1 and are never reset to 0. So with UPDATE_TG, update_tg_load_avg() will
then be called on those non-root cfs_rq's every time.
[...]