Message-ID: <20240705002205.nnrgq7savzvsoqgl@airbuntu>
Date: Fri, 5 Jul 2024 01:22:05 +0100
From: Qais Yousef <qyousef@...alina.io>
To: Dietmar Eggemann <dietmar.eggemann@....com>
Cc: "Rafael J. Wysocki" <rafael@...nel.org>,
Viresh Kumar <viresh.kumar@...aro.org>,
Ingo Molnar <mingo@...nel.org>,
Peter Zijlstra <peterz@...radead.org>,
Vincent Guittot <vincent.guittot@...aro.org>,
Juri Lelli <juri.lelli@...hat.com>,
Steven Rostedt <rostedt@...dmis.org>,
Ben Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>,
Daniel Bristot de Oliveira <bristot@...hat.com>,
Valentin Schneider <vschneid@...hat.com>,
Christian Loehle <christian.loehle@....com>,
Hongyan Xia <hongyan.xia2@....com>,
John Stultz <jstultz@...gle.com>, linux-pm@...r.kernel.org,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH v6] sched: Consolidate cpufreq updates
On 07/04/24 12:12, Dietmar Eggemann wrote:
> On 28/06/2024 03:52, Qais Yousef wrote:
> > On 06/25/24 14:58, Dietmar Eggemann wrote:
> >
> >>> @@ -4917,6 +4927,84 @@ static inline void __balance_callbacks(struct rq *rq)
> >>>
> >>> #endif
> >>>
> >>> +static __always_inline void
> >>> +__update_cpufreq_ctx_switch(struct rq *rq, struct task_struct *prev)
> >>> +{
> >>> +#ifdef CONFIG_CPU_FREQ
> >>> + if (prev && prev->dl.flags & SCHED_FLAG_SUGOV) {
> >>> + /* Sugov just did an update, don't be too aggressive */
> >>> + return;
> >>> + }
> >>> +
> >>> + /*
> >>> + * RT and DL should always send a freq update. But we can do some
> >>> + * simple checks to avoid it when we know it's not necessary.
> >>> + *
> >>> + * iowait_boost will always trigger a freq update too.
> >>> + *
> >>> + * Fair tasks will only trigger an update if the root cfs_rq has
> >>> + * decayed.
> >>> + *
> >>> + * Everything else should do nothing.
> >>> + */
> >>> + switch (current->policy) {
> >>> + case SCHED_NORMAL:
> >>> + case SCHED_BATCH:
> >>
> >> What about SCHED_IDLE tasks?
> >
> > I didn't think they matter from a cpufreq perspective. These tasks will just
> > run at whatever frequency the system happens to be at and have no specific
> > perf requirement, since they should only run when the system is idle, which
> > is a recipe for starvation anyway?
>
> Not sure we talk about the same thing here? idle_sched_class vs.
> SCHED_IDLE policy (FAIR task with a tiny weight of WEIGHT_IDLEPRIO).
Yes, I am referring to the SCHED_IDLE policy too. What is your expectation? AFAIK
the goal of this policy is to run only when nothing else needs to run.
>
> >>> + if (unlikely(current->in_iowait)) {
> >>> + cpufreq_update_util(rq, SCHED_CPUFREQ_IOWAIT | SCHED_CPUFREQ_FORCE_UPDATE);
> >>> + return;
> >>> + }
> >>> +
> >>> +#ifdef CONFIG_SMP
> >>> + if (unlikely(rq->cfs.decayed)) {
> >>> + rq->cfs.decayed = false;
> >>> + cpufreq_update_util(rq, 0);
> >>> + return;
> >>> + }
> >>> +#else
> >>> + cpufreq_update_util(rq, 0);
> >>> +#endif
> >>
> >> We can have !CONFIG_SMP and CONFIG_FAIR_GROUP_SCHED systems. Does this
> >> mean on those systems we call cpufreq_update_util() for each cfs_rq of
> >> the hierarchy where on CONFIG_SMP we only do this for the root cfs_rq?
> >
> > No. This is called on context switch only and the hierarchy doesn't matter
> > here. We just do it unconditionally for UP since we only track decayed at
> > the cfs_rq level and I didn't think it was worth trying to track it at the
> > rq level.
>
> OK, I see. The call in __update_cpufreq_ctx_switch() plus
> (task_tick_fair() and check_preempt_wakeup_fair()) are not related to a
> cfs_rq, but rather to the rq and/or task directly.
>
> Currently, for !CONFIG_SMP, we have the call in update_load_avg() where we
> use cfs_rq_util_change(), which only calls cpufreq_update_util() for the
> root cfs_rq, but this clearly has a cfs_rq context.
>
> >> [...]
> >>
> >>> @@ -4744,8 +4716,8 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
> >>> if (se->avg.last_update_time && !(flags & SKIP_AGE_LOAD))
> >>> __update_load_avg_se(now, cfs_rq, se);
> >>>
> >>> - decayed = update_cfs_rq_load_avg(now, cfs_rq);
> >>> - decayed |= propagate_entity_load_avg(se);
> >>> + cfs_rq->decayed |= update_cfs_rq_load_avg(now, cfs_rq);
> >>> + cfs_rq->decayed |= propagate_entity_load_avg(se);
> >>>
> >>> if (!se->avg.last_update_time && (flags & DO_ATTACH)) {
> >>>
> >>> @@ -4766,11 +4738,8 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
> >>> */
> >>> detach_entity_load_avg(cfs_rq, se);
> >>> update_tg_load_avg(cfs_rq);
> >>> - } else if (decayed) {
> >>> - cfs_rq_util_change(cfs_rq, 0);
> >>> -
> >>> - if (flags & UPDATE_TG)
> >>> - update_tg_load_avg(cfs_rq);
> >>> + } else if (cfs_rq->decayed && (flags & UPDATE_TG)) {
> >>> + update_tg_load_avg(cfs_rq);
> >>> }
> >>> }
> >>
> >> You set cfs_rq->decayed for each taskgroup level but you only reset it
> >> for the root cfs_rq in __update_cpufreq_ctx_switch() and task_tick_fair()?
> >
> > Yes. We only care about using it for root level. Tracking the information at
> > cfs_rq level is the most natural way to do it as this is what update_load_avg()
> > is acting on.
>
> But IMHO this creates an issue with those non-root cfs_rq's within
I am not seeing the issue, could you expand on what it is?
> update_load_avg() itself. They will stay decayed after cfs_rq->decayed
> has been set to 1 once and will never be reset to 0. So with UPDATE_TG,
> update_tg_load_avg() will then be called on those non-root cfs_rq's
> every time.
We could add a check to update only the root cfs_rq. But what do we gain? Or
IOW, what is the harm of unconditionally updating cfs_rq->decayed, given that
we only care about the root cfs_rq? I see more if conditions and branches,
which I am trying to avoid.