[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20240628015200.vw75huo53redgkzf@airbuntu>
Date: Fri, 28 Jun 2024 02:52:00 +0100
From: Qais Yousef <qyousef@...alina.io>
To: Dietmar Eggemann <dietmar.eggemann@....com>
Cc: "Rafael J. Wysocki" <rafael@...nel.org>,
Viresh Kumar <viresh.kumar@...aro.org>,
Ingo Molnar <mingo@...nel.org>,
Peter Zijlstra <peterz@...radead.org>,
Vincent Guittot <vincent.guittot@...aro.org>,
Juri Lelli <juri.lelli@...hat.com>,
Steven Rostedt <rostedt@...dmis.org>,
Ben Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>,
Daniel Bristot de Oliveira <bristot@...hat.com>,
Valentin Schneider <vschneid@...hat.com>,
Christian Loehle <christian.loehle@....com>,
Hongyan Xia <hongyan.xia2@....com>,
John Stultz <jstultz@...gle.com>, linux-pm@...r.kernel.org,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH v6] sched: Consolidate cpufreq updates
On 06/25/24 14:58, Dietmar Eggemann wrote:
> > @@ -4917,6 +4927,84 @@ static inline void __balance_callbacks(struct rq *rq)
> >
> > #endif
> >
> > +static __always_inline void
> > +__update_cpufreq_ctx_switch(struct rq *rq, struct task_struct *prev)
> > +{
> > +#ifdef CONFIG_CPU_FREQ
> > + if (prev && prev->dl.flags & SCHED_FLAG_SUGOV) {
> > + /* Sugov just did an update, don't be too aggressive */
> > + return;
> > + }
> > +
> > + /*
> > + * RT and DL should always send a freq update. But we can do some
> > + * simple checks to avoid it when we know it's not necessary.
> > + *
> > + * iowait_boost will always trigger a freq update too.
> > + *
> > + * Fair tasks will only trigger an update if the root cfs_rq has
> > + * decayed.
> > + *
> > + * Everything else should do nothing.
> > + */
> > + switch (current->policy) {
> > + case SCHED_NORMAL:
> > + case SCHED_BATCH:
>
> What about SCHED_IDLE tasks?
I didn't think they matter from cpufreq perspective. These tasks will just run
at whatever the idle system is happen to be at and have no specific perf
requirement since they should only run when the system is idle which a recipe
for starvation anyway?
>
> > + if (unlikely(current->in_iowait)) {
> > + cpufreq_update_util(rq, SCHED_CPUFREQ_IOWAIT | SCHED_CPUFREQ_FORCE_UPDATE);
> > + return;
> > + }
> > +
> > +#ifdef CONFIG_SMP
> > + if (unlikely(rq->cfs.decayed)) {
> > + rq->cfs.decayed = false;
> > + cpufreq_update_util(rq, 0);
> > + return;
> > + }
> > +#else
> > + cpufreq_update_util(rq, 0);
> > +#endif
>
> We can have !CONFIG_SMP and CONFIG_FAIR_GROUP_SCHED systems. Does this
> mean on those systems we call cpufreq_update_util() for each cfs_rq of
> the hierarchy where on CONFIG_SMP we only do this for the root cfs_rq?
No. This is called on context switch only and hierarchy doesn't matter here. We
just do it unconditionally for UP since we only track the decayed at cfs_rq
level and I didn't think it's worth trying to make it at rq level.
>
> [...]
>
> > @@ -4744,8 +4716,8 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
> > if (se->avg.last_update_time && !(flags & SKIP_AGE_LOAD))
> > __update_load_avg_se(now, cfs_rq, se);
> >
> > - decayed = update_cfs_rq_load_avg(now, cfs_rq);
> > - decayed |= propagate_entity_load_avg(se);
> > + cfs_rq->decayed |= update_cfs_rq_load_avg(now, cfs_rq);
> > + cfs_rq->decayed |= propagate_entity_load_avg(se);
> >
> > if (!se->avg.last_update_time && (flags & DO_ATTACH)) {
> >
> > @@ -4766,11 +4738,8 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
> > */
> > detach_entity_load_avg(cfs_rq, se);
> > update_tg_load_avg(cfs_rq);
> > - } else if (decayed) {
> > - cfs_rq_util_change(cfs_rq, 0);
> > -
> > - if (flags & UPDATE_TG)
> > - update_tg_load_avg(cfs_rq);
> > + } else if (cfs_rq->decayed && (flags & UPDATE_TG)) {
> > + update_tg_load_avg(cfs_rq);
> > }
> > }
>
> You set cfs_rq->decayed for each taskgroup level but you only reset it
> for the root cfs_rq in __update_cpufreq_ctx_switch() and task_tick_fair()?
Yes. We only care about using it for root level. Tracking the information at
cfs_rq level is the most natural way to do it as this is what update_load_avg()
is acting on.
>
> [...]
>
> > @@ -8418,6 +8378,14 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
> > if (pick_eevdf(cfs_rq) == pse)
> > goto preempt;
> >
> > +nopreempt:
> > +#ifdef CONFIG_SMP
> > + if (rq->cfs.decayed && rq->cfs.h_nr_running > 1)
> > + cpufreq_update_util(rq, SCHED_CPUFREQ_TASK_ENQUEUED);
>
> Why don't you set rq->cfs.decayed to false here as well?
>
> Is it because the request might fail in sugov_should_update_freq() in
> case 'delta_ns < sysctl_sched_base_slice'?
Yes. This call is likely to fail and we don't have a way to get a feedback to
know whether it went through or not.
FWIW I already have a patch that I considered sending along this submission
but opted not to as it'll make things too complicated.
But FWIW, I make cpufreq_update_util() return a bool to indicate whether
cpufreq update has happened or not and this helps to reset rq->cfs.decayed more
accurately in all call site like this one. But I think this should be another
independent patch.
Thanks!
--
Qais Yousef
Powered by blists - more mailing lists