Message-ID: <d510f41a-1225-46d9-a2d7-ff9e6ff599d2@arm.com>
Date: Thu, 4 Jul 2024 12:12:30 +0200
From: Dietmar Eggemann <dietmar.eggemann@....com>
To: Qais Yousef <qyousef@...alina.io>
Cc: "Rafael J. Wysocki" <rafael@...nel.org>,
Viresh Kumar <viresh.kumar@...aro.org>, Ingo Molnar <mingo@...nel.org>,
Peter Zijlstra <peterz@...radead.org>,
Vincent Guittot <vincent.guittot@...aro.org>,
Juri Lelli <juri.lelli@...hat.com>, Steven Rostedt <rostedt@...dmis.org>,
Ben Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>,
Daniel Bristot de Oliveira <bristot@...hat.com>,
Valentin Schneider <vschneid@...hat.com>,
Christian Loehle <christian.loehle@....com>,
Hongyan Xia <hongyan.xia2@....com>, John Stultz <jstultz@...gle.com>,
linux-pm@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH v6] sched: Consolidate cpufreq updates
On 28/06/2024 03:52, Qais Yousef wrote:
> On 06/25/24 14:58, Dietmar Eggemann wrote:
>
>>> @@ -4917,6 +4927,84 @@ static inline void __balance_callbacks(struct rq *rq)
>>>
>>> #endif
>>>
>>> +static __always_inline void
>>> +__update_cpufreq_ctx_switch(struct rq *rq, struct task_struct *prev)
>>> +{
>>> +#ifdef CONFIG_CPU_FREQ
>>> + if (prev && prev->dl.flags & SCHED_FLAG_SUGOV) {
>>> + /* Sugov just did an update, don't be too aggressive */
>>> + return;
>>> + }
>>> +
>>> + /*
>>> + * RT and DL should always send a freq update. But we can do some
>>> + * simple checks to avoid it when we know it's not necessary.
>>> + *
>>> + * iowait_boost will always trigger a freq update too.
>>> + *
>>> + * Fair tasks will only trigger an update if the root cfs_rq has
>>> + * decayed.
>>> + *
>>> + * Everything else should do nothing.
>>> + */
>>> + switch (current->policy) {
>>> + case SCHED_NORMAL:
>>> + case SCHED_BATCH:
>>
>> What about SCHED_IDLE tasks?
>
> I didn't think they matter from a cpufreq perspective. These tasks will just
> run at whatever frequency the idle system happens to be at and have no
> specific perf requirement, since they should only run when the system is
> idle, which is a recipe for starvation anyway?
Not sure we talk about the same thing here? idle_sched_class vs.
SCHED_IDLE policy (FAIR task with a tiny weight of WEIGHT_IDLEPRIO).
>>> + if (unlikely(current->in_iowait)) {
>>> + cpufreq_update_util(rq, SCHED_CPUFREQ_IOWAIT | SCHED_CPUFREQ_FORCE_UPDATE);
>>> + return;
>>> + }
>>> +
>>> +#ifdef CONFIG_SMP
>>> + if (unlikely(rq->cfs.decayed)) {
>>> + rq->cfs.decayed = false;
>>> + cpufreq_update_util(rq, 0);
>>> + return;
>>> + }
>>> +#else
>>> + cpufreq_update_util(rq, 0);
>>> +#endif
>>
>> We can have !CONFIG_SMP and CONFIG_FAIR_GROUP_SCHED systems. Does this
>> mean on those systems we call cpufreq_update_util() for each cfs_rq of
>> the hierarchy where on CONFIG_SMP we only do this for the root cfs_rq?
>
> No. This is called on context switch only and the hierarchy doesn't matter
> here. We just do it unconditionally for UP since we only track decayed at
> cfs_rq level and I didn't think it was worth trying to track it at rq level.
OK, I see. The calls in __update_cpufreq_ctx_switch(), task_tick_fair() and
check_preempt_wakeup_fair() are not related to a cfs_rq, but rather to the rq
and/or task directly.
Currently we have the call in update_load_avg() for !CONFIG_SMP, and there we
use cfs_rq_util_change(), which only calls cpufreq_update_util() for the root
cfs_rq but clearly has a cfs_rq context.
>> [...]
>>
>>> @@ -4744,8 +4716,8 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
>>> if (se->avg.last_update_time && !(flags & SKIP_AGE_LOAD))
>>> __update_load_avg_se(now, cfs_rq, se);
>>>
>>> - decayed = update_cfs_rq_load_avg(now, cfs_rq);
>>> - decayed |= propagate_entity_load_avg(se);
>>> + cfs_rq->decayed |= update_cfs_rq_load_avg(now, cfs_rq);
>>> + cfs_rq->decayed |= propagate_entity_load_avg(se);
>>>
>>> if (!se->avg.last_update_time && (flags & DO_ATTACH)) {
>>>
>>> @@ -4766,11 +4738,8 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
>>> */
>>> detach_entity_load_avg(cfs_rq, se);
>>> update_tg_load_avg(cfs_rq);
>>> - } else if (decayed) {
>>> - cfs_rq_util_change(cfs_rq, 0);
>>> -
>>> - if (flags & UPDATE_TG)
>>> - update_tg_load_avg(cfs_rq);
>>> + } else if (cfs_rq->decayed && (flags & UPDATE_TG)) {
>>> + update_tg_load_avg(cfs_rq);
>>> }
>>> }
>>
>> You set cfs_rq->decayed for each taskgroup level but you only reset it
>> for the root cfs_rq in __update_cpufreq_ctx_switch() and task_tick_fair()?
>
> Yes. We only care about using it for root level. Tracking the information at
> cfs_rq level is the most natural way to do it as this is what update_load_avg()
> is acting on.
But IMHO this creates an issue for the non-root cfs_rq's within
update_load_avg() itself: they stay decayed once cfs_rq->decayed has been set
to 1 and are never reset to 0. So with UPDATE_TG, update_tg_load_avg() will
then be called on those non-root cfs_rq's every time.
[...]