linux-kernel - Re: [PATCH 1/2] sched/schedutil: rework performance estimation

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <50c3fc7c-7f0a-4acb-b6da-b39dc123d9b9@arm.com>
Date:   Wed, 18 Oct 2023 13:20:15 +0100
From:   Lukasz Luba <lukasz.luba@....com>
To:     Beata Michalska <beata.michalska@....com>,
        Vincent Guittot <vincent.guittot@...aro.org>
Cc:     mingo@...hat.com, peterz@...radead.org, juri.lelli@...hat.com,
        dietmar.eggemann@....com, rostedt@...dmis.org, bsegall@...gle.com,
        mgorman@...e.de, bristot@...hat.com, vschneid@...hat.com,
        rafael@...nel.org, viresh.kumar@...aro.org, qyousef@...alina.io,
        linux-kernel@...r.kernel.org, linux-pm@...r.kernel.org
Subject: Re: [PATCH 1/2] sched/schedutil: rework performance estimation



On 10/18/23 08:04, Beata Michalska wrote:
> Hi Vincent,
> 
> On Fri, Oct 13, 2023 at 05:14:49PM +0200, Vincent Guittot wrote:
>> The current method to take into account uclamp hints when estimating the
>> target frequency can end into situation where the selected target
>> frequency is finally higher than uclamp hints whereas there are no real
>> needs. Such cases mainly happen because we are currently mixing the
>> traditional scheduler utilization signal with the uclamp performance
>> hints. By adding these 2 metrics, we loose an important information when
>> it comes to select the target frequency and we have to make some
>> assumptions which can't fit all cases.

[snip]

>>
>> diff --git a/include/linux/energy_model.h b/include/linux/energy_model.h
>> index b9caa01dfac4..adec808b371a 100644
>> --- a/include/linux/energy_model.h
>> +++ b/include/linux/energy_model.h
>> @@ -243,7 +243,6 @@ static inline unsigned long em_cpu_energy(struct em_perf_domain *pd,
>>   	scale_cpu = arch_scale_cpu_capacity(cpu);
>>   	ps = &pd->table[pd->nr_perf_states - 1];
>>   
>> -	max_util = map_util_perf(max_util);
> Even though the effective_cpu_util does no longer include the headroom, it is
> being applied by sugov further down the line (sugov_effective_cpu_perf).
> Won't that bring back the original problem when freq selection within EM is
> not align with the one performed by sugov ?

It should be OK here to remove the above line. The map_util_perf()
is done before this em_cpu_energy() call, in the new code in


[snip]

>>   }
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 922905194c0c..d4f7b2f49c44 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -7628,7 +7628,7 @@ static inline void eenv_pd_busy_time(struct energy_env *eenv,
>>   	for_each_cpu(cpu, pd_cpus) {
>>   		unsigned long util = cpu_util(cpu, p, -1, 0);
>>   
>> -		busy_time += effective_cpu_util(cpu, util, ENERGY_UTIL, NULL);
>> +		busy_time += effective_cpu_util(cpu, util, NULL, NULL);
>>   	}
>>   
>>   	eenv->pd_busy_time = min(eenv->pd_cap, busy_time);
>> @@ -7651,7 +7651,7 @@ eenv_pd_max_util(struct energy_env *eenv, struct cpumask *pd_cpus,
>>   	for_each_cpu(cpu, pd_cpus) {
>>   		struct task_struct *tsk = (cpu == dst_cpu) ? p : NULL;
>>   		unsigned long util = cpu_util(cpu, p, dst_cpu, 1);
>> -		unsigned long eff_util;
>> +		unsigned long eff_util, min, max;
>>   
>>   		/*
>>   		 * Performance domain frequency: utilization clamping
>> @@ -7660,7 +7660,23 @@ eenv_pd_max_util(struct energy_env *eenv, struct cpumask *pd_cpus,
>>   		 * NOTE: in case RT tasks are running, by default the
>>   		 * FREQUENCY_UTIL's utilization can be max OPP.
>>   		 */
>> -		eff_util = effective_cpu_util(cpu, util, FREQUENCY_UTIL, tsk);
>> +		eff_util = effective_cpu_util(cpu, util, &min, &max);
>> +
>> +		/* Task's uclamp can modify min and max value */
>> +		if (tsk && uclamp_is_used()) {
>> +			min = max(min, uclamp_eff_value(p, UCLAMP_MIN));
>> +
>> +			/*
>> +			 * If there is no active max uclamp constraint,
>> +			 * directly use task's one otherwise keep max
>> +			 */
>> +			if (uclamp_rq_is_idle(cpu_rq(cpu)))
>> +				max = uclamp_eff_value(p, UCLAMP_MAX);
>> +			else
>> +				max = max(max, uclamp_eff_value(p, UCLAMP_MAX));
>> +		}
>> +
>> +		eff_util = sugov_effective_cpu_perf(cpu, eff_util, min, max);
> This will include the headroom so won't it inflate the util here ?

Yes, that's the goal. It will inflate when needed. Currently, the
problem is that we always inflate (blindly) in the em_cpu_energy().
We don't know if the util value which is comming is from uclamp_max
and the frequency should not be higher, because something want to
clamp it.

The other question would be:
What if the PD has 4 CPUs, the max util found is 500 and is from
uclamp_max, but there is onother util on some other CPU 490?

That CPU is allowed and can have the +20% freq in voting.
In current design we don't punish the whole domain in such
scenario.