linux-kernel - Re: Questions about transition latency and LATENCY

Message-ID: <20240529104947.o3pnahmcm7wzi6jb@airbuntu>
Date: Wed, 29 May 2024 11:49:47 +0100
From: Qais Yousef <qyousef@...alina.io>
To: Viresh Kumar <viresh.kumar@...aro.org>
Cc: "Rafael J. Wysocki" <rafael@...nel.org>, Ingo Molnar <mingo@...nel.org>,
	Peter Zijlstra <peterz@...radead.org>,
	Vincent Guittot <vincent.guittot@...aro.org>,
	linux-kernel@...r.kernel.org, linux-pm@...r.kernel.org
Subject: Re: Questions about transition latency and LATENCY_MULTIPLIER

Hi Viresh

On 05/29/24 12:39, Viresh Kumar wrote:
> Hi Qais,
> 
> On 28-05-24, 02:21, Qais Yousef wrote:
> > Hi
> > 
> > I am trying to understand the reason behind the usage of LATENCY_MULTIPLIER
> > to create transition_delay_us. It is set to 1000 by default and when I tried to
> > dig into the history I couldn't reach the original commit as the code has gone
> > through many transformations and I gave up finding the first commit that
> > introduced it.
> 
> The changes came along with the initial commits for conservative and ondemand
> governors, i.e. before 2005.

Thanks for the tip!

> 
> > Generally I am seeing that rate_limit_us in schedutil (which is largely
> > influenced by this multiplier on most/all systems I am working on) is too high
> > compared to the cpuinfo_transition_latency reported by the driver
> > 
> > For example on my M1 mac mini I get 50 and 56us. rate_limit_us is 10ms (on 6.8
> > kernel, should become 2ms after my fix)
> > 
> > 	$ grep . /sys/devices/system/cpu/cpufreq/policy*/cpuinfo_transition_latency
> > 	/sys/devices/system/cpu/cpufreq/policy0/cpuinfo_transition_latency:50000
> > 	/sys/devices/system/cpu/cpufreq/policy4/cpuinfo_transition_latency:56000
> > 
> > On AMD Ryzen it reads 0, and we end up with LATENCY_MULTIPLIER (1000 = 1ms) as
> > the rate_limit_us.
> > 
> > On Intel i5 I get 20us, but rate_limit_us is 5ms, which is requested explicitly
> > by the intel_pstate driver
> > 
> > 	$ grep . /sys/devices/system/cpu/cpufreq/policy*/cpuinfo_transition_latency
> > 	/sys/devices/system/cpu/cpufreq/policy0/cpuinfo_transition_latency:20000
> > 	/sys/devices/system/cpu/cpufreq/policy1/cpuinfo_transition_latency:20000
> > 	/sys/devices/system/cpu/cpufreq/policy2/cpuinfo_transition_latency:20000
> > 	/sys/devices/system/cpu/cpufreq/policy3/cpuinfo_transition_latency:20000
> > 	/sys/devices/system/cpu/cpufreq/policy4/cpuinfo_transition_latency:20000
> > 	/sys/devices/system/cpu/cpufreq/policy5/cpuinfo_transition_latency:20000
> > 	/sys/devices/system/cpu/cpufreq/policy6/cpuinfo_transition_latency:20000
> > 	/sys/devices/system/cpu/cpufreq/policy7/cpuinfo_transition_latency:20000
> > 
> > The question I have is: why so high? If hardware got so good, why can't we
> > leverage the hardware's ability to change frequencies quickly more often?
> 
> From my understanding, this is about not changing the frequency too often.
> That's all. It is historical, and we probably didn't get better numbers when
> this was reduced to a lower value later on either.
> 
> > This is important because due to uclamp usage, we can end up with less gradual
> > transitions between frequencies and we can jump up and down more often. The
> > smaller this value is, the better we can handle fast transitions to boost or
> > cap frequencies based on a task's requirements when it context switches.
> > But the rate limit is generally too high for the hardware, and I wanted to
> > understand if this is purely historical or if we still have reasons to worry.
> 
> Maybe Rafael knows other reasons, but this is all I remember.
> 
> > From what I've seen so far, it seems to me this higher rate limit is helping
> > performance as bursty tasks are more likely to find the CPU running at higher
> > frequencies due to this behavior. I think this is something I can help these
> > bursty tasks with without relying accidentally on this being higher.
> > 
> > Is there any worry on using cpuinfo_transition_latency as is if the driver
> > doesn't provide transition_delay_us?
> 
> Won't we keep changing the frequency continuously in that case ? Or am I
> misunderstanding something ?

I have schedutil in mind, and it shouldn't. Other governors maybe. Should it be
up to the governor to scale this then?

For schedutil it shouldn't because utilization changes gradually. But we could
have events where tasks migrate between policies, and if such a task has a big
util_avg then we can have big jumps. If these migrations are frequent, then
yeah, we can end up with such scenarios. But isn't this desired? We want the previous
policy to bring the frequency down ASAP to save power, and the new policy to go
up in frequency to accommodate for the new task.
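
To illustrate the effect of the rate limit on such jumps, here is a minimal
Python sketch. It is a simplified model and not schedutil's actual logic: the
function name `applied_changes`, the trace and the frequencies are all
hypothetical, and a rate-limited request is simply dropped here, whereas the
real governor re-evaluates on the next scheduler update.

```python
def applied_changes(requests, rate_limit_us):
    """Return the (time_us, freq) requests that pass a simple rate limit.

    requests: iterable of (time_us, freq) tuples sorted by time.
    A request is applied only if it asks for a different frequency than the
    last applied one and at least rate_limit_us has elapsed since then.
    """
    applied = []
    last_time = None
    last_freq = None
    for time_us, freq in requests:
        if freq == last_freq:
            continue  # nothing to change
        if last_time is not None and time_us - last_time < rate_limit_us:
            continue  # too soon after the previous change
        applied.append((time_us, freq))
        last_time, last_freq = time_us, freq
    return applied


# Hypothetical trace (time in us, frequency in MHz): a big task migrates in
# at t=500us, leaves at t=3000us, and another one arrives at t=3200us.
trace = [(0, 600), (500, 2000), (3000, 600), (3200, 2000)]

for limit_us in (10_000, 2_000, 50):
    print(limit_us, applied_changes(trace, limit_us))
```

With a 10ms limit none of the jumps in this trace get through; with a limit
close to the reported transition latency all of them do.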

The only issue I see is the !fast_switch case, where schedutil needs to add some
additional delay because a kworker has to be triggered to perform the actual
request.

I haven't been looking at other governors to be honest. But if I am to propose
something I'll make sure they are not impacted.
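
For reference, the numbers quoted above are consistent with the following
derivation of the default rate limit. This is a minimal Python sketch based
only on the behaviour described in this thread, not a copy of the kernel
code; the function name `default_transition_delay_us` and its parameters
`driver_delay_us` and `cap_us` are hypothetical.

```python
LATENCY_MULTIPLIER = 1000
NSEC_PER_USEC = 1000


def default_transition_delay_us(transition_latency_ns, driver_delay_us=0,
                                cap_us=10_000):
    """Model of how rate_limit_us ends up with the values seen in the thread.

    transition_latency_ns: value read from cpuinfo_transition_latency.
    driver_delay_us: delay requested explicitly by the driver, 0 if none.
    cap_us: upper bound on the scaled latency (10ms on 6.8 per the thread).
    """
    if driver_delay_us:
        return driver_delay_us  # the driver's explicit request wins
    latency_us = transition_latency_ns // NSEC_PER_USEC
    if latency_us:
        return min(latency_us * LATENCY_MULTIPLIER, cap_us)
    return LATENCY_MULTIPLIER  # latency reported as 0


print(default_transition_delay_us(50_000))                         # M1, 6.8
print(default_transition_delay_us(50_000, cap_us=2_000))           # M1, 2ms cap
print(default_transition_delay_us(0))                              # AMD Ryzen
print(default_transition_delay_us(20_000, driver_delay_us=5_000))  # Intel i5
```

Using cpuinfo_transition_latency as is would correspond to dropping the
multiplication in this model.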

> 
> > And does the kernel/driver contract need to cater for errors in driver's
> > ability to serve the request? Can our request silently be ignored by the
> > hardware?
> 
> cpufreq core maintains its state machine and the failures are used to inform the
> user and / or stop DVFS. It is useful for a clean approach, not sure what we
> will get / miss by ignoring the errors.

Ah, I am not requesting to ignore the error. I am worried it can be ignored
silently. Looks like this is not the case.

> 
> > Not necessarily due to rate limit being ignored, but for any other
> > reason? It is important for Linux to know what frequency we're actually running
> > at.
> 
> One is that we report to userspace two frequencies:
> - scaling_cur_freq: The frequency that the software thinks the hardware runs at
>   (last requested freq i.e.)
> 
> - cpuinfo_cur_freq: The real frequency hardware is running at. Can be calculated
>   using counters, etc.
> 
> And there will be tools which are using them. So these are required.

I was just trying to check whether we are more likely to encounter errors with
more frequent requests, and whether we'd fail safe if we do, as knowing the
current frequency is important for utilization invariance and EAS in general.
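
A quick way to eyeball the two values from userspace is a small script like
the one below. It is only a sketch: the helper names `read_khz` and
`compare_freqs` are hypothetical, and it assumes the usual sysfs layout under
/sys/devices/system/cpu/cpufreq. cpuinfo_cur_freq may be missing or readable
only by root on some systems, in which case None is reported.

```python
from pathlib import Path


def read_khz(path):
    """Return the integer value stored in a sysfs file, or None if unreadable."""
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None


def compare_freqs(root="/sys/devices/system/cpu/cpufreq"):
    """Return (policy, scaling_cur_freq, cpuinfo_cur_freq) for each policy."""
    rows = []
    for policy in sorted(Path(root).glob("policy*")):
        requested = read_khz(policy / "scaling_cur_freq")
        actual = read_khz(policy / "cpuinfo_cur_freq")
        rows.append((policy.name, requested, actual))
    return rows


if __name__ == "__main__":
    for name, requested, actual in compare_freqs():
        print(name, "requested:", requested, "actual:", actual)
```

A persistent mismatch between the two columns would be a hint that requests
are not being honoured as expected.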

I'll look more at cpufreq core paths to verify. If you have big concerns please
let me know. I'm curious to explore how we can make things more responsive,
and a heads-up about the pitfalls would be much appreciated.

Thanks for the answers!


Cheers

--
Qais Yousef

> 
> > Some hardware gives the ability to read a counter to discover that. But
> > a lot of systems rely on the fact that the request we sent is actually
> > honoured. But failures can mean things like EAS will misbehave.
> 
> -- 
> viresh