Message-ID: <005701dac1a4$6ae1c830$40a55890$@telus.net>
Date: Tue, 18 Jun 2024 10:24:46 -0700
From: "Doug Smythies" <dsmythies@...us.net>
To: "'Christian Loehle'" <christian.loehle@....com>
Cc: <rafael@...nel.org>,
	<vincent.guittot@...aro.org>,
	<qyousef@...alina.io>,
	<peterz@...radead.org>,
	<daniel.lezcano@...aro.org>,
	<ulf.hansson@...aro.org>,
	<anna-maria@...utronix.de>,
	<kajetan.puchalski@....com>,
	<lukasz.luba@....com>,
	<dietmar.eggemann@....com>,
	<linux-pm@...r.kernel.org>,
	<linux-kernel@...r.kernel.org>,
	"Doug Smythies" <dsmythies@...us.net>
Subject: RE: [PATCHv2 0/3] cpuidle: teo: Fixing utilization and intercept logic

Hi Christian,

Thank you for your reply.

On 2024.06.18 03:54 Christian Loehle wrote:
> On Sun, Jun 16, 2024 at 05:20:43PM -0700, Doug Smythies wrote:
>> On 2024.06.11 04:24 Christian Loehle wrote:
>>
>> ...
>> > Happy for anyone to take a look and test as well.
>> ...
>>
>> I tested the patch set.
>> I do a set of tests adopted over some years now.
>> Readers may recall that some of the tests search over a wide range of
>> operating conditions looking for areas to focus on in more detail.
>> One interesting observation is that everything seems to run much slower
>> than the last time I did this, last August, Kernel 6.5-rc4.
>>
> 
> Thank you very much Doug, that is helpful indeed!
> 
>> Test system:
>> Processor: Intel(R) Core(TM) i5-10600K CPU @ 4.10GHz (6 cores, 2 threads per core, 12 CPUs)
>> CPU Frequency scaling driver: intel_pstate
>> HWP (HardWare Pstate) control: Disabled
>> CPU frequency scaling governor: Performance
>> Idle states: 4: name : description:
>>    state0/name:POLL		desc:CPUIDLE CORE POLL IDLE
>>    state1/name:C1_ACPI		desc:ACPI FFH MWAIT 0x0
>>    state2/name:C2_ACPI		desc:ACPI FFH MWAIT 0x30
>>    state3/name:C3_ACPI		desc:ACPI FFH MWAIT 0x60
> 
> What are target residencies and exit latencies?

Of course. Here:

/sys/devices/system/cpu/cpu1/cpuidle/state0/residency:0
/sys/devices/system/cpu/cpu1/cpuidle/state1/residency:1
/sys/devices/system/cpu/cpu1/cpuidle/state2/residency:360
/sys/devices/system/cpu/cpu1/cpuidle/state3/residency:3102

/sys/devices/system/cpu/cpu1/cpuidle/state0/latency:0
/sys/devices/system/cpu/cpu1/cpuidle/state1/latency:1
/sys/devices/system/cpu/cpu1/cpuidle/state2/latency:120
/sys/devices/system/cpu/cpu1/cpuidle/state3/latency:1034
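
For anyone wanting to collect the same numbers on their own system, here is a minimal Python sketch. It assumes the standard cpuidle sysfs layout shown above, where the "latency" and "residency" files hold values in microseconds. The function name read_idle_states and its parameters are illustrative only, not part of any existing tool.

```python
from pathlib import Path


def read_idle_states(cpu=1, root="/sys/devices/system/cpu"):
    """Return {state_index: {...}} with name, exit latency and target residency.

    Assumes the usual sysfs layout: <root>/cpu<N>/cpuidle/state<M>/{name,latency,residency}.
    Both numeric values are reported by the kernel in microseconds.
    """
    base = Path(root) / f"cpu{cpu}" / "cpuidle"
    states = {}
    for state_dir in base.glob("state[0-9]*"):
        index = int(state_dir.name[len("state"):])
        states[index] = {
            "name": (state_dir / "name").read_text().strip(),
            "latency_us": int((state_dir / "latency").read_text()),
            "residency_us": int((state_dir / "residency").read_text()),
        }
    return states
```

Calling read_idle_states(1) on the test system described here should give the same four states listed above, keyed 0 to 3.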
 
>> Idle driver: intel_idle
>> Idle governor: as per individual test
>> Kernel: 6.10-rc2 and with V1 and V2 patch sets (1000 Hz tick rate)
>> Legend:
>>    teo: unmodified 6.10-rc2
>>    menu:
>>    ladder:
>>    cl: Kernel 6.10-rc2 + Christian Loehle patch set V1
>>    clv2: Kernel 6.10-rc2 + Christian Loehle patch set V2
>> System is extremely idle, other than the test work.
> 
> If you don't mind spinning up another one, I'd be very curious about
> results from just the Util-awareness revert (i.e. v2 1/3).
> If not I'll try to reproduce your tests.

I will, but not today.
I have never been a fan of Util-awareness.

>> Test 1: 2 core ping pong sweep:
>>
>> Pass a token between 2 CPUs on 2 different cores.
>> Do a variable amount of work at each stop.
> 
> Hard to interpret the results here, as state residencies would be the
> most useful one, but from the results I assume that residencies are
> calculated over all possible CPUs, so 4/6 CPUs are pretty much idle
> the entire time, resulting in >75% state3 residency overall.

That would be 10 of 12 CPUs idle, and 4 of 6 cores.
But fair enough, the residency stats are being dominated by the idle CPUs.
I usually look at the usage in conjunction with the residency percentages.
At 10 minutes (20 second sample period):
teo entered idle state 3 517 times; clv2 entered it 1,979,541 times
At 20 minutes:
teo entered idle state 3 525 times; clv2 entered it 3,011,263 times
Anyway, I could hack something to just use data from the 2 CPUs involved.
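
A rough sketch of what such a hack might look like, in Python. It is not the tooling used for the results in this thread. It assumes two snapshots of the per-state "usage" and "time" sysfs counters (time in microseconds) have already been taken for every CPU, and that the CPUs running the test are known. The function names and the snapshot layout are made up for illustration.

```python
def per_state_deltas(before, after, cpus):
    """Sum per-state counter deltas over the selected CPUs only.

    before/after map cpu -> state -> (usage_count, time_us), i.e. the values
    of the sysfs "usage" and "time" files at the start and end of a sample.
    Returns state -> (entries, time_us) accumulated over `cpus`.
    """
    totals = {}
    for cpu in cpus:
        for state, (usage_after, time_after) in after[cpu].items():
            usage_before, time_before = before[cpu][state]
            entries, time_us = totals.get(state, (0, 0))
            totals[state] = (entries + usage_after - usage_before,
                             time_us + time_after - time_before)
    return totals


def residency_percent(totals, cpus, interval_us):
    """Residency per state as a percentage of the selected CPUs' total time."""
    available_us = interval_us * len(cpus)
    return {state: 100.0 * time_us / available_us
            for state, (_, time_us) in totals.items()}
```

With only the two test CPUs in `cpus`, the ten otherwise idle CPUs no longer dominate the percentages.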

>> Purpose: To utilize the shallowest idle states
>> and observe the transition from using more of 1
>> idle state to another.
>>
>> Results relative to teo (negative is better):
>>		menu		ladder		clv2		cl
>> average	-2.09%		11.11%		2.88%		1.81%
>> max		10.63%		33.83%		9.45%		10.13%
>> min		-11.58%		6.25%		-3.61%		-3.34%
>>
>> While there are a few operating conditions where clv2 performs better than teo, overall it is worse.
>>
>> Further details:
>> http://smythies.com/~doug/linux/idle/teo-util3/ping-sweep/2-1/2-core-pp-relative.png
>> http://smythies.com/~doug/linux/idle/teo-util3/ping-sweep/2-1/2-core-pp-data.png
>> http://smythies.com/~doug/linux/idle/teo-util3/ping-sweep/2-1/perf/
>>
>> Test 2: 6 core ping pong sweep:
>>
>> Pass a token between 6 CPUs on 6 different cores.
>> Do a variable amount of work at each stop.
>>
> 
> My first guess would've been that this is the perfect workload for the
> very low utilization threshold, but even teo has >40% state3 residency
> consistently here.

There are still 6 idle CPUs.
I'll try a 12-CPU sweep test that uses each core twice,
but I think I settled on 6 because it focused on what I wanted for results.

>> Purpose: To utilize the midrange idle states
>> and observe the transitions between use of
>> idle states.
>>
>> Note: This test has uncertainty in an area where the performance is bi-stable for all idle governors,
>> transitioning between much less power and slower performance and much more power and higher performance.
>> On either side of this area, the differences between all idle governors are negligible.
>> Only data from before this area (from results 1 to 95) was included in the below results.
> 
> I see and agree with your interpretation. Difference in power between
> all tested seems to be negligible during that window. Interestingly
> the residencies of idle states seem to be very different, like ladder
> being mostly in deepest state3. Maybe total package power is too coarse
> to show the differences for this test.
> 
>> Results relative to teo (negative is better):
>>		menu	ladder	cl	clv2
>> average	0.16%	4.32%	2.54%	2.64%
>> max		0.92%	14.32%	8.78%	8.50%
>> min		-0.44%	0.27%	0.09%	0.05%
>>
>> One large clv2 difference seems to be excessive use of the deepest idle state,
>> with corresponding 100% hit rate on the "Idle State 3 was too deep" metric.
>> Example (20 second sample time):
>>
>> teo: Idle state 3 entries: 600, 74.33% were too deep or 451. Processor power was 38.0 watts.
>> clv2: Idle state 3 entries: 4,375,243, 100.00% were too deep or 4,375,243. Processor power was 40.6 watts.
>> clv2 loop times were about 8% worse than teo.
> 
> Some of the idle state 3 residencies seem to be >100% at the end here,
> not sure what's up with that.
 
The test is over and the system is completely idle.
And yes, there are 4 calculations that come out > 100%, the worst being 100.71%,
with a total sum over all idle states of 100.79%.
I can look into it if you want but have never expected the numbers to be that accurate.

>> Further details:
>> http://smythies.com/~doug/linux/idle/teo-util3/ping-sweep/6-1/6-core-pp-data-detail-a.png
>> http://smythies.com/~doug/linux/idle/teo-util3/ping-sweep/6-1/6-core-pp-data-detail-b.png
>> http://smythies.com/~doug/linux/idle/teo-util3/ping-sweep/6-1/6-core-pp-data.png
>> http://smythies.com/~doug/linux/idle/teo-util3/ping-sweep/6-1/perf/
>>
>> Test 3: sleeping ebizzy - 128 threads.
>>
>> Purpose: This test has given interesting results in the past.
>> The test varies the sleep interval between record lookups.
>> The result is varying usage of idle states.
>>
>> Results: relative to teo (negative is better):
>>		menu	clv2	ladder	cl
>> average	0.06%	0.38%	0.81%	0.35%
>> max		2.53%	3.20%	5.00%	2.87%
>> min		-2.13%	-1.66%	-3.30%	-2.13%
>>
>> No strong conclusion here, from just the data.
>> However, clv2 seems to use a bit more processor power, on average.
> 
> Not sure about that, from the residencies ladder and teo should be
> decisive losers in terms of power. While later in the test teo seems
> to be getting worse in power it doesn't quite reflect the difference
> in states.
> E.g. clv2 finishing with 65% state2 residency while teo has 40%, but
> I'll try to get per-CPU power measurements on this one.
> Interestingly ladder is a clear winner if anything, if that is reliable
> as a result that could indicate a too aggressive tick stop from the
> other governors, but cl isn't that much better than clv2 here, even
> though it stops the tick less aggressively.

I agree with what you are saying.
It is a shorter test at only 25 minutes.
It might be worth trying the test again with more strict attention to
stabilizing the system thermally before each test.
The processor power will vary by a few watts for the exact same load
as a function of processor package temperature and coolant (my system is
water cooled) temperature and can take 20 to 30 minutes to settle.

Reference:
http://smythies.com/~doug/linux/idle/teo-util3/temperature/thermal-stabalization-time.png

>>
>> Further details:
> 
> Link is missing, but I found
> http://smythies.com/~doug/linux/idle/teo-util3/ebizzy/
> from browsing your page.

Yes, I accidentally hit "Send" on my original email before it was actually finished.
But, then I was tired and thought "close enough".

>> Test 4: adrestia wakeup latency tests. 500 threads.
>>
>> Purpose: The test was reported in 2023.09 by the kernel test robot. It looked
>> interesting and gave interesting results, so I added it to the tests I run.
> 
> http://smythies.com/~doug/linux/idle/teo-util3/adrestia/periodic/perf/
> So interestingly we can see, what I would call, the misbehavior of teo
> here, with teo skipping state2 and state3 entirely. You would expect
> a power regression here, but it doesn't translate into package power
> anyway.
> 
>>
>> Results:
>> teo:wakeup cost (periodic, 20us): 3130nSec reference
>> clv2:wakeup cost (periodic, 20us): 3179nSec +1.57%
>> cl:wakeup cost (periodic, 20us): 3206nSec +2.43%
>> menu:wakeup cost (periodic, 20us): 2933nSec -6.29%
>> ladder:wakeup cost (periodic, 20us): 3530nSec +12.78%
> 
> Is this measured as wakeup latency?
> I can't find much info about the test setup here, do you mind sharing
> something on it?

I admit to being vague on this one, and I'll need some time to review.
The notes I left for myself last September are here:
http://smythies.com/~doug/linux/idle/teo-util2/adrestia/README.txt
 
>> No strong conclusion here, from just the data.
>> However, clv2 seems to use a bit more processor power, on average.
>> teo: 69.72 watts
>> clv2: 72.91 watts +4.6%
>> Note 1: The first 5 minutes of power data from the test were discarded to allow for thermal stabilization.

which might not have been long enough, see the thermal notes above.
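
For reference, the warm-up discard and the "relative to teo" percentages are simple arithmetic. A Python sketch follows, assuming power is logged as a plain list of evenly spaced samples. The function names and the sample-period parameter are illustrative and not taken from the actual test scripts.

```python
import math


def mean_after_warmup(samples, sample_period_s, warmup_s):
    """Average the samples left after discarding the warm-up period.

    Rounds the number of discarded samples up, so that at least
    `warmup_s` seconds of data are dropped.
    """
    skip = math.ceil(warmup_s / sample_period_s)
    kept = samples[skip:]
    if not kept:
        raise ValueError("warm-up period covers all samples")
    return sum(kept) / len(kept)


def relative_percent(value, reference):
    """Difference of `value` from `reference`, in percent of `reference`."""
    return 100.0 * (value - reference) / reference
```

As a check against the numbers above, relative_percent(3179, 3130) is about +1.57 and relative_percent(72.91, 69.72) is about +4.6. A longer settling time only means passing a larger warmup_s.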

>> Note 2: More work is required for this test, because the teo one actually took longer to execute,
>> due to more outlier results than the other tests.

>> Several other tests were run but are not written up herein.
>> 
> Because results are on par for all? Or inconclusive / not reproducible?

Yes, because nothing of significance was observed or the test was more or less a repeat of an already covered test.
Initially, I had a mistake in my baseline teo tests, and a couple of the not written up tests have still not been re-run with the
proper baseline.

> Some final words:
> I was hoping to get rid of Util-awareness with the fixed intercept logic
> and my test showed that this isn't unreasonable.
> Here we do see a case where there is some regression vs Util-awareness.
> The intercept logic is currently decaying quite aggressively, maybe
> that could be tuned to improve teo behavior.
> 
> Kind Regards,
> Christian

... Doug