Date: Sun, 16 Jun 2024 17:20:43 -0700
From: "Doug Smythies" <dsmythies@...us.net>
To: "'Christian Loehle'" <christian.loehle@....com>,
	<rafael@...nel.org>
Cc: <vincent.guittot@...aro.org>,
	<qyousef@...alina.io>,
	<peterz@...radead.org>,
	<daniel.lezcano@...aro.org>,
	<ulf.hansson@...aro.org>,
	<anna-maria@...utronix.de>,
	<kajetan.puchalski@....com>,
	<lukasz.luba@....com>,
	<dietmar.eggemann@....com>,
	<linux-pm@...r.kernel.org>,
	<linux-kernel@...r.kernel.org>,
	"Doug Smythies" <dsmythies@...us.net>
Subject: RE: [PATCHv2 0/3] cpuidle: teo: Fixing utilization and intercept logic

On 2024.06.11 04:24 Christian Loehle wrote:

...
> Happy for anyone to take a look and test as well.
...

I tested the patch set.
I ran my usual set of tests, developed and refined over some years now.
Readers may recall that some of the tests sweep over a wide range of operating conditions, looking for areas to focus on in more detail.
One interesting observation is that everything seems to run much slower than it did the last time I ran these tests, last August on kernel 6.5-rc4.

Test system:
Processor: Intel(R) Core(TM) i5-10600K CPU @ 4.10GHz (6 cores, 2 threads per core, 12 CPUs)
CPU frequency scaling driver: intel_pstate
HWP (HardWare Pstate) control: Disabled
CPU frequency scaling governor: Performance
Idle states: 4 (name: description):
   state0/name:POLL		desc:CPUIDLE CORE POLL IDLE
   state1/name:C1_ACPI		desc:ACPI FFH MWAIT 0x0
   state2/name:C2_ACPI		desc:ACPI FFH MWAIT 0x30
   state3/name:C3_ACPI		desc:ACPI FFH MWAIT 0x60
Idle driver: intel_idle
Idle governor: as per individual test
Kernel: 6.10-rc2 and with V1 and V2 patch sets (1000 Hz tick rate)
Legend:
   teo: unmodified 6.10-rc2
   menu: unmodified 6.10-rc2, menu idle governor
   ladder: unmodified 6.10-rc2, ladder idle governor
   cl: Kernel 6.10-rc2 + Christian Loehle patch set V1
   clv2: Kernel 6.10-rc2 + Christian Loehle patch set V2
System is extremely idle, other than the test work.
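Since the idle governor is selected per individual test, a note on the mechanics: on a standard Linux system the governor can be switched at run time through sysfs. A minimal helper (a sketch; the sysfs path is the standard cpuidle location, the function name is mine, and the root is parameterized so it can be tried against a fake tree):

```python
# Minimal helper to set the cpuidle governor via sysfs and read it back.
# Must run as root on a real system; available names are listed in
# the sibling file "available_governors".
import os

def set_idle_governor(name, root="/sys/devices/system/cpu/cpuidle"):
    """Write `name` to current_governor and return what the kernel reports."""
    path = os.path.join(root, "current_governor")
    with open(path, "w") as f:
        f.write(name + "\n")
    with open(path) as f:
        return f.read().strip()
```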

Test 1: 2 core ping pong sweep:

Pass a token between 2 CPUs on 2 different cores.
Do a variable amount of work at each stop.

Purpose: To exercise the shallowest idle states
and observe the transition from favoring one
idle state to another.
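The ping-pong idea can be sketched as follows: two processes pinned to different CPUs pass a one-byte token back and forth through a pair of pipes, doing a variable amount of busy work at each stop. All names and the CPU choice here are my own illustration of the method, not the actual test code:

```python
# Sketch of a 2-CPU ping-pong: the shorter the work per stop, the more
# the wakeups land in the shallow idle states.
import os, time

def pin(cpu):
    # Pinning may fail in restricted environments; ignore that here.
    try:
        os.sched_setaffinity(0, {cpu})
    except (OSError, AttributeError):
        pass

def ping_pong(rounds, work=1000, cpus=(0, 2)):
    """Return seconds taken for `rounds` token round trips."""
    a_r, a_w = os.pipe()   # parent -> child
    b_r, b_w = os.pipe()   # child -> parent
    pid = os.fork()
    if pid == 0:           # child: echo the token back, with some work
        pin(cpus[1])
        for _ in range(rounds):
            os.read(a_r, 1)
            x = 0
            for i in range(work):   # the variable work at each stop
                x += i
            os.write(b_w, b"t")
        os._exit(0)
    pin(cpus[0])
    t0 = time.perf_counter()
    for _ in range(rounds):
        os.write(a_w, b"t")
        os.read(b_r, 1)
    elapsed = time.perf_counter() - t0
    os.waitpid(pid, 0)
    return elapsed
```

The choice cpus=(0, 2) assumes hyperthread siblings are numbered apart, so the two ends land on different physical cores.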

Results relative to teo (negative is better):
		menu		ladder		clv2		cl
average		-2.09%		11.11%		2.88%		1.81%
max		10.63%		33.83%		9.45%		10.13%
min		-11.58%		6.25%		-3.61%		-3.34%

While there are a few operating conditions where clv2 performs better than teo, overall it is worse.
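For clarity, the "relative to teo" figures are per-operating-point percentage differences in loop time, summarized over the sweep. The summary rows are computed along these lines (a sketch; the sample numbers below are invented purely for illustration and are not measurements):

```python
# "Relative to teo": percentage difference per operating point,
# then average/max/min over all points in the sweep.

def relative_to_ref(ref, other):
    """Per-point percentage differences of `other` vs `ref` (negative is better)."""
    return [(o - r) / r * 100.0 for r, o in zip(ref, other)]

def summarize(deltas):
    return {"average": sum(deltas) / len(deltas),
            "max": max(deltas),
            "min": min(deltas)}

teo_times  = [1.00, 1.10, 1.20]   # illustrative loop times (seconds)
menu_times = [0.98, 1.12, 1.18]
print(summarize(relative_to_ref(teo_times, menu_times)))
```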

Further details:
http://smythies.com/~doug/linux/idle/teo-util3/ping-sweep/2-1/2-core-pp-relative.png
http://smythies.com/~doug/linux/idle/teo-util3/ping-sweep/2-1/2-core-pp-data.png
http://smythies.com/~doug/linux/idle/teo-util3/ping-sweep/2-1/perf/

Test 2: 6 core ping pong sweep:

Pass a token between 6 CPUs on 6 different cores.
Do a variable amount of work at each stop.

Purpose: To utilize the midrange idle states
and observe the transitions between use of
idle states.

Note: This test has an area of uncertainty where performance is bi-stable for all idle governors,
transitioning between much less power with slower performance and much more power with higher performance.
On either side of this area, the differences between the idle governors are negligible.
Only data from before this area (results 1 to 95) was included in the results below.

Results relative to teo (negative is better):
		menu	ladder	cl	clv2
average		0.16%	4.32%	2.54%	2.64%
max		0.92%	14.32%	8.78%	8.50%
min		-0.44%	0.27%	0.09%	0.05%

One large clv2 difference seems to be excessive use of the deepest idle state,
with a corresponding 100% hit rate on the "idle state 3 was too deep" metric.
Example (20 second sample time):

teo: Idle state 3 entries: 600, of which 451 (74.33%) were too deep. Processor power was 38.0 watts.
clv2: Idle state 3 entries: 4,375,243, of which 4,375,243 (100.00%) were too deep. Processor power was 40.6 watts.
clv2 loop times were about 8% worse than teo.
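For anyone wanting to reproduce the "was too deep" numbers: the kernel exposes a per-state "above" counter next to "usage" under each cpuidle state in sysfs, counting wakeups where the chosen state was certainly deeper than needed. A minimal reader (a sketch; the sysfs layout is the standard one, the function name is mine, and the root is parameterized so it can be tried against a fake tree):

```python
# Sum "usage" and "above" for one idle state across all CPUs.
# The counters are cumulative since boot, so in practice you take
# deltas over the sample interval (20 seconds in the example above).
import glob, os

def too_deep_stats(state=3, root="/sys/devices/system/cpu"):
    """Return (entries, too_deep, percent) for the given idle state."""
    entries = too_deep = 0
    pattern = os.path.join(root, "cpu[0-9]*", "cpuidle", "state%d" % state)
    for d in glob.glob(pattern):
        with open(os.path.join(d, "usage")) as f:
            entries += int(f.read())
        with open(os.path.join(d, "above")) as f:
            too_deep += int(f.read())
    pct = 100.0 * too_deep / entries if entries else 0.0
    return entries, too_deep, pct
```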

Further details:
http://smythies.com/~doug/linux/idle/teo-util3/ping-sweep/6-1/6-core-pp-data-detail-a.png
http://smythies.com/~doug/linux/idle/teo-util3/ping-sweep/6-1/6-core-pp-data-detail-b.png
http://smythies.com/~doug/linux/idle/teo-util3/ping-sweep/6-1/6-core-pp-data.png
http://smythies.com/~doug/linux/idle/teo-util3/ping-sweep/6-1/perf/

Test 3: sleeping ebizzy - 128 threads.

Purpose: This test has given interesting results in the past.
The test varies the sleep interval between record lookups.
The result is varying usage of idle states.

Results: relative to teo (negative is better):
		menu	clv2	ladder	cl
average		0.06%	0.38%	0.81%	0.35%
max		2.53%	3.20%	5.00%	2.87%
min		-2.13%	-1.66%	-3.30%	-2.13%

No strong conclusion here, from just the data.
However, clv2 seems to use a bit more processor power, on average.

Further details:

Test 4: adrestia wakeup latency tests. 500 threads.

Purpose: This test was reported in 2023.09 by the kernel test robot; it looked
interesting and gave interesting results, so I added it to the tests I run.

Results:
teo:wakeup cost (periodic, 20us): 3130nSec reference
clv2:wakeup cost (periodic, 20us): 3179nSec +1.57%
cl:wakeup cost (periodic, 20us): 3206nSec +2.43%
menu:wakeup cost (periodic, 20us): 2933nSec -6.29%
ladder:wakeup cost (periodic, 20us): 3530nSec +12.78%

No strong conclusion here, from just the data.
However, clv2 seems to use a bit more processor power, on average.
teo: 69.72 watts
clv2: 72.91 watts +4.6%
Note 1: The first 5 minutes of the test powers were discarded to allow for thermal stabilization.
Note 2: More work is required on this test, because the teo run actually took longer to execute, due to more outlier results than in the other runs.

Several other tests were run, but they are not written up herein.




