Message-ID: <b1b1e173-5374-4463-a5e1-d1b8c1976fc7@arm.com>
Date: Thu, 10 Oct 2024 19:32:57 +0100
From: Christian Loehle <christian.loehle@....com>
To: Anjali K <anjalik@...ux.ibm.com>, Qais Yousef <qyousef@...alina.io>,
"Rafael J. Wysocki" <rafael@...nel.org>,
Viresh Kumar <viresh.kumar@...aro.org>, Ingo Molnar <mingo@...nel.org>,
Peter Zijlstra <peterz@...radead.org>,
Vincent Guittot <vincent.guittot@...aro.org>,
Juri Lelli <juri.lelli@...hat.com>
Cc: Steven Rostedt <rostedt@...dmis.org>,
Dietmar Eggemann <dietmar.eggemann@....com>, Ben Segall
<bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>,
Valentin Schneider <vschneid@...hat.com>, Hongyan Xia
<hongyan.xia2@....com>, John Stultz <jstultz@...gle.com>,
linux-pm@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH v7] sched: Consolidate cpufreq updates
On 10/8/24 10:56, Christian Loehle wrote:
> On 10/7/24 18:20, Anjali K wrote:
>> Hi, I tested this patch with microbenchmarks to see if it causes any regressions on bare-metal Power9 systems.
>> The test system is a 2-NUMA-node, 128-CPU PowerNV Power9 system with the conservative governor enabled.
>> I took the baseline as the 6.10.0-rc1 tip sched/core kernel.
>> No regressions were found.
>>
>> +------------------------------------------------------+--------------------+----------+
>> | Benchmark | Baseline | Baseline |
>> | | (6.10.0-rc1 tip | + patch |
>> | | sched/core) | |
>> +------------------------------------------------------+--------------------+----------+
>> |Hackbench run duration (sec) | 1 | 1.01 |
>> |Lmbench simple fstat (usec) | 1 | 0.99 |
>> |Lmbench simple open/close (usec) | 1 | 1.02 |
>> |Lmbench simple read (usec) | 1 | 1 |
>> |Lmbench simple stat (usec) | 1 | 1.01 |
>> |Lmbench simple syscall (usec) | 1 | 1.01 |
>> |Lmbench simple write (usec) | 1 | 1 |
>> |stressng (bogo ops) | 1 | 0.94 |
>> |Unixbench execl throughput (lps) | 1 | 0.97 |
>> |Unixbench Pipebased Context Switching throughput (lps)| 1 | 0.94 |
>> |Unixbench Process Creation (lps) | 1 | 1 |
>> |Unixbench Shell Scripts (1 concurrent) (lpm) | 1 | 1 |
>> |Unixbench Shell Scripts (8 concurrent) (lpm) | 1 | 1.01 |
>> +------------------------------------------------------+--------------------+----------+
>>
>> Thank you,
>> Anjali K
>>
>
> The default CPUFREQ_DBS_MIN_SAMPLING_INTERVAL still enforces 2 ticks between
> cpufreq updates on conservative/ondemand.
> What is your sampling_rate setting? What's your HZ?
> Interestingly, the context-switch-heavy benchmarks still show -6%, don't they?
> Do you mind trying schedutil with a reasonable rate_limit_us, too?
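For reference, a quick way to check those, assuming the standard sysfs
layout, a global (not per-policy) governor tunables instance, and a
kernel config installed under /boot:

  cat /sys/devices/system/cpu/cpufreq/conservative/sampling_rate  # usec
  grep 'CONFIG_HZ=' /boot/config-$(uname -r)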
After playing with this a bit more I can see a ~6% regression on
workloads like hackbench too.
Around 80% of that is due to the update in check_preempt_wakeup_fair(),
the rest due to the one at context switch. Overall the number of
cpufreq_update_util() calls for hackbench -pTl 20000 increased by a
factor of 20-25x; removing the one in check_preempt_wakeup_fair() brings
this down to 10x. For other workloads the number of
cpufreq_update_util() calls is in the same ballpark as mainline.
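In case anyone wants to reproduce the counting: cpufreq_update_util()
itself is a static inline, so one option is to count the sugov callbacks
with ftrace's function profiler instead. A rough sketch, assuming
CONFIG_FUNCTION_PROFILER and that these function names are traceable in
your tree:

  cd /sys/kernel/debug/tracing    # or /sys/kernel/tracing
  echo 'sugov_update_single_freq sugov_update_shared' > set_ftrace_filter
  echo 1 > function_profile_enabled
  hackbench -pTl 20000
  echo 0 > function_profile_enabled
  grep -h sugov trace_stat/function*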
I also looked into the forced_update mechanism, because it still
bugged me, and I have to say I'd prefer removing rate_limit_us,
last_freq_update_time and freq_update_delay_ns altogether. The number
of updates blocked by the rate limit was already pretty low and has
become negligible now for most workloads/platforms.
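The knob in question is the per-policy schedutil tunable, e.g. for
policy0, assuming it runs schedutil:

  cat /sys/devices/system/cpu/cpufreq/policy0/schedutil/rate_limit_us
  echo 2000 > /sys/devices/system/cpu/cpufreq/policy0/schedutil/rate_limit_us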
commit 37c6dccd6837 ("cpufreq: Remove LATENCY_MULTIPLIER") brought the
default rate_limit_us down into the microsecond range, but even with
rate_limit_us==2000 I get the following on an rk3588 ([LLLL][bb][bb]), 250 HZ:
mainline (columns: update_util calls / dropped by rate_limit_us / actual freq changes):
60s idle:
932 / 48 / 12
fio --name=test --rw=randread --bs=4k --runtime=30 --time_based --filename=/dev/nullb0 --thinktime=1ms:
40274 / 129 / 36
hackbench -pTl 20000:
319331 / 523 / 41

with $SUBJECT and rate_limit_us==93 (same columns):
60s idle:
1031 / 5 / 11
fio --name=test --rw=randread --bs=4k --runtime=30 --time_based --filename=/dev/nullb0 --thinktime=1ms:
40297 / 17 / 32
hackbench -pTl 20000:
7252343 / 600 / 60

just to mention a few.
This obviously depends on the OPPs, workload, and HZ though.
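For completeness, the OPP table and current frequency per policy can be
inspected via sysfs, e.g. for policy0 (not every cpufreq driver exposes
scaling_available_frequencies):

  cat /sys/devices/system/cpu/cpufreq/policy0/scaling_available_frequencies
  cat /sys/devices/system/cpu/cpufreq/policy0/scaling_cur_freq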
Overall I find the update (mostly) coming from the perf-domain
(and thus the sugov update_lock also mostly being contended there) quite
appealing, but given that we update more often in terms of frequency and
arguably have more code locations calling the update (reintroduction
of the update at enqueue), what exactly are we still consolidating here?
Regards,
Christian