[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <d9b202d75c484ef3b636ce181d1c04ca@MSX-L104.msx.ad.zih.tu-dresden.de>
Date: Tue, 20 Mar 2018 11:01:48 +0100
From: Thomas Ilsche <thomas.ilsche@...dresden.de>
To: "Rafael J. Wysocki" <rafael@...nel.org>,
Peter Zijlstra <peterz@...radead.org>
CC: Doug Smythies <dsmythies@...us.net>,
Linux PM <linux-pm@...r.kernel.org>,
Frederic Weisbecker <fweisbec@...il.com>,
Thomas Gleixner <tglx@...utronix.de>,
Paul McKenney <paulmck@...ux.vnet.ibm.com>,
"Rik van Riel" <riel@...riel.com>,
Aubrey Li <aubrey.li@...ux.intel.com>,
"Mike Galbraith" <mgalbraith@...e.de>,
LKML <linux-kernel@...r.kernel.org>
Subject: Re: [RFT][PATCH v5 0/7] sched/cpuidle: Idle loop rework
On 2018-03-18 17:15, Rafael J. Wysocki wrote:
>> Doug, Thomas,
>>
>> Thank you both for the reports, much appreciated!
>>
>> Below is a drop-in v6 replacement for patch [4/7].
>>
>> With this new patch applied instead of the [4/7] the behavior should be much
>> more in line with the v4 behavior, so please try it if you can and let me know
>> if that really is the case on your systems.
>>
>> Patches [5-7/7] from the original v5 apply on top of it right away for me,
>> but I've also created a git branch you can use to pull all of the series
>> with the below included:
>>
>> git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git \
>> idle-loop
Thanks for the git repo, that helps alot. I have tested v6 on a
Skylake desktop and server system as well as a Haswell server system.
The odd idle behavior of v5 is gone.
Some of the other findings may be obsolete by the upcoming respin,
I will retest.
Our originally observed Powernightmare pattern is effectively
prevented in both idle and with a synthetic trigger. However, I can
reproduce simple workloads under which the revised menu governor
wastes energy by going into *deeper* C-states than advisable.
Consider the Skylake server system which has residencies in C1E of
20 us and C6 of 800 us. I use a small while(1) {usleep(300);}
unsynchronized pinned to each core. While this is an artificial
case, it is a very innocent one - easy to predict and regular. Between
vanilla 4.16.0-rc5 and idle-loop/v6, the power consumption increases
from 149.7 W to 158.1 W. On 4.16.0-rc5, the cores sleep almost
entirely in C1E. With the patches applied, the cores spend ~75% of
their sleep time in C6, ~25% in C1E. The average time/usage for C1E is
also lower with v6 at ~350 us rather than the ~550 us in C6 (and in
C1E with the baseline). Generally the new menu governor seems to chose
C1E if the next timer is an enabled sched timer - which occasionally
interrupts the sleep-interval into two C1E sleeps rather than one C6.
Manually disabling C6, reduces power consumption back to 149.5 W.
This is far from what I expected, I did not yet figure out why the
patched menu governor decides to go to C6 under that workload. I
have tested this previously with v4 and saw this behavior even
without path "7/7".
The results from Haswell-EP and Skylake desktop are similar.
The tests are with a 1000 Hz kernel because I wanted to amplify
effects that happening when C-state residencies and tick timers are
closer together. But I suspect the results will be similar with
300 Hz as the impact from the sched tick interruption seems to be
minor compared to the odd C-state selection.
Some very raw illustrations, all from Skylake SP (2 == C1E, 3 == C6):
power consumption
trigger-10-10 is the synthetic Powernightmare
poller-omp-300 is the parallel usleep(300) loop:
https://wwwpub.zih.tu-dresden.de/~tilsche/powernightmares/v6_skl_sp_power.png
cstate utilization with usleep(300) loop
(as per /sys/.../stateN/time / wall time)
https://wwwpub.zih.tu-dresden.de/~tilsche/powernightmares/v6_skl_sp_poll_300_utilization.png
average time spent in cstates
(as /sys/.../stateN/time / /sys/.../stateN/usage)
https://wwwpub.zih.tu-dresden.de/~tilsche/powernightmares/v6_skl_sp_poll_300_avg_time.png
detailed look:
https://wwwpub.zih.tu-dresden.de/~tilsche/powernightmares/v6_poll_300_skl.png
>>
>> Thanks!
>>
>> ---
>> From: Rafael J. Wysocki <rafael.j.wysocki@...el.com>
>> Subject: [PATCH v6] cpuidle: Return nohz hint from cpuidle_select()
>>
>> Add a new pointer argument to cpuidle_select() and to the ->select
>> cpuidle governor callback to allow a boolean value indicating
>> whether or not the tick should be stopped before entering the
>> selected state to be returned from there.
>>
>> Make the ladder governor ignore that pointer (to preserve its
>> current behavior) and make the menu governor return 'false" through
>> it if:
>> (1) the idle exit latency is constrained at 0,
>> (2) the selected state is a polling one, or
>> (3) the selected state is not deep enough.
>>
>> Since the value returned through the new argument pointer is not
>> used yet, this change is not expected to alter the functionality of
>> the code.
>>
>> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@...el.com>
>> ---
>
> [cut]
>
>> @@ -354,6 +360,7 @@ static int menu_select(struct cpuidle_dr
>> if (latency_req > interactivity_req)
>> latency_req = interactivity_req;
>>
>> + expected_interval = TICK_USEC_HZ;
>> /*
>> * Find the idle state with the lowest power while satisfying
>> * our constraints.
>> @@ -367,17 +374,44 @@ static int menu_select(struct cpuidle_dr
>> continue;
>> if (idx == -1)
>> idx = i; /* first enabled state */
>> - if (s->target_residency > data->predicted_us)
>> + if (s->target_residency > data->predicted_us) {
>> + /*
>> + * Retain the tick if the selected state is shallower
>> + * than the deepest available one with target residency
>> + * within the tick period range.
>> + *
>> + * This allows the tick to be stopped even if the
>> + * predicted idle duration is within the tick period
>> + * range to counter the effect by which the prediction
>> + * may be skewed towards lower values due to the tick
>> + * bias.
>> + */
>> + expected_interval = s->target_residency;
>> break;
>
> BTW, I guess I need to explain the motivation here more thoroughly, so
> here it goes.
>
> The governor predicts idle duration under the assumption that the
> tick will be stopped, so if the result of the prediction is within the tick
> period range and it is not accurate, that needs to be taken into
> account in the governor's statistics. However, if the tick is allowed
> to run every time the governor predicts idle duration within the tick
> period range, the governor will always see that it was "almost
> right" and the correction factor applied by it to improve the
> prediction next time will not be sufficient. For this reason, it
> is better to stop the tick at least sometimes when the governor
> predicts idle duration within the tick period range and the idea
> here is to do that when the selected state is the deepest available
> one with the target residency within the tick period range. This
> allows the opportunity to save more energy to be seized which
> balances the extra overhead of stopping the tick.
>
> HTH
>
--
Dipl. Inf. Thomas Ilsche
Computer Scientist
Highly Adaptive Energy-Efficient Computing
CRC 912 HAEC: http://tu-dresden.de/sfb912
Technische Universität Dresden
Center for Information Services and High Performance Computing (ZIH)
01062 Dresden, Germany
Phone: +49 351 463-42168
Fax: +49 351 463-37773
E-Mail: thomas.ilsche@...dresden.de
Download attachment "smime.p7s" of type "application/pkcs7-signature" (5214 bytes)
Powered by blists - more mailing lists