lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <d9b202d75c484ef3b636ce181d1c04ca@MSX-L104.msx.ad.zih.tu-dresden.de>
Date:   Tue, 20 Mar 2018 11:01:48 +0100
From:   Thomas Ilsche <thomas.ilsche@...dresden.de>
To:     "Rafael J. Wysocki" <rafael@...nel.org>,
        Peter Zijlstra <peterz@...radead.org>
CC:     Doug Smythies <dsmythies@...us.net>,
        Linux PM <linux-pm@...r.kernel.org>,
        Frederic Weisbecker <fweisbec@...il.com>,
        Thomas Gleixner <tglx@...utronix.de>,
        Paul McKenney <paulmck@...ux.vnet.ibm.com>,
        "Rik van Riel" <riel@...riel.com>,
        Aubrey Li <aubrey.li@...ux.intel.com>,
        "Mike Galbraith" <mgalbraith@...e.de>,
        LKML <linux-kernel@...r.kernel.org>
Subject: Re: [RFT][PATCH v5 0/7] sched/cpuidle: Idle loop rework

On 2018-03-18 17:15, Rafael J. Wysocki wrote:
>> Doug, Thomas,
>>
>> Thank you both for the reports, much appreciated!
>>
>> Below is a drop-in v6 replacement for patch [4/7].
>>
>> With this new patch applied instead of the [4/7] the behavior should be much
>> more in line with the v4 behavior, so please try it if you can and let me know
>> if that really is the case on your systems.
>>
>> Patches [5-7/7] from the original v5 apply on top of it right away for me,
>> but I've also created a git branch you can use to pull all of the series
>> with the below included:
>>
>>   git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git \
>>   idle-loop

Thanks for the git repo, that helps alot. I have tested v6 on a
Skylake desktop and server system as well as a Haswell server system.
The odd idle behavior of v5 is gone.

Some of the other findings may be obsolete by the upcoming respin,
I will retest.

Our originally observed Powernightmare pattern is effectively
prevented in both idle and with a synthetic trigger. However, I can
reproduce simple workloads under which the revised menu governor
wastes energy by going into *deeper* C-states than advisable.

Consider the Skylake server system which has residencies in C1E of
20 us and C6 of 800 us. I use a small while(1) {usleep(300);}
unsynchronized pinned to each core. While this is an artificial
case, it is a very innocent one - easy to predict and regular. Between
vanilla 4.16.0-rc5 and idle-loop/v6, the power consumption increases
from 149.7 W to 158.1 W. On 4.16.0-rc5, the cores sleep almost
entirely in C1E. With the patches applied, the cores spend ~75% of
their sleep time in C6, ~25% in C1E. The average time/usage for C1E is
also lower with v6 at ~350 us rather than the ~550 us in C6 (and in
C1E with the baseline). Generally the new menu governor seems to chose
C1E if the next timer is an enabled sched timer - which occasionally
interrupts the sleep-interval into two C1E sleeps rather than one C6.

Manually disabling C6, reduces power consumption back to 149.5 W.

This is far from what I expected, I did not yet figure out why the
patched menu governor decides to go to C6 under that workload. I
have tested this previously with v4 and saw this behavior even
without path "7/7".

The results from Haswell-EP and Skylake desktop are similar.

The tests are with a 1000 Hz kernel because I wanted to amplify
effects that happening when C-state residencies and tick timers are
closer together. But I suspect the results will be similar with
300 Hz as the impact from the sched tick interruption seems to be
minor compared to the odd C-state selection.

Some very raw illustrations, all from Skylake SP (2 == C1E, 3 == C6):
power consumption
trigger-10-10 is the synthetic Powernightmare
poller-omp-300 is the parallel  usleep(300) loop:
https://wwwpub.zih.tu-dresden.de/~tilsche/powernightmares/v6_skl_sp_power.png

cstate utilization with usleep(300) loop
(as per /sys/.../stateN/time / wall time)
https://wwwpub.zih.tu-dresden.de/~tilsche/powernightmares/v6_skl_sp_poll_300_utilization.png

average time spent in cstates
(as /sys/.../stateN/time / /sys/.../stateN/usage)
https://wwwpub.zih.tu-dresden.de/~tilsche/powernightmares/v6_skl_sp_poll_300_avg_time.png

detailed look:
https://wwwpub.zih.tu-dresden.de/~tilsche/powernightmares/v6_poll_300_skl.png


>>
>> Thanks!
>>
>> ---
>> From: Rafael J. Wysocki <rafael.j.wysocki@...el.com>
>> Subject: [PATCH v6] cpuidle: Return nohz hint from cpuidle_select()
>>
>> Add a new pointer argument to cpuidle_select() and to the ->select
>> cpuidle governor callback to allow a boolean value indicating
>> whether or not the tick should be stopped before entering the
>> selected state to be returned from there.
>>
>> Make the ladder governor ignore that pointer (to preserve its
>> current behavior) and make the menu governor return 'false" through
>> it if:
>>   (1) the idle exit latency is constrained at 0,
>>   (2) the selected state is a polling one, or
>>   (3) the selected state is not deep enough.
>>
>> Since the value returned through the new argument pointer is not
>> used yet, this change is not expected to alter the functionality of
>> the code.
>>
>> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@...el.com>
>> ---
> 
> [cut]
> 
>> @@ -354,6 +360,7 @@ static int menu_select(struct cpuidle_dr
>>          if (latency_req > interactivity_req)
>>                  latency_req = interactivity_req;
>>
>> +       expected_interval = TICK_USEC_HZ;
>>          /*
>>           * Find the idle state with the lowest power while satisfying
>>           * our constraints.
>> @@ -367,17 +374,44 @@ static int menu_select(struct cpuidle_dr
>>                          continue;
>>                  if (idx == -1)
>>                          idx = i; /* first enabled state */
>> -               if (s->target_residency > data->predicted_us)
>> +               if (s->target_residency > data->predicted_us) {
>> +                       /*
>> +                        * Retain the tick if the selected state is shallower
>> +                        * than the deepest available one with target residency
>> +                        * within the tick period range.
>> +                        *
>> +                        * This allows the tick to be stopped even if the
>> +                        * predicted idle duration is within the tick period
>> +                        * range to counter the effect by which the prediction
>> +                        * may be skewed towards lower values due to the tick
>> +                        * bias.
>> +                        */
>> +                       expected_interval = s->target_residency;
>>                          break;
> 
> BTW, I guess I need to explain the motivation here more thoroughly, so
> here it goes.
> 
> The governor predicts idle duration under the assumption that the
> tick will be stopped, so if the result of the prediction is within the tick
> period range and it is not accurate, that needs to be taken into
> account in the governor's statistics.  However, if the tick is allowed
> to run every time the governor predicts idle duration within the tick
> period range, the governor will always see that it was "almost
> right" and the correction factor applied by it to improve the
> prediction next time will not be sufficient.  For this reason, it
> is better to stop the tick at least sometimes when the governor
> predicts idle duration within the tick period range and the idea
> here is to do that when the selected state is the deepest available
> one with the target residency within the tick period range.  This
> allows the opportunity to save more energy to be seized which
> balances the extra overhead of stopping the tick.
> 
> HTH
> 

-- 
Dipl. Inf. Thomas Ilsche
Computer Scientist
Highly Adaptive Energy-Efficient Computing
CRC 912 HAEC: http://tu-dresden.de/sfb912
Technische Universität Dresden
Center for Information Services and High Performance Computing (ZIH)
01062 Dresden, Germany

Phone: +49 351 463-42168
Fax: +49 351 463-37773
E-Mail: thomas.ilsche@...dresden.de


Download attachment "smime.p7s" of type "application/pkcs7-signature" (5214 bytes)

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ