linux-kernel - RE: [RFC/RFT][PATCH v8] cpuidle: New timer events oriented governor for tickless systems

Message-ID: <000d01d57da0$8410f1c0$8c32d540$@net>
Date:   Mon, 7 Oct 2019 23:20:38 -0700
From:   "Doug Smythies" <dsmythies@...us.net>
To:     "'Rafael J. Wysocki'" <rafael@...nel.org>
Cc:     "'Rafael J. Wysocki'" <rjw@...ysocki.net>,
        "'Srinivas Pandruvada'" <srinivas.pandruvada@...ux.intel.com>,
        "'Peter Zijlstra'" <peterz@...radead.org>,
        "'LKML'" <linux-kernel@...r.kernel.org>,
        "'Frederic Weisbecker'" <frederic@...nel.org>,
        "'Mel Gorman'" <mgorman@...e.de>,
        "'Daniel Lezcano'" <daniel.lezcano@...aro.org>,
        "'Chen, Hu'" <hu1.chen@...el.com>,
        "'Quentin Perret'" <quentin.perret@....com>,
        "'Linux PM'" <linux-pm@...r.kernel.org>,
        "'Giovanni Gherdovich'" <ggherdovich@...e.cz>
Subject: RE: [RFC/RFT][PATCH v8] cpuidle: New timer events oriented governor for tickless systems

On 2019.10.06 08:34 Rafael J. Wysocki wrote:
> On Sun, Oct 6, 2019 at 4:46 PM Doug Smythies <dsmythies@...us.net> wrote:
>> On 2019.10.01 02:32 Rafael J. Wysocki wrote:
>>> On Sun, Sep 29, 2019 at 6:05 PM Doug Smythies <dsmythies@...us.net> wrote:
>>>> On 2019.09.26 09:32 Doug Smythies wrote:
>>>>
>>>>> If the deepest idle state is disabled, the system
>>>>> can become somewhat unstable. The result ranges from no problem
>>>>> at all, to an occasional temporary jump to much higher
>>>>> power use for a few seconds, to a permanent jump to much higher
>>>>> power use. I have been unable to isolate the exact
>>>>> test load conditions under which this occurs. However,
>>>>> temporarily disabling and then enabling other idle states
>>>>> seems to make for a somewhat repeatable test. It is important
>>>>> to note that the issue also occurs when only the deepest
>>>>> idle state is ever disabled, just not reliably.
>>>>>
>>>>> I want to know how you want to proceed before I do a bunch of
>>>>> regression testing.
>>>>
>> I do not think I stated it clearly before: The problem here is that some CPUs
>> seem to get stuck in idle state 0, and when they do, power consumption spikes,
>> often by several hundred % and often indefinitely.
>
> That indeed has not been clear to me, thanks for the clarification!

>
>> I made a hack job automated test:
>> Kernel         tests   fail rate
>> 5.4-rc1         6616   13.45%
>> 5.3             2376    4.50%
>> 5.3-teov7      12136    0.00%  <<< teo.c reverted and teov7 put in its place.
>> 5.4-rc1-ds     11168    0.00%  <<< [old] proposed patch (> 7 hours test time)
   5.4-rc1-ds12    4224    0.005  <<< new proposed patch

>>
>> [old] Proposed patch (on top of kernel 5.4-rc1): [deleted]

> This change may cause the deepest state to be selected even if its
> "hits" metric is less than the "misses" one AFAICS, in which case the
> max_early_index state should be selected instead.
> 
> It looks like the max_early_index computation is broken when the
> deepest state is disabled.

O.K. Thanks for your quick reply and insight.

I think long durations always need to be counted, but currently they are
not when the deepest idle state is disabled.
How about this?
(Test results are added above; more tests are pending if this might be a path forward.)

diff --git a/drivers/cpuidle/governors/teo.c b/drivers/cpuidle/governors/teo.c
index b5a0e49..a970d2c 100644
--- a/drivers/cpuidle/governors/teo.c
+++ b/drivers/cpuidle/governors/teo.c
@@ -155,10 +155,12 @@ static void teo_update(struct cpuidle_driver *drv, struct cpuidle_device *dev)

                cpu_data->states[i].early_hits -= early_hits >> DECAY_SHIFT;

-               if (drv->states[i].target_residency <= sleep_length_us) {
-                       idx_timer = i;
-                       if (drv->states[i].target_residency <= measured_us)
-                               idx_hit = i;
+               if (!(drv->states[i].disabled || dev->states_usage[i].disable)) {
+                       if (drv->states[i].target_residency <= sleep_length_us) {
+                               idx_timer = i;
+                               if (drv->states[i].target_residency <= measured_us)
+                                       idx_hit = i;
+                       }
                }
        }

@@ -256,39 +258,25 @@ static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
                struct cpuidle_state *s = &drv->states[i];
                struct cpuidle_state_usage *su = &dev->states_usage[i];

-               if (s->disabled || su->disable) {
-                       /*
-                        * If the "early hits" metric of a disabled state is
-                        * greater than the current maximum, it should be taken
-                        * into account, because it would be a mistake to select
-                        * a deeper state with lower "early hits" metric.  The
-                        * index cannot be changed to point to it, however, so
-                        * just increase the max count alone and let the index
-                        * still point to a shallower idle state.
-                        */
-                       if (max_early_idx >= 0 &&
-                           count < cpu_data->states[i].early_hits)
-                               count = cpu_data->states[i].early_hits;
-
-                       continue;
-               }

-               if (idx < 0)
-                       idx = i; /* first enabled state */
+               if (!(s->disabled || su->disable)) {
+                       if (idx < 0)
+                               idx = i; /* first enabled state */

-               if (s->target_residency > duration_us)
-                       break;
+                       if (s->target_residency > duration_us)
+                               break;

-               if (s->exit_latency > latency_req && constraint_idx > i)
-                       constraint_idx = i;
+                       if (s->exit_latency > latency_req && constraint_idx > i)
+                               constraint_idx = i;

-               idx = i;
+                       idx = i;

-               if (count < cpu_data->states[i].early_hits &&
-                   !(tick_nohz_tick_stopped() &&
-                     drv->states[i].target_residency < TICK_USEC)) {
-                       count = cpu_data->states[i].early_hits;
-                       max_early_idx = i;
+                       if (count < cpu_data->states[i].early_hits &&
+                           !(tick_nohz_tick_stopped() &&
+                             drv->states[i].target_residency < TICK_USEC)) {
+                               count = cpu_data->states[i].early_hits;
+                               max_early_idx = i;
+                       }
                }
        }
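
To illustrate what the teo_update() hunk above is after, here is a minimal
standalone sketch of the index search with disabled states skipped. It is
not kernel code: struct model_state and classify_wakeup() are hypothetical
names made up for this illustration, and the real governor works on
drv->states[] and dev->states_usage[] and also maintains the hits, misses
and early_hits metrics, which are left out here.

```c
#include <stdbool.h>

/* Hypothetical, simplified stand-in for one idle state (not a kernel type). */
struct model_state {
	unsigned int target_residency_us;
	bool disabled;	/* models drv->states[i].disabled || dev->states_usage[i].disable */
};

/*
 * Simplified model of the loop in the teo_update() hunk: find the deepest
 * enabled state matching the sleep length (idx_timer) and the deepest
 * enabled state that also matches the measured idle duration (idx_hit).
 * Both are set to -1 if no enabled state qualifies.
 */
static void classify_wakeup(const struct model_state *states, int count,
			    unsigned int sleep_length_us,
			    unsigned int measured_us,
			    int *idx_timer, int *idx_hit)
{
	*idx_timer = -1;
	*idx_hit = -1;

	for (int i = 0; i < count; i++) {
		/* Disabled states never become the timer or hit index. */
		if (states[i].disabled)
			continue;

		if (states[i].target_residency_us <= sleep_length_us) {
			*idx_timer = i;
			if (states[i].target_residency_us <= measured_us)
				*idx_hit = i;
		}
	}
}
```

With four states whose target residencies are 0, 2, 20 and 200 us, the
deepest one disabled, a sleep length of 5000 us and a measured duration of
4000 us, this model assigns the long idle period to state 2, the deepest
enabled state, rather than to the disabled state 3. That is the sense in
which long durations get counted under these assumptions.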