linux-kernel - Re: [RFC/RFT][PATCH v8] cpuidle: New timer events oriented governor for tickless systems

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAJZ5v0gRSpNtwDXrRr9GW2O9ZQpM0yBdKfQDXLwsZua5692yUQ@mail.gmail.com>
Date:   Tue, 8 Oct 2019 11:51:31 +0200
From:   "Rafael J. Wysocki" <rafael@...nel.org>
To:     Doug Smythies <dsmythies@...us.net>
Cc:     "Rafael J. Wysocki" <rafael@...nel.org>,
        "Rafael J. Wysocki" <rjw@...ysocki.net>,
        Srinivas Pandruvada <srinivas.pandruvada@...ux.intel.com>,
        Peter Zijlstra <peterz@...radead.org>,
        LKML <linux-kernel@...r.kernel.org>,
        Frederic Weisbecker <frederic@...nel.org>,
        Mel Gorman <mgorman@...e.de>,
        Daniel Lezcano <daniel.lezcano@...aro.org>,
        "Chen, Hu" <hu1.chen@...el.com>,
        Quentin Perret <quentin.perret@....com>,
        Linux PM <linux-pm@...r.kernel.org>,
        Giovanni Gherdovich <ggherdovich@...e.cz>
Subject: Re: [RFC/RFT][PATCH v8] cpuidle: New timer events oriented governor
 for tickless systems

On Tue, Oct 8, 2019 at 8:20 AM Doug Smythies <dsmythies@...us.net> wrote:
>
> On 2019.10.06 08:34 Rafael J. Wysocki wrote:
> > On Sun, Oct 6, 2019 at 4:46 PM Doug Smythies <dsmythies@...us.net> wrote:
> >> On 2019.10.01 02:32 Rafael J. Wysocki wrote:
> >>> On Sun, Sep 29, 2019 at 6:05 PM Doug Smythies <dsmythies@...us.net> wrote:
> >>>> On 2019.09.26 09:32 Doug Smythies wrote:
> >>>>
> >>>>> If the deepest idle state is disabled, the system
> >>>>> can become somewhat unstable, with anywhere between no problem
> >>>>> at all, to the occasional temporary jump using a lot more
> >>>>> power for a few seconds, to a permanent jump using a lot more
> >>>>> power continuously. I have been unable to isolate the exact
> >>>>> test load conditions under which this will occur. However,
> >>>>> temporarily disabling and then enabling other idle states
> >>>>> seems to make for a somewhat repeatable test. It is important
> >>>>> to note that the issue occurs with only ever disabling the deepest
> >>>>> idle state, just not reliably.
> >>>>>
> >>>>> I want to know how you want to proceed before I do a bunch of
> >>>>> regression testing.
> >>>>
> >> I do not think I stated it clearly before: The problem here is that some CPUs
> >> seem to get stuck in idle state 0, and when they do power consumption spikes,
> >> often by several hundred % and often indefinitely.
> >
> > That indeed has not been clear to me, thanks for the clarification!
>
> >
> >> I made a hack job automated test:
> >> Kernel  tests                 fail rate
> >> 5.4-rc1                6616           13.45%
> >> 5.3              2376            4.50%
> >> 5.3-teov7       12136            0.00%  <<< teo.c reverted and teov7 put in its place.
> >> 5.4-rc1-ds      11168        0.00%  <<< [old] proposed patch (> 7 hours test time)
>
>
>    5.4-rc1-ds12   4224          0.005 <<< new proposed patch
>
> >>
> >> [old] Proposed patch (on top of kernel 5.4-rc1): [deleted]
>
> > This change may cause the deepest state to be selected even if its
> > "hits" metric is less than the "misses" one AFAICS, in which case the
> > max_early_index state should be selected instead.
> >
> > It looks like the max_early_index computation is broken when the
> > deepest state is disabled.
>
> O.K. Thanks for your quick reply, and insight.
>
> I think long durations always need to be counted, but currently if
> the deepest idle state is disabled, they are not.
> How about this?:
> (test results added above, more tests pending if this might be a path forward.)
>
> diff --git a/drivers/cpuidle/governors/teo.c b/drivers/cpuidle/governors/teo.c
> index b5a0e49..a970d2c 100644
> --- a/drivers/cpuidle/governors/teo.c
> +++ b/drivers/cpuidle/governors/teo.c
> @@ -155,10 +155,12 @@ static void teo_update(struct cpuidle_driver *drv, struct cpuidle_device *dev)
>
>                 cpu_data->states[i].early_hits -= early_hits >> DECAY_SHIFT;
>
> -               if (drv->states[i].target_residency <= sleep_length_us) {
> -                       idx_timer = i;
> -                       if (drv->states[i].target_residency <= measured_us)
> -                               idx_hit = i;
> +               if (!(drv->states[i].disabled || dev->states_usage[i].disable)){
> +                       if (drv->states[i].target_residency <= sleep_length_us) {
> +                               idx_timer = i;
> +                               if (drv->states[i].target_residency <= measured_us)
> +                                       idx_hit = i;
> +                       }

What if the state is enabled again after some time?

>                 }
>         }
>
> @@ -256,39 +258,25 @@ static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
>                 struct cpuidle_state *s = &drv->states[i];
>                 struct cpuidle_state_usage *su = &dev->states_usage[i];
>
> -               if (s->disabled || su->disable) {
> -                       /*
> -                        * If the "early hits" metric of a disabled state is
> -                        * greater than the current maximum, it should be taken
> -                        * into account, because it would be a mistake to select
> -                        * a deeper state with lower "early hits" metric.  The
> -                        * index cannot be changed to point to it, however, so
> -                        * just increase the max count alone and let the index
> -                        * still point to a shallower idle state.
> -                        */
> -                       if (max_early_idx >= 0 &&
> -                           count < cpu_data->states[i].early_hits)
> -                               count = cpu_data->states[i].early_hits;
> -
> -                       continue;

AFAICS, adding early_hits to count is not a mistake if there are still
enabled states deeper than the current one.

Besides, can you just leave the "continue" here instead of changing
the indentation level for everything below?

> -               }
>
> -               if (idx < 0)
> -                       idx = i; /* first enabled state */
> +               if (!(s->disabled || su->disable)) {
> +                       if (idx < 0)
> +                               idx = i; /* first enabled state */
>
> -               if (s->target_residency > duration_us)
> -                       break;
> +                       if (s->target_residency > duration_us)
> +                               break;
>
> -               if (s->exit_latency > latency_req && constraint_idx > i)
> -                       constraint_idx = i;
> +                       if (s->exit_latency > latency_req && constraint_idx > i)
> +                               constraint_idx = i;
>
> -               idx = i;
> +                       idx = i;
>
> -               if (count < cpu_data->states[i].early_hits &&
> -                   !(tick_nohz_tick_stopped() &&
> -                     drv->states[i].target_residency < TICK_USEC)) {
> -                       count = cpu_data->states[i].early_hits;
> -                       max_early_idx = i;
> +                       if (count < cpu_data->states[i].early_hits &&
> +                           !(tick_nohz_tick_stopped() &&
> +                             drv->states[i].target_residency < TICK_USEC)) {
> +                               count = cpu_data->states[i].early_hits;
> +                               max_early_idx = i;
> +                       }
>                 }
>         }