linux-kernel - Re: [RFC/RFT][PATCH v8] cpuidle: New timer events oriented governor for tickless systems

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAJZ5v0jo-KQouuE3P51THvU33kViBVtDq1WknBFx+FWUY0e=ag@mail.gmail.com>
Date:   Sun, 6 Oct 2019 17:34:15 +0200
From:   "Rafael J. Wysocki" <rafael@...nel.org>
To:     Doug Smythies <dsmythies@...us.net>
Cc:     "Rafael J. Wysocki" <rafael@...nel.org>,
        "Rafael J. Wysocki" <rjw@...ysocki.net>,
        Srinivas Pandruvada <srinivas.pandruvada@...ux.intel.com>,
        Peter Zijlstra <peterz@...radead.org>,
        LKML <linux-kernel@...r.kernel.org>,
        Frederic Weisbecker <frederic@...nel.org>,
        Mel Gorman <mgorman@...e.de>,
        Daniel Lezcano <daniel.lezcano@...aro.org>,
        "Chen, Hu" <hu1.chen@...el.com>,
        Quentin Perret <quentin.perret@....com>,
        Linux PM <linux-pm@...r.kernel.org>,
        Giovanni Gherdovich <ggherdovich@...e.cz>
Subject: Re: [RFC/RFT][PATCH v8] cpuidle: New timer events oriented governor
 for tickless systems

On Sun, Oct 6, 2019 at 4:46 PM Doug Smythies <dsmythies@...us.net> wrote:
>
> On 2019.10.01 02:32 Rafael J. Wysocki wrote:
> > On Sun, Sep 29, 2019 at 6:05 PM Doug Smythies <dsmythies@...us.net> wrote:
> >> On 2019.09.26 09:32 Doug Smythies wrote:
> >>
> >>> If the deepest idle state is disabled, the system
> >>> can become somewhat unstable, with anywhere between no problem
> >>> at all, to the occasional temporary jump using a lot more
> >>> power for a few seconds, to a permanent jump using a lot more
> >>> power continuously. I have been unable to isolate the exact
> >>> test load conditions under which this will occur. However,
> >>> temporarily disabling and then enabling other idle states
> >>> seems to make for a somewhat repeatable test. It is important
> >>> to note that the issue occurs with only ever disabling the deepest
> >>> idle state, just not reliably.
> >>>
> >>> I want to know how you want to proceed before I do a bunch of
> >>> regression testing.
> >>
> >> I did some regression testing anyhow, more to create and debug
> >> a methodology than anything else.
> >>
> >>> On 2018.12.11 03:50 Rafael J. Wysocki wrote:
> >>>
> >>>> v7 -> v8:
> >>>>  * Apply the selection rules to the idle deepest state as well as to
> >>>>    the shallower ones (the deepest idle state was treated differently
> >>>>    before by mistake).
> >>>>  * Subtract 1/2 of the exit latency from the measured idle duration
> >>>>    in teo_update() (instead of subtracting the entire exit latency).
> >>>>    This makes the idle state selection be slightly more performance-
> >>>>   oriented.
> >>>
> >>> I have isolated the issue to a subset of the v7 to v8 changes, however
> >>> it was not the exit latency changes.
> >>>
> >>> The partial revert to V7 changes I made were (on top of 5.3):
> >>
> >> The further testing showed a problem or two with my partial teo-v7 reversion
> >> (I call it teo-v12) under slightly different testing conditions.
>
> Correction:
> There was no problem with my partial reversion kernel (a.k.a. teo-v12). The problem
> was confusion over which kernel I was actually running for whatever test.
>
> >>
> >> I also have a 5.3 based kernel with the current teo reverted and the entire
> >> teo-v7 put in its place. I have yet to find a idle state disabled related issue
> >> with this kernel.
> >>
> >> I'll come back to this thread at a later date with better details and test results.
> >
> > Thanks for this work!
> >
> > Please also note that there is a teo patch in 5.4-rc1 that may make a
> > difference in principle.
>
> Yes, actually this saga started from somewhere between kernel 5.3 and 5.4-rc1,
> and did include those teo patches, which actually significantly increases the
> probability of the issue occurring.
>
> When the deepest idle state is disabled, and the all states search loop exits
> normally, it might incorrectly re-evaluate a previous idle state previously
> deemed not worthy of the check. This was introduced between teo development
> versions 7 and 8. The fix is to move the code back inside the loop.
> (I'll submit a patch in a day or two).

OK

> I do not think I stated it clearly before: The problem here is that some CPUs
> seem to get stuck in idle state 0, and when they do power consumption spikes,
> often by several hundred % and often indefinitely.

That indeed has not been clear to me, thanks for the clarification!

> I made a hack job automated test:
> Kernel  tests           fail rate
> 5.4-rc1  6616           13.45%
> 5.3              2376            4.50%
> 5.3-teov7       12136            0.00%  <<< teo.c reverted and teov7 put in its place.
> 5.4-rc1-ds      11168        0.00%  <<< proposed patch (> 7 hours test time)
>
> Proposed patch (on top of kernel 5.4-rc1):
>
> doug@s15:~/temp-k-git/linux$ git diff
> diff --git a/drivers/cpuidle/governors/teo.c b/drivers/cpuidle/governors/teo.c
> index b5a0e49..0502aa9 100644
> --- a/drivers/cpuidle/governors/teo.c
> +++ b/drivers/cpuidle/governors/teo.c
> @@ -276,8 +276,22 @@ static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
>                 if (idx < 0)
>                         idx = i; /* first enabled state */
>
> -               if (s->target_residency > duration_us)
> +               if (s->target_residency > duration_us){
> +                       /*
> +                        * If the "hits" metric of the idle state matching the sleep length is
> +                        * greater than its "misses" metric, that is the one to use.  Otherwise,
> +                        * it is more likely that one of the shallower states will match the
> +                        * idle duration observed after wakeup, so take the one with the maximum
> +                        * "early hits" metric, but if that cannot be determined, just use the
> +                        * state selected so far.
> +                        */
> +                       if (cpu_data->states[idx].hits <= cpu_data->states[idx].misses &&
> +                           max_early_idx >= 0) {
> +                               idx = max_early_idx;
> +                               duration_us = drv->states[idx].target_residency;
> +                       }
>                         break;
> +               }
>
>                 if (s->exit_latency > latency_req && constraint_idx > i)
>                         constraint_idx = i;
> @@ -293,20 +307,6 @@ static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
>         }
>
>         /*
> -        * If the "hits" metric of the idle state matching the sleep length is
> -        * greater than its "misses" metric, that is the one to use.  Otherwise,
> -        * it is more likely that one of the shallower states will match the
> -        * idle duration observed after wakeup, so take the one with the maximum
> -        * "early hits" metric, but if that cannot be determined, just use the
> -        * state selected so far.
> -        */
> -       if (cpu_data->states[idx].hits <= cpu_data->states[idx].misses &&
> -           max_early_idx >= 0) {
> -               idx = max_early_idx;
> -               duration_us = drv->states[idx].target_residency;
> -       }
> -
> -       /*
>          * If there is a latency constraint, it may be necessary to use a
>          * shallower idle state than the one selected so far.
>          */

This change may cause the deepest state to be selected even if its
"hits" metric is less than the "misses" one AFAICS, in which case the
max_early_index state should be selected instead.

It looks like the max_early_index computation is broken when the
deepest state is disabled.