linux-kernel - Re: [RFC/RFT][PATCH v8] cpuidle: New timer events oriented governor for tickless systems

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAJZ5v0jvusVBcKECBueDHk5KQGda=GGuSGPO3F4wCvk3cro56A@mail.gmail.com>
Date:   Tue, 8 Oct 2019 12:49:01 +0200
From:   "Rafael J. Wysocki" <rafael@...nel.org>
To:     Doug Smythies <dsmythies@...us.net>
Cc:     "Rafael J. Wysocki" <rjw@...ysocki.net>,
        Srinivas Pandruvada <srinivas.pandruvada@...ux.intel.com>,
        Peter Zijlstra <peterz@...radead.org>,
        LKML <linux-kernel@...r.kernel.org>,
        Frederic Weisbecker <frederic@...nel.org>,
        Mel Gorman <mgorman@...e.de>,
        Daniel Lezcano <daniel.lezcano@...aro.org>,
        "Chen, Hu" <hu1.chen@...el.com>,
        Quentin Perret <quentin.perret@....com>,
        Linux PM <linux-pm@...r.kernel.org>,
        Giovanni Gherdovich <ggherdovich@...e.cz>
Subject: Re: [RFC/RFT][PATCH v8] cpuidle: New timer events oriented governor
 for tickless systems

On Tue, Oct 8, 2019 at 11:51 AM Rafael J. Wysocki <rafael@...nel.org> wrote:
>
> On Tue, Oct 8, 2019 at 8:20 AM Doug Smythies <dsmythies@...us.net> wrote:
> >
> > On 2019.10.06 08:34 Rafael J. Wysocki wrote:
> > > On Sun, Oct 6, 2019 at 4:46 PM Doug Smythies <dsmythies@...us.net> wrote:
> > >> On 2019.10.01 02:32 Rafael J. Wysocki wrote:
> > >>> On Sun, Sep 29, 2019 at 6:05 PM Doug Smythies <dsmythies@...us.net> wrote:
> > >>>> On 2019.09.26 09:32 Doug Smythies wrote:
> > >>>>
> > >>>>> If the deepest idle state is disabled, the system
> > >>>>> can become somewhat unstable, with anywhere between no problem
> > >>>>> at all, to the occasional temporary jump using a lot more
> > >>>>> power for a few seconds, to a permanent jump using a lot more
> > >>>>> power continuously. I have been unable to isolate the exact
> > >>>>> test load conditions under which this will occur. However,
> > >>>>> temporarily disabling and then enabling other idle states
> > >>>>> seems to make for a somewhat repeatable test. It is important
> > >>>>> to note that the issue occurs with only ever disabling the deepest
> > >>>>> idle state, just not reliably.
> > >>>>>
> > >>>>> I want to know how you want to proceed before I do a bunch of
> > >>>>> regression testing.
> > >>>>
> > >> I do not think I stated it clearly before: The problem here is that some CPUs
> > >> seem to get stuck in idle state 0, and when they do power consumption spikes,
> > >> often by several hundred % and often indefinitely.
> > >
> > > That indeed has not been clear to me, thanks for the clarification!
> >
> > >
> > >> I made a hack job automated test:
> > >> Kernel  tests                 fail rate
> > >> 5.4-rc1                6616           13.45%
> > >> 5.3              2376            4.50%
> > >> 5.3-teov7       12136            0.00%  <<< teo.c reverted and teov7 put in its place.
> > >> 5.4-rc1-ds      11168        0.00%  <<< [old] proposed patch (> 7 hours test time)
> >
> >
> >    5.4-rc1-ds12   4224          0.005 <<< new proposed patch
> >
> > >>
> > >> [old] Proposed patch (on top of kernel 5.4-rc1): [deleted]
> >
> > > This change may cause the deepest state to be selected even if its
> > > "hits" metric is less than the "misses" one AFAICS, in which case the
> > > max_early_index state should be selected instead.
> > >
> > > It looks like the max_early_index computation is broken when the
> > > deepest state is disabled.
> >
> > O.K. Thanks for your quick reply, and insight.
> >
> > I think long durations always need to be counted, but currently if
> > the deepest idle state is disabled, they are not.
> > How about this?:
> > (test results added above, more tests pending if this might be a path forward.)
> >
> > diff --git a/drivers/cpuidle/governors/teo.c b/drivers/cpuidle/governors/teo.c
> > index b5a0e49..a970d2c 100644
> > --- a/drivers/cpuidle/governors/teo.c
> > +++ b/drivers/cpuidle/governors/teo.c
> > @@ -155,10 +155,12 @@ static void teo_update(struct cpuidle_driver *drv, struct cpuidle_device *dev)
> >
> >                 cpu_data->states[i].early_hits -= early_hits >> DECAY_SHIFT;
> >
> > -               if (drv->states[i].target_residency <= sleep_length_us) {
> > -                       idx_timer = i;
> > -                       if (drv->states[i].target_residency <= measured_us)
> > -                               idx_hit = i;
> > +               if (!(drv->states[i].disabled || dev->states_usage[i].disable)){
> > +                       if (drv->states[i].target_residency <= sleep_length_us) {
> > +                               idx_timer = i;
> > +                               if (drv->states[i].target_residency <= measured_us)
> > +                                       idx_hit = i;
> > +                       }
>
> What if the state is enabled again after some time?

Actually, the states are treated as "bins" here, so for the metrics it
doesn't matter whether or not they are enabled at the moment.

> >                 }
> >         }
> >
> > @@ -256,39 +258,25 @@ static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
> >                 struct cpuidle_state *s = &drv->states[i];
> >                 struct cpuidle_state_usage *su = &dev->states_usage[i];
> >
> > -               if (s->disabled || su->disable) {
> > -                       /*
> > -                        * If the "early hits" metric of a disabled state is
> > -                        * greater than the current maximum, it should be taken
> > -                        * into account, because it would be a mistake to select
> > -                        * a deeper state with lower "early hits" metric.  The
> > -                        * index cannot be changed to point to it, however, so
> > -                        * just increase the max count alone and let the index
> > -                        * still point to a shallower idle state.
> > -                        */
> > -                       if (max_early_idx >= 0 &&
> > -                           count < cpu_data->states[i].early_hits)
> > -                               count = cpu_data->states[i].early_hits;
> > -
> > -                       continue;
>
> AFAICS, adding early_hits to count is not a mistake if there are still
> enabled states deeper than the current one.

And the mistake appears to be that the "hits" and "misses" metrics
aren't handled in analogy with the "early_hits" one when the current
state is disabled.

Let me try to cut a patch to address that.