Message-ID: <20230728220109.GA3934165@hirez.programming.kicks-ass.net>
Date: Sat, 29 Jul 2023 00:01:09 +0200
From: Peter Zijlstra <peterz@...radead.org>
To: "Rafael J. Wysocki" <rafael@...nel.org>
Cc: anna-maria@...utronix.de, tglx@...utronix.de, frederic@...nel.org,
gautham.shenoy@....com, linux-kernel@...r.kernel.org,
daniel.lezcano@...aro.org, linux-pm@...r.kernel.org,
mingo@...hat.com, juri.lelli@...hat.com,
vincent.guittot@...aro.org, dietmar.eggemann@....com,
rostedt@...dmis.org, bsegall@...gle.com, mgorman@...e.de,
bristot@...hat.com, vschneid@...hat.com
Subject: Re: [RFC][PATCH 2/3] cpuidle,teo: Improve NOHZ management
On Fri, Jul 28, 2023 at 06:56:24PM +0200, Rafael J. Wysocki wrote:
> On Fri, Jul 28, 2023 at 5:01 PM Peter Zijlstra <peterz@...radead.org> wrote:
> > @@ -276,11 +276,11 @@ static void teo_update(struct cpuidle_dr
> >
> > cpu_data->total += bin->hits + bin->intercepts;
> >
> > - if (target_residency_ns <= cpu_data->sleep_length_ns) {
> > + if (target_residency_ns <= cpu_data->sleep_length_ns)
> > idx_timer = i;
> > - if (target_residency_ns <= measured_ns)
> > - idx_duration = i;
> > - }
> > +
> > + if (target_residency_ns <= measured_ns)
> > + idx_duration = i;
>
> I'm not quite sure what happens here.
Oh, I couldn't convince myself that measured_ns <= sleep_length_ns
always holds. If measured_ns were longer we'd still want the higher
index.
But yeah, I forgot I had that hunk in.
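To illustrate with made-up numbers (a standalone fragment, not the
kernel code): residencies of 0, 1ms and 4ms, a predicted sleep length
of 2ms, but a measured duration of 5ms:

	long long tr[] = { 0, 1000000, 4000000 };	/* residencies, ns */
	long long sleep_length_ns = 2000000;		/* predicted */
	long long measured_ns = 5000000;		/* what happened */
	int i, idx_timer = 0, idx_duration = 0;

	for (i = 0; i < 3; i++) {
		if (tr[i] <= sleep_length_ns)
			idx_timer = i;

		if (tr[i] <= measured_ns)
			idx_duration = i;
	}
	/* idx_timer == 1, idx_duration == 2; the old nested form
	 * could never let idx_duration exceed idx_timer */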
> > }
> >
> > i = cpu_data->next_recent_idx++;
> > @@ -362,11 +362,12 @@ static int teo_select(struct cpuidle_dri
> > unsigned int recent_sum = 0;
> > unsigned int idx_hit_sum = 0;
> > unsigned int hit_sum = 0;
> > + unsigned int tick_sum = 0;
> > int constraint_idx = 0;
> > int idx0 = 0, idx = -1;
> > bool alt_intercepts, alt_recent;
> > ktime_t delta_tick;
> > - s64 duration_ns;
> > + s64 duration_ns = TICK_NSEC;
> > int i;
> >
> > if (dev->last_state_idx >= 0) {
> > @@ -376,36 +377,26 @@ static int teo_select(struct cpuidle_dri
> >
> > cpu_data->time_span_ns = local_clock();
> >
> > - duration_ns = tick_nohz_get_sleep_length(&delta_tick);
> > - cpu_data->sleep_length_ns = duration_ns;
> > + /* Should we stop the tick? */
>
> Who's we? I'd prefer something like "Should the tick be stopped?"
> here (analogously below).
Sure.
> > + for (i = 1; i < drv->state_count; i++) {
> > + struct teo_bin *prev_bin = &cpu_data->state_bins[i-1];
> > + struct cpuidle_state *s = &drv->states[i];
> >
> > - /* Check if there is any choice in the first place. */
> > - if (drv->state_count < 2) {
> > - idx = 0;
> > - goto end;
> > - }
> > - if (!dev->states_usage[0].disable) {
> > - idx = 0;
> > - if (drv->states[1].target_residency_ns > duration_ns)
> > - goto end;
> > - }
> > + tick_sum += prev_bin->intercepts;
> > + tick_sum += prev_bin->hits;
> >
> > - cpu_data->utilized = teo_cpu_is_utilized(dev->cpu, cpu_data);
> > - /*
> > - * If the CPU is being utilized over the threshold and there are only 2
> > - * states to choose from, the metrics need not be considered, so choose
> > - * the shallowest non-polling state and exit.
> > - */
> > - if (drv->state_count < 3 && cpu_data->utilized) {
> > - for (i = 0; i < drv->state_count; ++i) {
> > - if (!dev->states_usage[i].disable &&
> > - !(drv->states[i].flags & CPUIDLE_FLAG_POLLING)) {
> > - idx = i;
> > - goto end;
> > - }
> > - }
> > + if (s->target_residency_ns >= SHORT_TICK_NSEC)
> > + break;
> > }
> >
> > + if (2*tick_sum > cpu_data->total)
> > + *stop_tick = false;
>
> This means "if over 50% of all the events fall into the buckets below
> the tick period length, don't stop the tick". Fair enough, but this
> covers the long term only; what about the most recent events? I think
> they need to be taken into account here too.
From looking at a few traces this 'long' term is around 8-10 samples.
Which I figured was quick enough.
Note that DECAY_SHIFT is 3, while the pulse is 10 bits, so 3-4 cycles
will drain most of the history when there's a distinct phase shift.
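To put numbers on that, a userspace toy (not the kernel code, just the
same arithmetic): start one bin at the steady-state value
PULSE << DECAY_SHIFT and direct all new wakeups at another:

	#include <stdio.h>

	#define PULSE		1024
	#define DECAY_SHIFT	3

	int main(void)
	{
		unsigned int old_bin = PULSE << DECAY_SHIFT; /* saturated */
		unsigned int new_bin = 0;	/* post phase-shift bin */
		int i;

		for (i = 1; i <= 10; i++) {
			/* every update decays all bins by 1/8th ... */
			old_bin -= old_bin >> DECAY_SHIFT;
			new_bin -= new_bin >> DECAY_SHIFT;
			/* ... and pulses only the bin the wakeup hit */
			new_bin += PULSE;
			printf("%2d: old=%4u new=%4u\n", i, old_bin, new_bin);
		}
		return 0;
	}

The new bin overtakes the old one at update 6 and is roughly 3x the
old one by update 10, consistent with the 8-10 samples above.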
That said, I did look at the 'recent' metrics and those seem geared
towards the intercepts, while I think hits+intercepts makes more sense
here. Given it adjusted quickly enough, I didn't put more time into it.
> > +
> > + /* If we do stop the tick, ask for the next timer. */
> > + if (*stop_tick)
> > + duration_ns = tick_nohz_get_sleep_length(&delta_tick);
> > + cpu_data->sleep_length_ns = duration_ns;
>
> If the decision is made to retain the tick and the time to the closest
> tick event is very small, it would be better to refine the state
> selection so as to avoid returning a state whose target residency is
> above that time (which is essentially a waste of energy). That's what
> delta_tick above is for, but now tick_nohz_get_sleep_length() is never
> called when the tick is not going to be stopped.
Right, so I did ponder using something like
ktime_sub(tick_nohz_get_next_hrtimer(), now) instead of TICK_NSEC to get
a more accurate measure, but I didn't do so yet.
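Something like this (untested sketch on top of the hunk above):

	/* When the tick is retained, bound the sleep length by the
	 * next hrtimer expiry rather than assuming a full tick. */
	if (*stop_tick)
		duration_ns = tick_nohz_get_sleep_length(&delta_tick);
	else
		duration_ns = ktime_sub(tick_nohz_get_next_hrtimer(),
					ktime_get());
	cpu_data->sleep_length_ns = duration_ns;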
> Besides, if I'm not mistaken, setting sleep_length_ns to TICK_NSEC
> every time the tick is not stopped will not really work on systems
> where there are real idle states with target residencies beyond
> TICK_NSEC.
It does work; you really don't want to select such a state while the
tick is still active -- you'll never get your residency (with HZ=250,
for example, the tick fires every 4 ms, so a state with a 10 ms target
residency can never meet it). Such a state should really only be used
when the tick is off.
> > +
> > /*
> > * Find the deepest idle state whose target residency does not exceed
> > * the current sleep length and the deepest idle state not deeper than
> > @@ -446,13 +437,13 @@ static int teo_select(struct cpuidle_dri
> > idx_recent_sum = recent_sum;
> > }
> >
> > - /* Avoid unnecessary overhead. */
> > - if (idx < 0) {
> > - idx = 0; /* No states enabled, must use 0. */
> > - goto end;
> > - } else if (idx == idx0) {
> > - goto end;
> > - }
> > + /* No states enabled, must use 0 */
> > + if (idx < 0)
> > + return 0;
> > +
> > + /* No point looking for something shallower than the first enabled state */
> > + if (idx == idx0)
> > + return idx;
> >
> > /*
> > * If the sum of the intercepts metric for all of the idle states
> > @@ -541,29 +532,9 @@ static int teo_select(struct cpuidle_dri
> > * If the CPU is being utilized over the threshold, choose a shallower
> > * non-polling state to improve latency
> > */
> > - if (cpu_data->utilized)
> > + if (teo_cpu_is_utilized(dev->cpu, cpu_data))
> > idx = teo_find_shallower_state(drv, dev, idx, duration_ns, true);
> >
> > -end:
> > - /*
> > - * Don't stop the tick if the selected state is a polling one or if the
> > - * expected idle duration is shorter than the tick period length.
> > - */
> > - if (((drv->states[idx].flags & CPUIDLE_FLAG_POLLING) ||
> > - duration_ns < TICK_NSEC) && !tick_nohz_tick_stopped()) {
> > - *stop_tick = false;
> > -
> > - /*
> > - * The tick is not going to be stopped, so if the target
> > - * residency of the state to be returned is not within the time
> > - * till the closest timer including the tick, try to correct
> > - * that.
> > - */
> > - if (idx > idx0 &&
> > - drv->states[idx].target_residency_ns > delta_tick)
> > - idx = teo_find_shallower_state(drv, dev, idx, delta_tick, false);
> > - }
> > -
> > return idx;
> > }
>
> Overall, I think that the problem with calling
> tick_nohz_get_sleep_length() is limited to the cases when the CPU is
> almost fully loaded, so the overall amount of idle time on it is tiny.
> I would rather use a special path for those cases and I would register
> all of the wakeups as "intercepts".
I'm not sure what you're proposing. If we track the tick+ bucket -- as
we must in order to say anything useful about it -- then we can decide
the tick state before calling tick_nohz_get_sleep_length() (as I do
here).
The timer-pull rework from Anna-Maria unfortunately makes the
tick_nohz_get_sleep_length() thing excessively expensive and it really
doesn't make sense to call it when we retain the tick.
It's all a bit of a chicken-and-egg situation: cpuidle wants to know
when the next timer is, but answering that requires knowing whether the
tick stays. We need to break that cycle somehow -- I propose doing so by
not calling it when we know we'll keep the tick.
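That is, condensed from the hunks above, the proposed ordering is:

	s64 duration_ns = TICK_NSEC;	/* assume the tick stays */

	/* decide the tick purely from the governor's own statistics */
	if (2 * tick_sum > cpu_data->total)
		*stop_tick = false;

	/* only ask the timer subsystem once that question is settled */
	if (*stop_tick)
		duration_ns = tick_nohz_get_sleep_length(&delta_tick);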