Message-ID: <CAJZ5v0iZ9z_rfD4raSD9n-HQGYKy85A=u3MdC7DJUAmyh8dmkw@mail.gmail.com>
Date: Mon, 20 Jan 2025 17:24:04 +0100
From: "Rafael J. Wysocki" <rafael@...nel.org>
To: Christian Loehle <christian.loehle@....com>
Cc: "Rafael J. Wysocki" <rjw@...ysocki.net>, Linux PM <linux-pm@...r.kernel.org>, 
	LKML <linux-kernel@...r.kernel.org>, Daniel Lezcano <daniel.lezcano@...aro.org>, 
	Artem Bityutskiy <artem.bityutskiy@...ux.intel.com>
Subject: Re: [PATCH v1 7/9] cpuidle: teo: Skip getting the sleep length is
 wakeups are very frequent

On Mon, Jan 20, 2025 at 1:08 PM Christian Loehle
<christian.loehle@....com> wrote:
>
> On 1/13/25 18:48, Rafael J. Wysocki wrote:
> > From: Rafael J. Wysocki <rafael.j.wysocki@...el.com>
> >
>
> Title has a typo: s/is/if

Fixed when applying the patch.

> > Commit 6da8f9ba5a87 ("cpuidle: teo: Skip tick_nohz_get_sleep_length()
> > call in some cases") attempted to reduce the governor overhead in some
> > cases by making it avoid obtaining the sleep length (the time till the
> > next timer event) which may be costly.
> >
> > Among other things, after the above commit, tick_nohz_get_sleep_length()
> > was not called any more when idle state 0 was to be returned, which
> > turned out to be problematic and the previous behavior in that respect
> > was restored by commit 4b20b07ce72f ("cpuidle: teo: Don't count non-
> > existent intercepts").
> >
> > However, commit 6da8f9ba5a87 also caused the governor to avoid calling
> > tick_nohz_get_sleep_length() on systems where idle state 0 is a "polling"
> > one (that is, it is not really an idle state, but a loop continuously
> > executed by the CPU) when the target residency of the idle state to be
> > returned was low enough, so there was no practical need to refine the
> > idle state selection in any way.  This change was not removed by the
> > other commit, so now on systems where idle state 0 is a "polling" one,
> > tick_nohz_get_sleep_length() is called when idle state 0 is to be
> > returned, but it is not called when a deeper idle state with
> > sufficiently low target residency is to be returned.  That is arguably
> > confusing and inconsistent.
> >
> > Moreover, there is no specific reason why the behavior in question
> > should depend on whether or not idle state 0 is a "polling" one.
> >
> > One way to address this would be to make the governor always call
> > tick_nohz_get_sleep_length() to obtain the sleep length, but that would
> > effectively mean reverting commit 6da8f9ba5a87 and restoring the latency
> > issue that was the reason for doing it.  This approach is thus not
> > particularly attractive.
> >
> > To address it differently, notice that if a CPU is woken up very often,
> > this is not likely to be caused by timers in the first place (user space
> > has a default timer slack of 50 us and there are relatively few timers
> > with a deadline shorter than several microseconds in the kernel) and
> > even if it were the case, the potential benefit from using a deep idle
> > state would then be questionable for latency reasons.  Therefore, if the
> > majority of CPU wakeups occur within several microseconds, it can be
> > assumed that all wakeups in that range are non-timer and the sleep
> > length need not be determined.
> >
> > Accordingly, introduce a new metric for counting wakeups with the
> > measured idle duration below RESIDENCY_THRESHOLD_NS and modify the idle
> > state selection to skip the tick_nohz_get_sleep_length() invocation if
> > idle state 0 has been selected or the target residency of the candidate
> > idle state is below RESIDENCY_THRESHOLD_NS and the value of the new
> > metric is at least 1/2 of the total event count.
> >
> > Since the above requires the measured idle duration to be determined
> > every time, except for the cases when one of the safety nets has
> > triggered in which the wakeup is counted as a hit in the deepest
> > idle state idle residency range, update the handling of those cases
> > to avoid skipping the idle duration computation when the CPU wakeup
> > is "genuine".
> >
> > Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@...el.com>
> > ---
> >  drivers/cpuidle/governors/teo.c |   58 ++++++++++++++++++++++++----------------
> >  1 file changed, 36 insertions(+), 22 deletions(-)
> >
> > --- a/drivers/cpuidle/governors/teo.c
> > +++ b/drivers/cpuidle/governors/teo.c
> > @@ -129,6 +129,7 @@
> >   * @state_bins: Idle state data bins for this CPU.
> >   * @total: Grand total of the "intercepts" and "hits" metrics for all bins.
> >   * @tick_intercepts: "Intercepts" before TICK_NSEC.
> > + * @short_idle: Wakeups after short idle periods.
> >   */
> >  struct teo_cpu {
> >       s64 time_span_ns;
> > @@ -136,6 +137,7 @@
> >       struct teo_bin state_bins[CPUIDLE_STATE_MAX];
> >       unsigned int total;
> >       unsigned int tick_intercepts;
> > +     unsigned int short_idle;
>
> Maybe call these short_idles?

I renamed it accordingly when applying the patch.

> >  };
> >
> >  static DEFINE_PER_CPU(struct teo_cpu, teo_cpus);
> > @@ -152,12 +154,12 @@
> >       s64 target_residency_ns;
> >       u64 measured_ns;
> >
> > -     if (cpu_data->time_span_ns >= cpu_data->sleep_length_ns) {
> > +     cpu_data->short_idle -= cpu_data->short_idle >> DECAY_SHIFT;
> > +
> > +     if (cpu_data->time_span_ns < 0) {
> >               /*
> > -              * This causes the wakeup to be counted as a hit regardless of
> > -              * regardless of the real idle duration which doesn't need to be
> > -              * computed because the wakeup has been close enough to an
> > -              * anticipated timer.
> > +              * If one of the safety nets has triggered, assume that this
> > +              * might have been a long sleep.
> >                */
> >               measured_ns = U64_MAX;
> >       } else {
> > @@ -177,10 +179,14 @@
> >                * time, so take 1/2 of the exit latency as a very rough
> >                * approximation of the average of it.
> >                */
> > -             if (measured_ns >= lat_ns)
> > +             if (measured_ns >= lat_ns) {
> >                       measured_ns -= lat_ns / 2;
> > -             else
> > +                     if (measured_ns < RESIDENCY_THRESHOLD_NS)
> > +                             cpu_data->short_idle += PULSE;
> > +             } else {
> >                       measured_ns /= 2;
> > +                     cpu_data->short_idle += PULSE;
> > +             }
> >       }
> >
> >       cpu_data->total = 0;
> > @@ -419,27 +425,35 @@
> >       if (idx > constraint_idx)
> >               idx = constraint_idx;
> >
> > -     if (!idx) {
> > -             /*
> > -              * Query the sleep length to be able to count the wakeup as a
> > -              * hit if it is caused by a timer.
> > -              */
> > -             cpu_data->sleep_length_ns = tick_nohz_get_sleep_length(&delta_tick);
> > -             goto out_tick;
> > -     }
> > -
> >       /*
> > -      * If state 0 is a polling one, check if the target residency of
> > -      * the current candidate state is low enough and skip the timers
> > -      * check in that case too.
> > +      * If either the candidate state is state 0 or its target residency is
> > +      * low enough, there is basically nothing more to do, but if the sleep
> > +      * length is not updated, the subsequent wakeup will be counted as an
> > +      * "intercept" which may be problematic in the cases when timer wakeups
> > +      * are dominant.  Namely, it may effectively prevent deeper idle states
> > +      * from being selected at one point even if no imminent timers are
> > +      * scheduled.
> > +      *
> > +      * However, frequent timers in the RESIDENCY_THRESHOLD_NS range on one
> > +      * CPU are unlikely (user space has a default 50 us slack value for
> > +      * hrtimers and there are relatively few timers with a lower deadline
> > +      * value in the kernel), and even if they did happene, the potential
>
> s/happene/happen

Fixed when applying the patch.

> > +      * benefit from using a deep idle state in that case would be
> > +      * questionable anyway for latency reasons.  Thus if the measured idle
> > +      * duration falls into that range in the majority of cases, assume
> > +      * non-timer wakeups to be dominant and skip updating the sleep length
> > +      * to reduce latency.
> >        */
> > -     if ((drv->states[0].flags & CPUIDLE_FLAG_POLLING) &&
> > -         drv->states[idx].target_residency_ns < RESIDENCY_THRESHOLD_NS)
> > +     if ((!idx || drv->states[idx].target_residency_ns < RESIDENCY_THRESHOLD_NS) &&
> > +         2 * cpu_data->short_idle >= cpu_data->total)
> >               goto out_tick;
> >
> >       duration_ns = tick_nohz_get_sleep_length(&delta_tick);
> >       cpu_data->sleep_length_ns = duration_ns;
> >
> > +     if (!idx)
> > +             goto out_tick;
> > +
> >       /*
> >        * If the closest expected timer is before the target residency of the
> >        * candidate state, a shallower one needs to be found.
> > @@ -501,7 +515,7 @@
> >       if (dev->poll_time_limit ||
> >           (tick_nohz_idle_got_tick() && cpu_data->sleep_length_ns > TICK_NSEC)) {
> >               dev->poll_time_limit = false;
> > -             cpu_data->time_span_ns = cpu_data->sleep_length_ns;
> > +             cpu_data->time_span_ns = KTIME_MIN;
> >       } else {
> >               cpu_data->time_span_ns = local_clock() - cpu_data->time_span_ns;
> >       }
> >
>
> Thanks, I like this approach.
> Reviewed-by: Christian Loehle <christian.loehle@....com>

Thank you!
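
For reference, here is a minimal userspace sketch of the heuristic described in the
commit message above: keep a decaying count of "short idle" wakeups and skip the
sleep-length query when they make up at least half of all recent events.  The
constants, the simplified "total" bookkeeping, and all names below are illustrative
assumptions for this sketch only, not the actual teo governor code.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define RESIDENCY_THRESHOLD_NS  15000ULL   /* assumed value for this sketch */
#define DECAY_SHIFT             3          /* decay rate, as in teo's metrics */
#define PULSE                   1024       /* per-event weight, as in teo's metrics */

struct sketch_cpu {
	unsigned int total;        /* decayed total of all wakeup events */
	unsigned int short_idle;   /* decayed count of short-idle wakeups */
};

/* Called on wakeup with the measured idle duration (exit latency removed). */
static void sketch_update(struct sketch_cpu *c, uint64_t measured_ns)
{
	/* In the real governor, total is rebuilt from the state bins;
	 * here it is decayed in place to keep the sketch self-contained. */
	c->short_idle -= c->short_idle >> DECAY_SHIFT;
	c->total -= c->total >> DECAY_SHIFT;

	if (measured_ns < RESIDENCY_THRESHOLD_NS)
		c->short_idle += PULSE;

	c->total += PULSE;
}

/*
 * Called at selection time: true means "skip querying the next timer event",
 * because the candidate state is shallow and at least half of the recent
 * wakeups were short, presumably non-timer ones.
 */
static bool sketch_skip_sleep_length(const struct sketch_cpu *c, int idx,
				     uint64_t target_residency_ns)
{
	if (idx && target_residency_ns >= RESIDENCY_THRESHOLD_NS)
		return false;

	return 2ULL * c->short_idle >= c->total;
}

int main(void)
{
	struct sketch_cpu c = { 0 };
	int i;

	/* A burst of very short sleeps makes the skip condition true. */
	for (i = 0; i < 16; i++)
		sketch_update(&c, 3000);

	printf("skip sleep-length query for shallow candidate: %d\n",
	       sketch_skip_sleep_length(&c, 1, 10000));
	return 0;
}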
