lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <1541446371.3441.9.camel@suse.cz>
Date:   Mon, 05 Nov 2018 20:32:51 +0100
From:   Giovanni Gherdovich <ggherdovich@...e.cz>
To:     "Rafael J. Wysocki" <rjw@...ysocki.net>,
        Linux PM <linux-pm@...r.kernel.org>,
        Doug Smythies <dsmythies@...us.net>
Cc:     Srinivas Pandruvada <srinivas.pandruvada@...ux.intel.com>,
        Peter Zijlstra <peterz@...radead.org>,
        LKML <linux-kernel@...r.kernel.org>,
        Frederic Weisbecker <frederic@...nel.org>,
        Mel Gorman <mgorman@...e.de>,
        Daniel Lezcano <daniel.lezcano@...aro.org>
Subject: Re: [RFC/RFT][PATCH v3] cpuidle: New timer events oriented governor
 for tickless systems

On Sun, 2018-11-04 at 17:31 +0100, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <rafael.j.wysocki@...el.com>
> 
> The venerable menu governor does some thigns that are quite
> questionable in my view.  First, it includes timer wakeups in
> the pattern detection data and mixes them up with wakeups from
> other sources which in some cases causes it to expect what
> essentially would be a timer wakeup in a time frame in which
> no timer wakeups are possible (becuase it knows the time until
> the next timer event and that is later than the expected wakeup
> time).  Second, it uses the extra exit latency limit based on
> the predicted idle duration and depending on the number of tasks
> waiting on I/O, even though those tasks may run on a different
> CPU when they are woken up.  Moreover, the time ranges used by it
> for the sleep length correction factors depend on whether or not
> there are tasks waiting on I/O, which again doesn't imply anything
> in particular, and they are not correlated to the list of available
> idle states in any way whatever.  Also,  the pattern detection code
> in menu may end up considering values that are too large to matter
> at all, in which cases running it is a waste of time.
> 
> A major rework of the menu governor would be required to address
> these issues and the performance of at least some workloads (tuned
> specifically to the current behavior of the menu governor) is likely
> to suffer from that.  It is thus better to introduce an entirely new
> governor without them and let everybody use the governor that works
> better with their actual workloads.
> 
> The new governor introduced here, the timer events oriented (TEO)
> governor, uses the same basic strategy as menu: it always tries to
> find the deepest idle state that can be used in the given conditions.
> However, it applies a different approach to that problem.  First, it
> doesn't use "correction factors" for the time till the closest timer,
> but instead it tries to correlate the measured idle duration values
> with the available idle states and use that information to pick up
> the idle state that is most likely to "match" the upcoming CPU idle
> interval.  Second, it doesn't take the number of "I/O waiters" into
> account at all and the pattern detection code in it tries to avoid
> taking timer wakeups into account.  It also only uses idle duration
> values less than the current time till the closest timer (with the
> tick excluded) for that purpose.
> 
> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@...el.com>
> ---
> 
> v2 -> v3:
>  * Simplify the pattern detection code and make it return a value
>         lower than the time to the closest timer if the majority of recent
>         idle intervals are below it regardless of their variance (that should
>         cause it to be slightly more aggressive).
>  * Do not count wakeups from state 0 due to the time limit in poll_idle()
>    as non-timer.
> 
> Note: I will be mostly offline tomorrow, so this goes slightly early.
> I have tested it only very lightly, but it is not so much different from
> the previous one.
> 
> It requires the same additional patches to apply as the previous one too.
> 
> ---
>  drivers/cpuidle/Kconfig            |   11 
>  drivers/cpuidle/governors/Makefile |    1 
>  drivers/cpuidle/governors/teo.c    |  491 +++++++++++++++++++++++++++++++++++++
>  3 files changed, 503 insertions(+)
> 
> Index: linux-pm/drivers/cpuidle/governors/teo.c
> ===================================================================
> --- /dev/null
> +++ linux-pm/drivers/cpuidle/governors/teo.c
> @@ -0,0 +1,491 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Timer events oriented CPU idle governor
> + *
> + * Copyright (C) 2018 Intel Corporation
> + * Author: Rafael J. Wysocki <rafael.j.wysocki@...el.com>
> + *
> + * The idea of this governor is based on the observation that on many systems
> + * timer events are two or more orders of magnitude more frequent than any
> + * other interrupts, so they are likely to be the most significant source of CPU
> + * wakeups from idle states.  Moreover, information about what happened in the
> + * (relatively recent) past can be used to estimate whether or not the deepest
> + * idle state with target residency within the time to the closest timer is
> + * likely to be suitable for the upcoming idle time of the CPU and, if not, then
> + * which of the shallower idle states to choose.
> + *
> + * Of course, non-timer wakeup sources are more important in some use cases and
> + * they can be covered by detecting patterns among recent idle time intervals
> + * of the CPU.  However, even in that case it is not necessary to take idle
> + * duration values greater than the time till the closest timer into account, as
> + * the patterns that they may belong to produce average values close enough to
> + * the time till the closest timer (sleep length) anyway.
> + *
> + * Thus this governor estimates whether or not the upcoming idle time of the CPU
> + * is likely to be significantly shorter than the sleep length and selects an
> + * idle state for it in accordance with that, as follows:
> + *
> + * - If there is a pattern of 5 or more recent non-timer wakeups earlier than
> + *   the closest timer event, expect one more of them to occur and use the
> + *   average of the idle duration values corresponding to them to select an
> + *   idle state for the CPU.
> + *
> + * - Otherwise, find the state on the basis of the sleep length and state
> + *   statistics collected over time:
> + *
> + *   o Find the deepest idle state whose target residency is less than or euqal
> + *     to the sleep length.
> + *
> + *   o Select it if it matched both the sleep length and the idle duration
> + *     measured after wakeup in the past more often than it matched the sleep
> + *     length, but not the idle duration (i.e. the measured idle duration was
> + *     significantly shorter than the sleep length matched by that state).
> + *
> + *   o Otherwise, select the shallower state with the greatest matched "early"
> + *     wakeups metric.
> + */
> +
> +#include <linux/cpuidle.h>
> +#include <linux/jiffies.h>
> +#include <linux/kernel.h>
> +#include <linux/sched/clock.h>
> +#include <linux/tick.h>
> +
> +/*
> + * The SPIKE value is added to metrics when they grow and the DECAY_SHIFT value
> + * is used for decreasing metrics on a regular basis.
> + */
> +#define SPIKE          1024
> +#define DECAY_SHIFT    3
> +
> +/*
> + * Number of the most recent idle duration values to take into consideration for
> + * the detection of wakeup patterns.
> + */
> +#define INTERVALS      8
> +
> +/*
> + * Ratio of the sample spread limit and the length of the interesting intervals
> + * range used for pattern detection, reptesented as a shift.
> + */
> +#define MAX_SPREAD_SHIFT       3
> +
> +/**
> + * struct teo_idle_state - Idle state data used by the TEO cpuidle governor.
> + * @early_hits: "Early" CPU wakeups matched by this state.
> + * @hits: "On time" CPU wakeups matched by this state.
> + * @misses: CPU wakeups "missed" by this state.
> + *
> + * A CPU wakeup is "matched" by a given idle state if the idle duration measured
> + * after the wakeup is between the target residency of that state and the target
> + * residnecy of the next one (or if this is the deepest available idle state, it
> + * "matches" a CPU wakeup when the measured idle duration is at least equal to
> + * its target residency).
> + *
> + * Also, from the TEO governor prespective, a CPU wakeup from idle is "early" if
> + * it occurs significantly earlier than the closest expected timer event (that
> + * is, early enough to match an idle state shallower than the one matching the
> + * time till the closest timer event).  Otherwise, the wakeup is "on time", or
> + * it is a "hit".
> + *
> + * A "miss" occurs when the given state doesn't match the wakeup, but it matches
> + * the time till the closest timer event used for idle state selection.
> + */
> +struct teo_idle_state {
> +       unsigned int early_hits;
> +       unsigned int hits;
> +       unsigned int misses;
> +};
> +
> +/**
> + * struct teo_cpu - CPU data used by the TEO cpuidle governor.
> + * @time_span_ns: Time between idle state selection and post-wakeup update.
> + * @sleep_length_ns: Time till the closest timer event (at the selection time).
> + * @states: Idle states data corresponding to this CPU.
> + * @last_state: Idle state entered by the CPU last time.
> + * @interval_idx: Index of the most recent saved idle interval.
> + * @intervals: Saved idle duration values.
> + */
> +struct teo_cpu {
> +       u64 time_span_ns;
> +       u64 sleep_length_ns;
> +       struct teo_idle_state states[CPUIDLE_STATE_MAX];
> +       int last_state;
> +       int interval_idx;
> +       unsigned int intervals[INTERVALS];
> +};
> +
> +static DEFINE_PER_CPU(struct teo_cpu, teo_cpus);
> +
> +/**
> + * teo_update - Update CPU data after wakeup.
> + * @drv: cpuidle driver containing state data.
> + * @dev: Target CPU.
> + */
> +static void teo_update(struct cpuidle_driver *drv, struct cpuidle_device *dev)
> +{
> +       struct teo_cpu *cpu_data = per_cpu_ptr(&teo_cpus, dev->cpu);
> +       unsigned int sleep_length_us = ktime_to_us(cpu_data->sleep_length_ns);
> +       int i, idx_hit = -1, idx_timer = -1;
> +       unsigned int measured_us;
> +
> +       if (cpu_data->time_span_ns == cpu_data->sleep_length_ns) {
> +               /* One of the safety nets has triggered (most likely). */
> +               measured_us = sleep_length_us;
> +       } else {
> +               measured_us = dev->last_residency;
> +               i = cpu_data->last_state;
> +               if (measured_us >= 2 * drv->states[i].exit_latency)
> +                       measured_us -= drv->states[i].exit_latency;
> +               else
> +                       measured_us /= 2;
> +       }
> +

I haven't read this v3 yet, so just a little comment on the bit above (which
is there since v1).

When you check for measured_us >= 2 * exit_latency, is that because
dev->last_residency is composed by an "entry" latency, then the actual
residency, and finally the exit_latency? I'm asking about the 2x factor
there.

If that succeeds, you proceed to remove the exit_latency from
measured_us... just once. Given how the condition is formulated, I expected
measured_us -= 2 * exit_latency there.

More: you acknowledge, in that snippet of code, that there can be
dev->last_residency's smaller than twice the exit_latency, i.e. not even the
time to entry/exit the state. Am I reading this right? Is that because both
exit_latency and dev->last_residency are only approximations?

I actually see quite a few of those extra-short residencies in my traces, even
with dev->last_residency < exit_latency.


Giovanni

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ