Date:	Mon, 18 Aug 2014 22:49:52 +0530
From:	Preeti U Murthy <preeti@...ux.vnet.ibm.com>
To:	Nicolas Pitre <nicolas.pitre@...aro.org>
CC:	alex.shi@...el.com, vincent.guittot@...aro.org,
	peterz@...radead.org, pjt@...gle.com, efault@....de,
	rjw@...ysocki.net, morten.rasmussen@....com,
	svaidy@...ux.vnet.ibm.com, arjan@...ux.intel.com, mingo@...nel.org,
	len.brown@...el.com, yuyang.du@...el.com,
	linaro-kernel@...ts.linaro.org, daniel.lezcano@...aro.org,
	corbet@....net, catalin.marinas@....com, markgross@...gnar.org,
	sundar.iyer@...el.com, linux-kernel@...r.kernel.org,
	dietmar.eggemann@....com, Lorenzo.Pieralisi@....com,
	mike.turquette@...aro.org, akpm@...ux-foundation.org,
	paulmck@...ux.vnet.ibm.com, tglx@...utronix.de
Subject: Re: [RFC PATCH V2 02/19] sched/power: Move idle state selection into
 the scheduler

On 08/18/2014 09:24 PM, Nicolas Pitre wrote:
> On Mon, 11 Aug 2014, Preeti U Murthy wrote:
> 
>> The goal of the power aware scheduling design is to integrate all
>> policy, metrics and averaging into the scheduler. Today, cpu power
>> management is fragmented and hence inconsistent.
>>
>> As a first step towards this integration, rid the cpuidle state management
>> of the governors. Retain only the cpuidle driver in the cpuidle
>> subsystem, which acts as an interface between the scheduler and the low
>> level platform specific cpuidle drivers. For all decision making around
>> selection of idle states, the cpuidle driver falls back to the scheduler.
>>
>> The current algorithm for idle state selection is the same as the logic used
>> by the menu governor. However, going forward, the heuristics will be tuned
>> and improved upon with metrics better known to the scheduler.
> 
> I'd strongly suggest a different approach here.  Instead of copying the 
> menu governor code and tweaking it afterwards, it would be cleaner to 
> literally start from scratch with a new governor.  Said new governor 
> would grow inside the scheduler with more design freedom instead of 
> being strapped on the side.
> 
> By copying existing code, the chance for cruft to remain for a long time 
> is close to 100%. We already have one copy of it, let's keep it working 
> and start afresh instead.
> 
> By starting clean it is way easier to explain and justify additions to a 
> new design than convincing ourselves about the removal of no longer 
> needed pieces from a legacy design.

Ok. The reason I did it this way was that I did not find anything
grossly wrong in the current cpuidle governor algorithm. Of course it
can be improved, but I did not see strong reasons to completely wipe it
away. I see good scope to improve upon the existing algorithm with the
additional knowledge of *the idle states being mapped to scheduling
domains*. This will in itself give us a better algorithm and does not
mandate significant changes to the current one. So I really don't see
why we need to start from scratch.

The primary issue I found was that, with the goal being a power aware
scheduler, we must ensure that a governor can no longer register with
cpuidle to choose idle states. The reason is that there is just *one
entity that takes this decision, and there is no option about it*. This
patch intends to bring the focus to this specific detail.

Regards
Preeti U Murthy
> 
> 
>>
>> Note: cpufreq is still left disabled when CONFIG_SCHED_POWER is selected.
>>
>> Signed-off-by: Preeti U Murthy <preeti@...ux.vnet.ibm.com>
>> ---
>>
>>  drivers/cpuidle/Kconfig           |   12 +
>>  drivers/cpuidle/cpuidle-powernv.c |    2 
>>  drivers/cpuidle/cpuidle.c         |   65 ++++-
>>  include/linux/sched.h             |    9 +
>>  kernel/sched/Makefile             |    1 
>>  kernel/sched/power.c              |  480 +++++++++++++++++++++++++++++++++++++
>>  6 files changed, 554 insertions(+), 15 deletions(-)
>>  create mode 100644 kernel/sched/power.c
>>
>> diff --git a/drivers/cpuidle/Kconfig b/drivers/cpuidle/Kconfig
>> index 2c4ac79..4fa4cb1 100644
>> --- a/drivers/cpuidle/Kconfig
>> +++ b/drivers/cpuidle/Kconfig
>> @@ -3,16 +3,14 @@ menu "CPU Idle"
>>  config CPU_IDLE
>>  	bool "CPU idle PM support"
>>  	default y if ACPI || PPC_PSERIES
>> -	depends on !SCHED_POWER
>> -	select CPU_IDLE_GOV_LADDER if (!NO_HZ && !NO_HZ_IDLE)
>> -	select CPU_IDLE_GOV_MENU if (NO_HZ || NO_HZ_IDLE)
>> +	select CPU_IDLE_GOV_LADDER if (!NO_HZ && !NO_HZ_IDLE && !SCHED_POWER)
>> +	select CPU_IDLE_GOV_MENU if ((NO_HZ || NO_HZ_IDLE) && !SCHED_POWER)
>>  	help
>>  	  CPU idle is a generic framework for supporting software-controlled
>>  	  idle processor power management.  It includes modular cross-platform
>>  	  governors that can be swapped during runtime.
>>  
>>  	  If you're using an ACPI-enabled platform, you should say Y here.
>> -	  This feature will turn off if power aware scheduling is enabled.
>>  
>>  if CPU_IDLE
>>  
>> @@ -22,10 +20,16 @@ config CPU_IDLE_MULTIPLE_DRIVERS
>>  config CPU_IDLE_GOV_LADDER
>>  	bool "Ladder governor (for periodic timer tick)"
>>  	default y
>> +	depends on !SCHED_POWER
>> +	help
>> +	  This feature will turn off if power aware scheduling is enabled.
>>  
>>  config CPU_IDLE_GOV_MENU
>>  	bool "Menu governor (for tickless system)"
>>  	default y
>> +	depends on !SCHED_POWER
>> +	help
>> +	  This feature will turn off if power aware scheduling is enabled.
>>  
>>  menu "ARM CPU Idle Drivers"
>>  depends on ARM
>> diff --git a/drivers/cpuidle/cpuidle-powernv.c b/drivers/cpuidle/cpuidle-powernv.c
>> index fa79392..95ef533 100644
>> --- a/drivers/cpuidle/cpuidle-powernv.c
>> +++ b/drivers/cpuidle/cpuidle-powernv.c
>> @@ -70,7 +70,7 @@ static int fastsleep_loop(struct cpuidle_device *dev,
>>  	unsigned long new_lpcr;
>>  
>>  	if (powersave_nap < 2)
>> -		return;
>> +		return 0;
>>  	if (unlikely(system_state < SYSTEM_RUNNING))
>>  		return index;
>>  
>> diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c
>> index ee9df5e..38fb213 100644
>> --- a/drivers/cpuidle/cpuidle.c
>> +++ b/drivers/cpuidle/cpuidle.c
>> @@ -150,6 +150,19 @@ int cpuidle_enter_state(struct cpuidle_device *dev, struct cpuidle_driver *drv,
>>  	return entered_state;
>>  }
>>  
>> +#ifdef CONFIG_SCHED_POWER
>> +static int __cpuidle_select(struct cpuidle_driver *drv,
>> +				struct cpuidle_device *dev)
>> +{
>> +	return cpuidle_sched_select(drv, dev);
>> +}
>> +#else
>> +static int __cpuidle_select(struct cpuidle_driver *drv,
>> +				struct cpuidle_device *dev)
>> +{
>> +	return cpuidle_curr_governor->select(drv, dev);	
>> +}
>> +#endif
>>  /**
>>   * cpuidle_select - ask the cpuidle framework to choose an idle state
>>   *
>> @@ -169,7 +182,7 @@ int cpuidle_select(struct cpuidle_driver *drv, struct cpuidle_device *dev)
>>  	if (unlikely(use_deepest_state))
>>  		return cpuidle_find_deepest_state(drv, dev);
>>  
>> -	return cpuidle_curr_governor->select(drv, dev);
>> +	return __cpuidle_select(drv, dev);
>>  }
>>  
>>  /**
>> @@ -190,6 +203,18 @@ int cpuidle_enter(struct cpuidle_driver *drv, struct cpuidle_device *dev,
>>  	return cpuidle_enter_state(dev, drv, index);
>>  }
>>  
>> +#ifdef CONFIG_SCHED_POWER
>> +static void __cpuidle_reflect(struct cpuidle_device *dev, int index)
>> +{
>> +	cpuidle_sched_reflect(dev, index);
>> +}
>> +#else
>> +static void __cpuidle_reflect(struct cpuidle_device *dev, int index)
>> +{
>> +	if (cpuidle_curr_governor->reflect && !unlikely(use_deepest_state))
>> +		cpuidle_curr_governor->reflect(dev, index);
>> +}
>> +#endif
>>  /**
>>   * cpuidle_reflect - tell the underlying governor what was the state
>>   * we were in
>> @@ -200,8 +225,7 @@ int cpuidle_enter(struct cpuidle_driver *drv, struct cpuidle_device *dev,
>>   */
>>  void cpuidle_reflect(struct cpuidle_device *dev, int index)
>>  {
>> -	if (cpuidle_curr_governor->reflect && !unlikely(use_deepest_state))
>> -		cpuidle_curr_governor->reflect(dev, index);
>> +	__cpuidle_reflect(dev, index);
>>  }
>>  
>>  /**
>> @@ -265,6 +289,28 @@ void cpuidle_resume(void)
>>  	mutex_unlock(&cpuidle_lock);
>>  }
>>  
>> +#ifdef CONFIG_SCHED_POWER
>> +static int cpuidle_check_governor(struct cpuidle_driver *drv,
>> +					struct cpuidle_device *dev, int enable)
>> +{
>> +	if (enable)
>> +		return cpuidle_sched_enable_device(drv, dev);
>> +	else
>> +		return 0;
>> +}
>> +#else
>> +static int cpuidle_check_governor(struct cpuidle_driver *drv,
>> +					struct cpuidle_device *dev, int enable)
>> +{
>> +	if (!cpuidle_curr_governor)
>> +		return -EIO;
>> +
>> +	if (enable && cpuidle_curr_governor->enable)
>> +		return cpuidle_curr_governor->enable(drv, dev);
>> +	else if (cpuidle_curr_governor->disable)
>> +		cpuidle_curr_governor->disable(drv, dev);
>> +
>> +	return 0;
>> +}
>> +#endif
>>  /**
>>   * cpuidle_enable_device - enables idle PM for a CPU
>>   * @dev: the CPU
>> @@ -285,7 +331,7 @@ int cpuidle_enable_device(struct cpuidle_device *dev)
>>  
>>  	drv = cpuidle_get_cpu_driver(dev);
>>  
>> -	if (!drv || !cpuidle_curr_governor)
>> +	if (!drv)
>>  		return -EIO;
>>  
>>  	if (!dev->registered)
>> @@ -298,8 +344,8 @@ int cpuidle_enable_device(struct cpuidle_device *dev)
>>  	if (ret)
>>  		return ret;
>>  
>> -	if (cpuidle_curr_governor->enable &&
>> -	    (ret = cpuidle_curr_governor->enable(drv, dev)))
>> +	ret = cpuidle_check_governor(drv, dev, 1);
>> +	if (ret)
>>  		goto fail_sysfs;
>>  
>>  	smp_wmb();
>> @@ -331,13 +377,12 @@ void cpuidle_disable_device(struct cpuidle_device *dev)
>>  	if (!dev || !dev->enabled)
>>  		return;
>>  
>> -	if (!drv || !cpuidle_curr_governor)
>> +	if (!drv)
>>  		return;
>> -
>> +
>>  	dev->enabled = 0;
>>  
>> -	if (cpuidle_curr_governor->disable)
>> -		cpuidle_curr_governor->disable(drv, dev);
>> +	cpuidle_check_governor(drv, dev, 0);
>>  
>>  	cpuidle_remove_device_sysfs(dev);
>>  	enabled_devices--;
>> diff --git a/include/linux/sched.h b/include/linux/sched.h
>> index 7c19d55..5dd99b5 100644
>> --- a/include/linux/sched.h
>> +++ b/include/linux/sched.h
>> @@ -26,6 +26,7 @@ struct sched_param {
>>  #include <linux/nodemask.h>
>>  #include <linux/mm_types.h>
>>  #include <linux/preempt_mask.h>
>> +#include <linux/cpuidle.h>
>>  
>>  #include <asm/page.h>
>>  #include <asm/ptrace.h>
>> @@ -846,6 +847,14 @@ enum cpu_idle_type {
>>  	CPU_MAX_IDLE_TYPES
>>  };
>>  
>> +#ifdef CONFIG_SCHED_POWER
>> +extern void cpuidle_sched_reflect(struct cpuidle_device *dev, int index);
>> +extern int cpuidle_sched_select(struct cpuidle_driver *drv,
>> +					struct cpuidle_device *dev);
>> +extern int cpuidle_sched_enable_device(struct cpuidle_driver *drv,
>> +						struct cpuidle_device *dev);
>> +#endif
>> +
>>  /*
>>   * Increase resolution of cpu_capacity calculations
>>   */
>> diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
>> index ab32b7b..5b8e469 100644
>> --- a/kernel/sched/Makefile
>> +++ b/kernel/sched/Makefile
>> @@ -19,3 +19,4 @@ obj-$(CONFIG_SCHED_AUTOGROUP) += auto_group.o
>>  obj-$(CONFIG_SCHEDSTATS) += stats.o
>>  obj-$(CONFIG_SCHED_DEBUG) += debug.o
>>  obj-$(CONFIG_CGROUP_CPUACCT) += cpuacct.o
>> +obj-$(CONFIG_SCHED_POWER) += power.o
>> diff --git a/kernel/sched/power.c b/kernel/sched/power.c
>> new file mode 100644
>> index 0000000..63c9276
>> --- /dev/null
>> +++ b/kernel/sched/power.c
>> @@ -0,0 +1,480 @@
>> +/*
>> + * power.c - the power aware scheduler
>> + *
>> + * Author:
>> + *        Preeti U. Murthy <preeti@...ux.vnet.ibm.com>
>> + *
>> + * This code is a replica of drivers/cpuidle/governors/menu.c
>> + * To make the transition to power aware scheduler away from
>> + * the cpuidle governor model easy, we do exactly what the
>> + * governors do for now. Going ahead the heuristics will be
>> + * tuned and improved upon.
>> + *
>> + * This code is licenced under the GPL version 2 as described
>> + * in the COPYING file that accompanies the Linux Kernel.
>> + */
>> +
>> +#include <linux/kernel.h>
>> +#include <linux/cpuidle.h>
>> +#include <linux/pm_qos.h>
>> +#include <linux/time.h>
>> +#include <linux/ktime.h>
>> +#include <linux/hrtimer.h>
>> +#include <linux/tick.h>
>> +#include <linux/sched.h>
>> +#include <linux/math64.h>
>> +#include <linux/module.h>
>> +
>> +/*
>> + * Please note when changing the tuning values:
>> + * If (MAX_INTERESTING-1) * RESOLUTION > UINT_MAX, the result of
>> + * a scaling operation multiplication may overflow on 32 bit platforms.
>> + * In that case, #define RESOLUTION as ULL to get 64 bit result:
>> + * #define RESOLUTION 1024ULL
>> + *
>> + * The default values do not overflow.
>> + */
>> +#define BUCKETS 12
>> +#define INTERVALS 8
>> +#define RESOLUTION 1024
>> +#define DECAY 8
>> +#define MAX_INTERESTING 50000
>> +
>> +
>> +/*
>> + * Concepts and ideas behind the power aware scheduler
>> + *
>> + * For the power aware scheduler, there are 3 decision factors for picking a C
>> + * state:
>> + * 1) Energy break even point
>> + * 2) Performance impact
>> + * 3) Latency tolerance (from pmqos infrastructure)
>> + * These three factors are treated independently.
>> + *
>> + * Energy break even point
>> + * -----------------------
>> + * C state entry and exit have an energy cost, and a certain amount of time in
>> + * the  C state is required to actually break even on this cost. CPUIDLE
>> + * provides us this duration in the "target_residency" field. So all that we
>> + * need is a good prediction of how long we'll be idle. Like the traditional
>> + * governors, we start with the actual known "next timer event" time.
>> + *
>> + * Since there are other sources of wakeups (interrupts for example) than
>> + * the next timer event, this estimation is rather optimistic. To get a
>> + * more realistic estimate, a correction factor is applied to the estimate,
>> + * that is based on historic behavior. For example, if in the past the actual
>> + * duration always was 50% of the next timer tick, the correction factor will
>> + * be 0.5.
>> + *
>> + * The power aware scheduler uses a running average for this correction factor;
>> + * however, it uses a set of factors, not just a single factor. This stems from
>> + * the realization that the ratio is dependent on the order of magnitude of the
>> + * expected duration; if we expect 500 milliseconds of idle time, the likelihood
>> + * of getting an interrupt very early is much higher than if we expect 50
>> + * microseconds of idle time. A second independent factor that has a big impact
>> + * on the actual factor is whether there is (disk) IO outstanding or not.
>> + * (as a special twist, we consider every sleep longer than 50 milliseconds
>> + * as perfect; there are no power gains for sleeping longer than this)
>> + *
>> + * For these two reasons we keep an array of 12 independent factors, that gets
>> + * indexed based on the magnitude of the expected duration as well as the
>> + * "is IO outstanding" property.
>> + *
>> + * Repeatable-interval-detector
>> + * ----------------------------
>> + * There are some cases where "next timer" is a completely unusable predictor:
>> + * Those cases where the interval is fixed, for example due to hardware
>> + * interrupt mitigation, but also due to fixed transfer rate devices such as
>> + * mice.
>> + * For this, we use a different predictor: We track the duration of the last 8
>> + * intervals and if the standard deviation of these 8 intervals is below a
>> + * threshold value, we use the average of these intervals as prediction.
>> + *
>> + * Limiting Performance Impact
>> + * ---------------------------
>> + * C states, especially those with large exit latencies, can have a real
>> + * noticeable impact on workloads, which is not acceptable for most sysadmins,
>> + * and in addition, less performance has a power price of its own.
>> + *
>> + * As a general rule of thumb, power aware sched assumes that the following
>> + * heuristic holds:
>> + *     The busier the system, the less impact of C states is acceptable
>> + *
>> + * This rule-of-thumb is implemented using a performance-multiplier:
>> + * If the exit latency times the performance multiplier is longer than
>> + * the predicted duration, the C state is not considered a candidate
>> + * for selection due to a too high performance impact. So the higher
>> + * this multiplier is, the longer we need to be idle to pick a deep C
>> + * state, and thus the less likely a busy CPU will hit such a deep
>> + * C state.
>> + *
>> + * Two factors are used in determining this multiplier:
>> + * A value of 10 is added for each point of "per cpu load average" we have.
>> + * A value of 5 points is added for each process that is waiting for
>> + * IO on this CPU.
>> + * (these values are experimentally determined)
>> + *
>> + * The load average factor gives a longer term (few seconds) input to the
>> + * decision, while the iowait value gives a cpu local instantaneous input.
>> + * The iowait factor may look low, but realize that this is also already
>> + * represented in the system load average.
>> + *
>> + */
>> +
>> +struct sched_cpuidle_info {
>> +	int		last_state_idx;
>> +	int             needs_update;
>> +
>> +	unsigned int	next_timer_us;
>> +	unsigned int	predicted_us;
>> +	unsigned int	bucket;
>> +	unsigned int	correction_factor[BUCKETS];
>> +	unsigned int	intervals[INTERVALS];
>> +	int		interval_ptr;
>> +};
>> +
>> +
>> +#define LOAD_INT(x) ((x) >> FSHIFT)
>> +#define LOAD_FRAC(x) LOAD_INT(((x) & (FIXED_1-1)) * 100)
>> +
>> +static int get_loadavg(void)
>> +{
>> +	unsigned long this = this_cpu_load();
>> +
>> +
>> +	return LOAD_INT(this) * 10 + LOAD_FRAC(this) / 10;
>> +}
>> +
>> +static inline int which_bucket(unsigned int duration)
>> +{
>> +	int bucket = 0;
>> +
>> +	/*
>> +	 * We keep two groups of stats; one with IO
>> +	 * pending, one without.
>> +	 * This allows us to calculate
>> +	 * E(duration)|iowait
>> +	 */
>> +	if (nr_iowait_cpu(smp_processor_id()))
>> +		bucket = BUCKETS/2;
>> +
>> +	if (duration < 10)
>> +		return bucket;
>> +	if (duration < 100)
>> +		return bucket + 1;
>> +	if (duration < 1000)
>> +		return bucket + 2;
>> +	if (duration < 10000)
>> +		return bucket + 3;
>> +	if (duration < 100000)
>> +		return bucket + 4;
>> +	return bucket + 5;
>> +}
>> +
>> +/*
>> + * Return a multiplier for the exit latency that is intended
>> + * to take performance requirements into account.
>> + * The more performance critical we estimate the system
>> + * to be, the higher this multiplier, and thus the higher
>> + * the barrier to go to an expensive C state.
>> + */
>> +static inline int performance_multiplier(void)
>> +{
>> +	int mult = 1;
>> +
>> +	/* for higher loadavg, we are more reluctant */
>> +
>> +	mult += 2 * get_loadavg();
>> +
>> +	/* for IO wait tasks (per cpu!) we add 5x each */
>> +	mult += 10 * nr_iowait_cpu(smp_processor_id());
>> +
>> +	return mult;
>> +}
>> +
>> +static DEFINE_PER_CPU(struct sched_cpuidle_info, cpuidle_info);
>> +
>> +static void cpuidle_sched_update(struct cpuidle_driver *drv,
>> +					struct cpuidle_device *dev);
>> +
>> +/* This implements DIV_ROUND_CLOSEST but avoids 64 bit division */
>> +static u64 div_round64(u64 dividend, u32 divisor)
>> +{
>> +	return div_u64(dividend + (divisor / 2), divisor);
>> +}
>> +
>> +/*
>> + * Try detecting repeating patterns by keeping track of the last 8
>> + * intervals, and checking if the standard deviation of that set
>> + * of points is below a threshold. If it is... then use the
>> + * average of these 8 points as the estimated value.
>> + */
>> +static void get_typical_interval(struct sched_cpuidle_info *data)
>> +{
>> +	int i, divisor;
>> +	unsigned int max, thresh;
>> +	uint64_t avg, stddev;
>> +
>> +	thresh = UINT_MAX; /* Discard outliers above this value */
>> +
>> +again:
>> +
>> +	/* First calculate the average of past intervals */
>> +	max = 0;
>> +	avg = 0;
>> +	divisor = 0;
>> +	for (i = 0; i < INTERVALS; i++) {
>> +		unsigned int value = data->intervals[i];
>> +		if (value <= thresh) {
>> +			avg += value;
>> +			divisor++;
>> +			if (value > max)
>> +				max = value;
>> +		}
>> +	}
>> +	do_div(avg, divisor);
>> +
>> +	/* Then try to determine standard deviation */
>> +	stddev = 0;
>> +	for (i = 0; i < INTERVALS; i++) {
>> +		unsigned int value = data->intervals[i];
>> +		if (value <= thresh) {
>> +			int64_t diff = value - avg;
>> +			stddev += diff * diff;
>> +		}
>> +	}
>> +	do_div(stddev, divisor);
>> +	/*
>> +	 * The typical interval is obtained when standard deviation is small
>> +	 * or standard deviation is small compared to the average interval.
>> +	 *
>> +	 * int_sqrt() formal parameter type is unsigned long. When the
>> +	 * greatest difference to an outlier exceeds ~65 ms * sqrt(divisor)
>> +	 * the resulting squared standard deviation exceeds the input domain
>> +	 * of int_sqrt on platforms where unsigned long is 32 bits in size.
>> +	 * In such case reject the candidate average.
>> +	 *
>> +	 * Use this result only if there is no timer to wake us up sooner.
>> +	 */
>> +	if (likely(stddev <= ULONG_MAX)) {
>> +		stddev = int_sqrt(stddev);
>> +		if (((avg > stddev * 6) && (divisor * 4 >= INTERVALS * 3))
>> +							|| stddev <= 20) {
>> +			if (data->next_timer_us > avg)
>> +				data->predicted_us = avg;
>> +			return;
>> +		}
>> +	}
>> +
>> +	/*
>> +	 * If we have outliers to the upside in our distribution, discard
>> +	 * those by setting the threshold to exclude these outliers, then
>> +	 * calculate the average and standard deviation again. Once we get
>> +	 * down to the bottom 3/4 of our samples, stop excluding samples.
>> +	 *
>> +	 * This can deal with workloads that have long pauses interspersed
>> +	 * with sporadic activity with a bunch of short pauses.
>> +	 */
>> +	if ((divisor * 4) <= INTERVALS * 3)
>> +		return;
>> +
>> +	thresh = max - 1;
>> +	goto again;
>> +}
>> +
>> +/**
>> + * cpuidle_sched_select - selects the next idle state to enter
>> + * @drv: cpuidle driver containing state data
>> + * @dev: the CPU
>> + */
>> +int cpuidle_sched_select(struct cpuidle_driver *drv,
>> +				struct cpuidle_device *dev)
>> +{
>> +	struct sched_cpuidle_info *data = &__get_cpu_var(cpuidle_info);
>> +	int latency_req = pm_qos_request(PM_QOS_CPU_DMA_LATENCY);
>> +	int i;
>> +	unsigned int interactivity_req;
>> +	struct timespec t;
>> +
>> +	if (data->needs_update) {
>> +		cpuidle_sched_update(drv, dev);
>> +		data->needs_update = 0;
>> +	}
>> +
>> +	data->last_state_idx = CPUIDLE_DRIVER_STATE_START - 1;
>> +
>> +	/* Special case when user has set very strict latency requirement */
>> +	if (unlikely(latency_req == 0))
>> +		return 0;
>> +
>> +	/* determine the expected residency time, round up */
>> +	t = ktime_to_timespec(tick_nohz_get_sleep_length());
>> +	data->next_timer_us =
>> +		t.tv_sec * USEC_PER_SEC + t.tv_nsec / NSEC_PER_USEC;
>> +
>> +
>> +	data->bucket = which_bucket(data->next_timer_us);
>> +
>> +	/*
>> +	 * Force the result of multiplication to be 64 bits even if both
>> +	 * operands are 32 bits.
>> +	 * Make sure to round up for half microseconds.
>> +	 */
>> +	data->predicted_us = div_round64((uint64_t)data->next_timer_us *
>> +					 data->correction_factor[data->bucket],
>> +					 RESOLUTION * DECAY);
>> +
>> +	get_typical_interval(data);
>> +
>> +	/*
>> +	 * Performance multiplier defines a minimum predicted idle
>> +	 * duration / latency ratio. Adjust the latency limit if
>> +	 * necessary.
>> +	 */
>> +	interactivity_req = data->predicted_us / performance_multiplier();
>> +	if (latency_req > interactivity_req)
>> +		latency_req = interactivity_req;
>> +
>> +	/*
>> +	 * We want to default to C1 (hlt), not to busy polling
>> +	 * unless the timer is happening really really soon.
>> +	 */
>> +	if (data->next_timer_us > 5 &&
>> +	    !drv->states[CPUIDLE_DRIVER_STATE_START].disabled &&
>> +		dev->states_usage[CPUIDLE_DRIVER_STATE_START].disable == 0)
>> +		data->last_state_idx = CPUIDLE_DRIVER_STATE_START;
>> +
>> +	/*
>> +	 * Find the idle state with the lowest power while satisfying
>> +	 * our constraints.
>> +	 */
>> +	for (i = CPUIDLE_DRIVER_STATE_START; i < drv->state_count; i++) {
>> +		struct cpuidle_state *s = &drv->states[i];
>> +		struct cpuidle_state_usage *su = &dev->states_usage[i];
>> +
>> +		if (s->disabled || su->disable)
>> +			continue;
>> +		if (s->target_residency > data->predicted_us)
>> +			continue;
>> +		if (s->exit_latency > latency_req)
>> +			continue;
>> +
>> +		data->last_state_idx = i;
>> +	}
>> +
>> +	return data->last_state_idx;
>> +}
>> +
>> +/**
>> + * cpuidle_sched_reflect - records that data structures need update
>> + * @dev: the CPU
>> + * @index: the index of actual entered state
>> + *
>> + * NOTE: it's important to be fast here because this operation will add to
>> + *       the overall exit latency.
>> + */
>> +void cpuidle_sched_reflect(struct cpuidle_device *dev, int index)
>> +{
>> +	struct sched_cpuidle_info *data = &__get_cpu_var(cpuidle_info);
>> +	data->last_state_idx = index;
>> +	if (index >= 0)
>> +		data->needs_update = 1;
>> +}
>> +
>> +/**
>> + * cpuidle_sched_update - attempts to guess what happened after entry
>> + * @drv: cpuidle driver containing state data
>> + * @dev: the CPU
>> + */
>> +static void cpuidle_sched_update(struct cpuidle_driver *drv, struct cpuidle_device *dev)
>> +{
>> +	struct sched_cpuidle_info *data = &__get_cpu_var(cpuidle_info);
>> +	int last_idx = data->last_state_idx;
>> +	struct cpuidle_state *target = &drv->states[last_idx];
>> +	unsigned int measured_us;
>> +	unsigned int new_factor;
>> +
>> +	/*
>> +	 * Try to figure out how much time passed between entry to low
>> +	 * power state and occurrence of the wakeup event.
>> +	 *
>> +	 * If the entered idle state didn't support residency measurements,
>> +	 * we are basically lost in the dark how much time passed.
>> +	 * As a compromise, assume we slept for the whole expected time.
>> +	 *
>> +	 * Any measured amount of time will include the exit latency.
>> +	 * Since we are interested in when the wakeup began, not when it
>> +	 * was completed, we must subtract the exit latency. However, if
>> +	 * the measured amount of time is less than the exit latency,
>> +	 * assume the state was never reached and the exit latency is 0.
>> +	 */
>> +	if (unlikely(!(target->flags & CPUIDLE_FLAG_TIME_VALID))) {
>> +		/* Use timer value as is */
>> +		measured_us = data->next_timer_us;
>> +
>> +	} else {
>> +		/* Use measured value */
>> +		measured_us = cpuidle_get_last_residency(dev);
>> +
>> +		/* Deduct exit latency */
>> +		if (measured_us > target->exit_latency)
>> +			measured_us -= target->exit_latency;
>> +
>> +		/* Make sure our coefficients do not exceed unity */
>> +		if (measured_us > data->next_timer_us)
>> +			measured_us = data->next_timer_us;
>> +	}
>> +
>> +	/* Update our correction ratio */
>> +	new_factor = data->correction_factor[data->bucket];
>> +	new_factor -= new_factor / DECAY;
>> +
>> +	if (data->next_timer_us > 0 && measured_us < MAX_INTERESTING)
>> +		new_factor += RESOLUTION * measured_us / data->next_timer_us;
>> +	else
>> +		/*
>> +		 * we were idle so long that we count it as a perfect
>> +		 * prediction
>> +		 */
>> +		new_factor += RESOLUTION;
>> +
>> +	/*
>> +	 * We don't want 0 as factor; we always want at least
>> +	 * a tiny bit of estimated time. Fortunately, due to rounding,
>> +	 * new_factor will stay nonzero regardless of measured_us values
>> +	 * and the compiler can eliminate this test as long as DECAY > 1.
>> +	 */
>> +	if (DECAY == 1 && unlikely(new_factor == 0))
>> +		new_factor = 1;
>> +
>> +	data->correction_factor[data->bucket] = new_factor;
>> +
>> +	/* update the repeating-pattern data */
>> +	data->intervals[data->interval_ptr++] = measured_us;
>> +	if (data->interval_ptr >= INTERVALS)
>> +		data->interval_ptr = 0;
>> +}
>> +
>> +/**
>> + * cpuidle_sched_enable_device - scans a CPU's states and does setup
>> + * @drv: cpuidle driver
>> + * @dev: the CPU
>> + */
>> +int cpuidle_sched_enable_device(struct cpuidle_driver *drv,
>> +				struct cpuidle_device *dev)
>> +{
>> +	struct sched_cpuidle_info *data = &per_cpu(cpuidle_info, dev->cpu);
>> +	int i;
>> +
>> +	memset(data, 0, sizeof(struct sched_cpuidle_info));
>> +
>> +	/*
>> +	 * If the correction factor is 0 (e.g. first time init or cpu hotplug
>> +	 * etc.), we actually want to start out with a unity factor.
>> +	 */
>> +	for (i = 0; i < BUCKETS; i++)
>> +		data->correction_factor[i] = RESOLUTION * DECAY;
>> +
>> +	return 0;
>> +}
>> +
>>
>>
> 
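For readers following the correction-factor arithmetic described in the
comment block of kernel/sched/power.c above, here is a minimal standalone
C sketch of the same fixed-point math. It is only an illustration, not
part of the patch: the helper names (predict_idle_us, update_factor) and
the sample numbers are made up, and the real code additionally applies
the typical-interval detector and the performance multiplier before
choosing a state.

    /* Illustrative sketch only -- mirrors the RESOLUTION/DECAY
     * fixed-point math of the patch with made-up values. */
    #include <stdint.h>
    #include <stdio.h>

    #define RESOLUTION 1024
    #define DECAY 8

    static unsigned int predict_idle_us(unsigned int next_timer_us,
                                        unsigned int factor)
    {
            /* predicted = next_timer_us * factor / (RESOLUTION * DECAY),
             * rounded to nearest, like div_round64() in the patch */
            uint64_t p = (uint64_t)next_timer_us * factor;
            return (unsigned int)((p + (RESOLUTION * DECAY) / 2) /
                                  (RESOLUTION * DECAY));
    }

    static unsigned int update_factor(unsigned int factor,
                                      unsigned int measured_us,
                                      unsigned int next_timer_us)
    {
            /* running average decaying by 1/DECAY per update, as in
             * the new_factor computation of cpuidle_sched_update() */
            factor -= factor / DECAY;
            factor += (unsigned int)((uint64_t)RESOLUTION * measured_us /
                                     next_timer_us);
            return factor;
    }

    int main(void)
    {
            unsigned int factor = RESOLUTION * DECAY;  /* unity to start */
            int i;

            /* timer always 1000us away, but we really wake after ~500us */
            for (i = 0; i < 5; i++) {
                    printf("prediction %uus (factor %u)\n",
                           predict_idle_us(1000, factor), factor);
                    factor = update_factor(factor, 500, 1000);
            }
            return 0;
    }

With the timer always 1000us away but actual wakeups at ~500us, the
factor decays from unity (8192) towards 4096, so the predicted idle
time converges to roughly 500us, which is the behaviour the comment
block describes for the per-bucket correction factors.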
