linux-kernel - Re: [RFC 1/1] Weighted approach to gather and use history in TEO governor

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20200225061857.GH12846@in.ibm.com>
Date:   Tue, 25 Feb 2020 11:48:57 +0530
From:   Gautham R Shenoy <ego@...ux.vnet.ibm.com>
To:     Pratik Rajesh Sampat <psampat@...ux.ibm.com>
Cc:     linux-kernel@...r.kernel.org, rafael.j.wysocki@...el.com,
        peterz@...radead.org, dsmythies@...us.net,
        daniel.lezcano@...aro.org, ego@...ux.vnet.ibm.com,
        svaidy@...ux.ibm.com, pratik.sampat@...ibm.com,
        pratik.r.sampat@...il.com
Subject: Re: [RFC 1/1] Weighted approach to gather and use history in TEO
 governor

On Sat, Feb 22, 2020 at 12:30:02PM +0530, Pratik Rajesh Sampat wrote:
> Complementing the current self correcting window algorithm, an alternate
> approach is devised to leverage lifetime history with constant overhead
> 
> Each CPU maintains a matrix wherein each idle state maintains a
> probability distribution.
> 
> The probability distribution is nothing but a n*n matrix, where
> n = drv->state_count.
> Each entry in the array signifies a weight for that row.
> The weights can vary from the range [0-10000].
> 
> For example:
> state_mat[1][2] = 3000 means that previously when state 1 was selected,
> the probability that state 2 will occur is 30%.
> The trailing zeros correspond to having more resolution while increasing
> or reducing the weights for correction.
> 
> Initially the weights are distributed in a way such that the index of
> that state in question has a higher probability of choosing itself, as
> we have no reason to believe otherwise yet. Initial bias to itself is
> 70% and the rest 30% is equally distributed to the rest of the states.
> 
> Selection of an idle state:
> When the TEO governor chooses an idle state, the probability
> distribution for that state is looked at. A weighted random number
> generator is used using the weights as bias to choose the next idle
> state. The algorithm leans to choose that or a shallower state than that
> for its next prediction
> 
> Correction of the probability distribution:
> On wakeup, the weights are updated. The state which it should have woken
> up with (could be the hit / miss / early hit state) is increased in
> weight by the "LEARNING_RATE" % and the rest of the states for that
> index are reduced by the same factor.
> The LEARNING RATE is experimentally chosen to be 5 %


I know this is an RFC patch, not meant for inclusion, but it is good
practice to have your Signed-off-by.


> ---
>  drivers/cpuidle/governors/teo.c | 95 +++++++++++++++++++++++++++++++--
>  1 file changed, 90 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/cpuidle/governors/teo.c b/drivers/cpuidle/governors/teo.c
> index de7e706efd46..8060c287f5e4 100644
> --- a/drivers/cpuidle/governors/teo.c
> +++ b/drivers/cpuidle/governors/teo.c
> @@ -50,6 +50,7 @@
>  #include <linux/kernel.h>
>  #include <linux/sched/clock.h>
>  #include <linux/tick.h>
> +#include <linux/random.h>
> 
>  /*
>   * The PULSE value is added to metrics when they grow and the DECAY_SHIFT value
> @@ -64,6 +65,12 @@
>   */
>  #define INTERVALS	8
> 
> +/*
> + * Percentage of the amount of weight to be shifted in the idle state weight
> + * distribution for correction
> + */
> +#define LEARNING_RATE	5
> +
>  /**
>   * struct teo_idle_state - Idle state data used by the TEO cpuidle governor.
>   * @early_hits: "Early" CPU wakeups "matching" this state.
> @@ -98,6 +105,8 @@ struct teo_idle_state {
>   * @states: Idle states data corresponding to this CPU.
>   * @interval_idx: Index of the most recent saved idle interval.
>   * @intervals: Saved idle duration values.
> + * @state_mat: Each idle state maintains a weights corresponding to that
> + * state, storing the probablity distribution of occurance for that state
>   */
>  struct teo_cpu {
>  	u64 time_span_ns;
> @@ -105,6 +114,7 @@ struct teo_cpu {
>  	struct teo_idle_state states[CPUIDLE_STATE_MAX];
>  	int interval_idx;
>  	u64 intervals[INTERVALS];
> +	int state_mat[CPUIDLE_STATE_MAX][CPUIDLE_STATE_MAX];
>  };
> 
>  static DEFINE_PER_CPU(struct teo_cpu, teo_cpus);
> @@ -117,7 +127,7 @@ static DEFINE_PER_CPU(struct teo_cpu, teo_cpus);
>  static void teo_update(struct cpuidle_driver *drv, struct cpuidle_device *dev)
>  {
>  	struct teo_cpu *cpu_data = per_cpu_ptr(&teo_cpus, dev->cpu);
> -	int i, idx_hit = -1, idx_timer = -1;
> +	int i, idx_hit = -1, idx_timer = -1, idx = -1, last_idx = dev->last_state_idx;
>  	u64 measured_ns;
> 
>  	if (cpu_data->time_span_ns >= cpu_data->sleep_length_ns) {
> @@ -183,16 +193,50 @@ static void teo_update(struct cpuidle_driver *drv, struct cpuidle_device *dev)
> 
>  		if (idx_timer > idx_hit) {
>  			misses += PULSE;
> -			if (idx_hit >= 0)
> +			idx = idx_timer;
> +			if (idx_hit >= 0) {
>  				cpu_data->states[idx_hit].early_hits += PULSE;
> +				idx = idx_hit;
> +			}
>  		} else {
>  			hits += PULSE;
> +			idx = last_idx;
>  		}
> 
>  		cpu_data->states[idx_timer].misses = misses;
>  		cpu_data->states[idx_timer].hits = hits;
>  	}
> 
> +	/*
> +	 * Rearrange the weight distribution of the state, increase the weight
> +	 * by the LEARNING RATE % for the idle state that was supposed to be
> +	 * chosen and reduce by the same amount for rest of the states
> +	 *
> +	 * If the weights are greater than (100 - LEARNING_RATE) % or lesser
> +	 * than LEARNING_RATE %, do not increase or decrease the confidence
> +	 * respectively
> +	 */
> +	for (i = 0; i < drv->state_count; i++) {
> +		unsigned int delta;
> +
> +		if (idx == -1)
> +			break;
> +		if (i ==  idx) {
> +			delta = (LEARNING_RATE * cpu_data->state_mat[last_idx][i]) / 100;
> +			if (cpu_data->state_mat[last_idx][i] + delta >=
> +			    (100 - LEARNING_RATE) * 100)
> +				continue;
> +			cpu_data->state_mat[last_idx][i] += delta;
> +			continue;
> +		}
> +		delta = (LEARNING_RATE * cpu_data->state_mat[last_idx][i]) /
> +			((drv->state_count - 1) * 100);


What happens when drv->state_count == 1?

> +		if (cpu_data->state_mat[last_idx][i] - delta <=
> +		    LEARNING_RATE * 100)
> +			continue;
> +		cpu_data->state_mat[last_idx][i] -= delta;

So, even update takes O(n) time, since we have to increase the weight
for one state, and correspondingly decrease the state for all the
other states.


> +	}
> +
>  	/*
>  	 * Save idle duration values corresponding to non-timer wakeups for
>  	 * pattern detection.
> @@ -244,7 +288,7 @@ static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
>  	s64 latency_req = cpuidle_governor_latency_req(dev->cpu);
>  	u64 duration_ns;
>  	unsigned int hits, misses, early_hits;
> -	int max_early_idx, prev_max_early_idx, constraint_idx, idx, i;
> +	int max_early_idx, prev_max_early_idx, constraint_idx, idx, i, og_idx;
>  	ktime_t delta_tick;
> 
>  	if (dev->last_state_idx >= 0) {
> @@ -374,10 +418,13 @@ static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
>  	if (constraint_idx < idx)
>  		idx = constraint_idx;
> 
> +	og_idx = idx;
> +
>  	if (idx < 0) {
>  		idx = 0; /* No states enabled. Must use 0. */
>  	} else if (idx > 0) {
> -		unsigned int count = 0;
> +		unsigned int count = 0, sum_weights = 0, weights_list[CPUIDLE_STATE_MAX];
> +		int i, j = 0, rnd_wt, rnd_num = 0;
>  		u64 sum = 0;
> 
>  		/*
> @@ -412,6 +459,29 @@ static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
>  								       idx, avg_ns);
>  			}
>  		}
> +		/*
> +		 * In case, the recent history yields a shallower state, then
> +		 * the probability distribution is looked at.
> +		 * The weighted random number generator uses the weights as a
> +		 * bias to choose the next idle state
> +		 */
> +		if (og_idx != idx) {
> +			for (i = 0; i <= idx; i++) {


So it seems like we are restricting our choice to states no deeper
than the selected state.

Is it not possible that cpu_data->state_mat[idx][j] has the
maximum weight when j > idx ? If yes, why are we leaving those states
out ?

> +				if (dev->states_usage[i].disable)
> +					continue;
> +				sum_weights += cpu_data->state_mat[idx][i];
> +				weights_list[j++] = sum_weights;
> +			}

Assume that cpu_data->stat_mat[idx] = {4, 5, 6, 9}
weight_list[] = {4, 9, 15, 24}

> +			get_random_bytes(&rnd_num, sizeof(rnd_num));
> +			rnd_num = rnd_num % 100;

Assume rnd_num = 50.
> +			rnd_wt = (rnd_num * sum_weights) / 100;


Then, rnd_wt = 12.

>From the logic, below, it appears that you want to pick the shallowest
state i at which rnd_wt < weights_list[i]. In which case it would be
the state corresponding to the weight 6 (as the cumulative weight at
that point is 15).


> +			for (i = 0; i < j; i++) {
> +				if (rnd_wt < weights_list[i])
> +					break;
> +				rnd_wt -= weights_list[i];


And yet, because of this additional subtraction, after the first
iteration of this loop, rnd_wt = 12 - 4 = 8, which means that you will
end up picking the state corresponding to weight 5 (cumulative weight
being 9 at this point).

This doesn't seem right.

> +			}
> +			idx = i;
> +		}
>  	}
> 
>  	/*
> @@ -468,13 +538,28 @@ static int teo_enable_device(struct cpuidle_driver *drv,
>  			     struct cpuidle_device *dev)
>  {
>  	struct teo_cpu *cpu_data = per_cpu_ptr(&teo_cpus, dev->cpu);
> -	int i;
> +	int i, j;
> 
>  	memset(cpu_data, 0, sizeof(*cpu_data));
> 
>  	for (i = 0; i < INTERVALS; i++)
>  		cpu_data->intervals[i] = U64_MAX;
> 
> +	/*
> +	 * Populate initial weights for each state
> +	 * The stop state is initially more biased for itself.
> +	 *
> +	 * Currently the initial distribution of probabilities are 70%-30%.
> +	 * The trailing 0s are for increased resolution.
> +	 */
> +	for (i = 0; i < drv->state_count; i++) {
> +		for (j = 0; j < drv->state_count; j++) {
> +			if (i == j)
> +				cpu_data->state_mat[i][j] = 7000;
> +			else
> +				cpu_data->state_mat[i][j] = 3000 / (drv->state_count - 1);


> +		}
> +	}
>  	return 0;
>  }
> 
> -- 
> 2.17.1
>