[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <1872489.HpEenvS50o@vostro.rjw.lan>
Date: Sun, 03 Jan 2016 01:56:31 +0100
From: "Rafael J. Wysocki" <rjw@...ysocki.net>
To: Chen Yu <yu.c.chen@...el.com>
Cc: viresh.kumar@...aro.org, lenb@...nel.org,
srinivas.pandruvada@...ux.intel.com, rui.zhang@...el.com,
linux-pm@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH] cpufreq: governor: Fix negative idle_time when configured with CONFIG_HZ_PERIODIC
On Wednesday, December 16, 2015 12:20:29 PM Chen Yu wrote:
> It is reported that, with CONFIG_HZ_PERIODIC=y cpu stays at the
> lowest frequency even if the usage goes to 100%, neither ondemand
> nor conservative governor works, however performance and
> userspace work as expected. If set with CONFIG_NO_HZ_FULL=y,
> everything goes well.
>
> This problem is caused by improper calculation of the idle_time
> when the load is extremely high(near 100%). Firstly, cpufreq_governor
> uses get_cpu_idle_time to get the total idle time for specific cpu, then:
>
> 1.If the system is configured with CONFIG_NO_HZ_FULL, the idle time is
> returned by ktime_get, which is always increasing, it's OK.
> 2.However, if the system is configured with CONFIG_HZ_PERIODIC,
> get_cpu_idle_time might not guarantee to be always increasing,
> because it will leverage get_cpu_idle_time_jiffy to calculate the
> idle_time, consider the following scenario:
>
> At T1:
> idle_tick_1 = total_tick_1 - user_tick_1
>
> sample period(80ms)...
>
> At T2: ( T2 = T1 + 80ms):
> idle_tick_2 = total_tick_2 - user_tick_2
>
> Currently the algorithm is using (idle_tick_2 - idle_tick_1) to
> get the delta idle_time during the past sample period, however
> it CAN NOT guarantee that idle_tick_2 >= idle_tick_1, especially
> when cpu load is high.
> (Yes, total_tick_2 >= total_tick_1, and user_tick_2 >= user_tick_1,
> but how about idle_tick_2 and idle_tick_1? No guarantee.)
> So governor might get a negative value of idle_time during the past
> sample period, which might mislead the system that the idle time is
> very big(converted to unsigned int), and the busy time is nearly zero,
> which causes the governor to always choose the lowest cpufreq,
> then cause this problem.
>
> In theory there are two solutions:
>
> 1.The logic should not rely on the idle tick during every sample period,
> but be based on the busy tick directly, as this is how 'top' is
> implemented.
>
> 2.Or the logic must make sure that the idle_time is strictly increasing
> during each sample period, then there would be no negative idle_time
> anymore. This solution requires minimum modification to current code
> and this patch uses method 2.
>
> Link: https://bugzilla.kernel.org/show_bug.cgi?id=69821
> Reported-by: Jan Fikar <j.fikar@...il.com>
> Signed-off-by: Chen Yu <yu.c.chen@...el.com>
Applied, thanks!
> ---
> drivers/cpufreq/cpufreq_governor.c | 3 +++
> 1 file changed, 3 insertions(+)
>
> diff --git a/drivers/cpufreq/cpufreq_governor.c b/drivers/cpufreq/cpufreq_governor.c
> index b260576..363445f 100644
> --- a/drivers/cpufreq/cpufreq_governor.c
> +++ b/drivers/cpufreq/cpufreq_governor.c
> @@ -84,6 +84,9 @@ void dbs_check_cpu(struct dbs_data *dbs_data, int cpu)
> (cur_wall_time - j_cdbs->prev_cpu_wall);
> j_cdbs->prev_cpu_wall = cur_wall_time;
>
> + if (cur_idle_time < j_cdbs->prev_cpu_idle)
> + cur_idle_time = j_cdbs->prev_cpu_idle;
> +
> idle_time = (unsigned int)
> (cur_idle_time - j_cdbs->prev_cpu_idle);
> j_cdbs->prev_cpu_idle = cur_idle_time;
>
--
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Powered by blists - more mailing lists