linux-kernel - Re: [PATCH][v3] cpufreq: governor: Fix overflow when calculating idle time

Message-ID: <CAJZ5v0g1Ymh72-m2DcNN0KVAh=d-u2tt3d7ikb-X+ci=KcMVvQ@mail.gmail.com>
Date:	Wed, 20 Apr 2016 15:52:52 +0200
From:	"Rafael J. Wysocki" <rafael@...nel.org>
To:	"Rafael J. Wysocki" <rjw@...ysocki.net>
Cc:	Chen Yu <yu.c.chen@...el.com>,
	"linux-pm@...r.kernel.org" <linux-pm@...r.kernel.org>,
	Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
	"Rafael J. Wysocki" <rafael@...nel.org>,
	Viresh Kumar <viresh.kumar@...aro.org>,
	Len Brown <lenb@...nel.org>
Subject: Re: [PATCH][v3] cpufreq: governor: Fix overflow when calculating idle time

On Wed, Apr 20, 2016 at 3:26 AM, Rafael J. Wysocki <rjw@...ysocki.net> wrote:
> On Tuesday, April 19, 2016 11:57:32 AM Chen Yu wrote:
>> It was reported that after Commit 0df35026c6a5 ("cpufreq: governor:
>> Fix negative idle_time when configured with CONFIG_HZ_PERIODIC"),
>> the cpufreq ondemand governor started to act oddly. Without any load,
>> on a freshly booted system, it pumped the CPU frequency up to maximum
>> at some point and stayed there. The problem is caused by
>> jiffies overflow in get_cpu_idle_time:
>>
>> Five minutes after boot, jiffies wraps around to zero.
>> As a result, the following condition in the cpufreq governor will
>> always be true:
>>       if (cur_idle_time <= j_cdbs->prev_cpu_idle)
>>               idle_time = 0;
>>
>> which caused problems.
>>
>> For example, once cur_idle_time has wrapped around to zero while
>> prev_cpu_idle remains negative (because jiffies has an initial value
>> of -300*HZ, which is very large once converted to unsigned), the above
>> condition is met and we get zero idle time for this sample. That
>> yields a high busy time, so the governor always requests the
>> highest frequency.
>>
>> This patch fixes the problem by updating prev_cpu_idle in
>> every sample period, even if prev_cpu_idle is bigger than
>> cur_idle_time, so that the 'prev_cpu_idle always bigger than
>> cur_idle_time' scenario cannot persist.
>>
>> Link: https://bugzilla.kernel.org/show_bug.cgi?id=115261
>> Reported-by: Timo Valtoaho <timo.valtoaho@...il.com>
>> Signed-off-by: Chen Yu <yu.c.chen@...el.com>
>
> This looks better than the previous versions to me, but ->
>
>> ---
>> v3:
>>  - Do not use INITIAL_JIFFIES because it should be transparent
>>    to users; meanwhile keep the original semantics of using the
>>    delta of the time slice.
>> ---
>> v2:
>>  - Send this patch to a wider scope, including timing-system maintainers,
>>    as well as some modifications in the commit message to make it more clear.
>> ---
>>  drivers/cpufreq/cpufreq.c          | 4 ++++
>>  drivers/cpufreq/cpufreq_governor.c | 8 +++++++-
>>  2 files changed, 11 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
>> index b87596b..b0479b3 100644
>> --- a/drivers/cpufreq/cpufreq.c
>> +++ b/drivers/cpufreq/cpufreq.c
>> @@ -132,6 +132,10 @@ struct cpufreq_frequency_table *cpufreq_frequency_get_table(unsigned int cpu)
>>  }
>>  EXPORT_SYMBOL_GPL(cpufreq_frequency_get_table);
>>
>> +/**
>> + * The wall time and idle time are both possible to round up,
>
> That's difficult to parse.  I guess you wanted to say that they may overflow?
>
>> + * people should use delta rather than the value itself.
>> + */
>
> -> this new comment doesn't really belong to the fix.  You can send a separate
> patch adding it.
>
> Moreover -->
>
>>  static inline u64 get_cpu_idle_time_jiffy(unsigned int cpu, u64 *wall)
>>  {
>>       u64 idle_time;
>> diff --git a/drivers/cpufreq/cpufreq_governor.c b/drivers/cpufreq/cpufreq_governor.c
>> index 10a5cfe..8de3fba 100644
>> --- a/drivers/cpufreq/cpufreq_governor.c
>> +++ b/drivers/cpufreq/cpufreq_governor.c
>> @@ -197,8 +197,14 @@ unsigned int dbs_update(struct cpufreq_policy *policy)
>>                       idle_time = 0;
>>               } else {
>>                       idle_time = cur_idle_time - j_cdbs->prev_cpu_idle;
>> -                     j_cdbs->prev_cpu_idle = cur_idle_time;
>>               }
>> +             /*
>> +              * It is possible prev_cpu_idle being bigger than cur_idle_time,
>> +              * when 32bit rounds up if !CONFIG_VIRT_CPU_ACCOUNTING,
>> +              * thus get a 0% idle estimation. So update prev_cpu_idle during
>> +              * each sample period to avoid this situation lasting too long.
>> +              */
>> +             j_cdbs->prev_cpu_idle = cur_idle_time;
>
> --> it looks like the bug is that we are comparing signed values as unsigned.
>
>>
>>               if (ignore_nice) {
>>                       u64 cur_nice = kcpustat_cpu(j).cpustat[CPUTIME_NICE];
>>
>
> So what about the simple change below?

Well, that suggestion doesn't make sense, sorry about the confusion.

> ---
>  drivers/cpufreq/cpufreq_governor.c |    2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> Index: linux-pm/drivers/cpufreq/cpufreq_governor.c
> ===================================================================
> --- linux-pm.orig/drivers/cpufreq/cpufreq_governor.c
> +++ linux-pm/drivers/cpufreq/cpufreq_governor.c
> @@ -146,7 +146,7 @@ unsigned int dbs_update(struct cpufreq_p
>                 wall_time = cur_wall_time - j_cdbs->prev_cpu_wall;
>                 j_cdbs->prev_cpu_wall = cur_wall_time;
>
> -               if (cur_idle_time <= j_cdbs->prev_cpu_idle) {
> +               if ((s64)cur_idle_time <= (s64)j_cdbs->prev_cpu_idle) {
>                         idle_time = 0;
>                 } else {
>                         idle_time = cur_idle_time - j_cdbs->prev_cpu_idle;
>
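
For illustration only, here is a minimal userspace sketch of the failure
mode described in the quoted changelog. It is not kernel code. It assumes
the idle counter behaves like a 32-bit value that starts at -300*HZ and is
zero-extended into a u64; HZ = 250, the one-second sample period, and the
helper name count_zero_idle_samples() are all made up for the example.

```c
#include <stdint.h>

#define HZ 250	/* assumed tick rate, for illustration only */

/*
 * Simulate 1000 one-second samples of a fully idle CPU and return how
 * many of them end up with idle_time forced to 0.
 *
 * always_update_prev == 0: prev is updated only in the "else" branch,
 *                          like the code before the v3 patch.
 * always_update_prev != 0: prev is updated on every sample, like the
 *                          v3 patch.
 */
static unsigned int count_zero_idle_samples(int always_update_prev)
{
	/* 32-bit counter starting 300 seconds before wraparound. */
	uint32_t counter = (uint32_t)(-300 * HZ);
	uint64_t prev = counter;	/* zero-extended, so very large */
	unsigned int zero_samples = 0;
	int i;

	for (i = 0; i < 1000; i++) {
		uint64_t cur;

		counter += HZ;		/* wraps to 0 on the 300th sample */
		cur = counter;

		if (cur <= prev) {
			zero_samples++;	/* idle_time = 0 for this sample */
			if (always_update_prev)
				prev = cur;
		} else {
			prev = cur;
		}
	}

	return zero_samples;
}
```

Under these assumptions, count_zero_idle_samples(0) reports every sample
from the wraparound onwards as zero idle time, because prev keeps its
pre-wrap value forever, while count_zero_idle_samples(1) loses only the
single sample in which the wraparound happens.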