linux-kernel - Re: [PATCH] cpuidle/drivers/mobile: Add new governor for mobile/embedded systems

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <6dd366d4-5625-254b-1933-17520d68aa79@linaro.org>
Date:   Mon, 8 Jul 2019 11:57:32 +0200
From:   Daniel Lezcano <daniel.lezcano@...aro.org>
To:     "Rafael J. Wysocki" <rjw@...ysocki.net>
Cc:     rafael@...nel.org, linux-kernel@...r.kernel.org,
        Thomas Gleixner <tglx@...utronix.de>,
        Greg Kroah-Hartman <gregkh@...uxfoundation.org>,
        "open list:CPU IDLE TIME MANAGEMENT FRAMEWORK" 
        <linux-pm@...r.kernel.org>
Subject: Re: [PATCH] cpuidle/drivers/mobile: Add new governor for
 mobile/embedded systems


Hi Rafael,

On 04/07/2019 12:14, Rafael J. Wysocki wrote:
> On Thursday, June 20, 2019 1:58:08 PM CEST Daniel Lezcano wrote:
>> The objective is the same for all the governors: save energy, but at
>> the end the governors menu, ladder and teo aim to improve the
>> performances with an acceptable energy drop for some workloads which
>> are identified for servers and desktops (with the help of a firmware).
>>
>> The ladder governor is designed for server with a periodic tick
>> configuration.
>>
>> The menu governor does not behave nicely with the mobile platform and
>> the energy saving for the multimedia workloads is worst than picking
>> up randomly an idle state.
>>
>> The teo governor acts efficiently, it promotes shallower state for
>> performances which is perfect for the servers / desktop but inadequate
>> for mobile because the energy consumed is too high.
>>
>> It is very difficult to do changes in these governors for embedded
>> systems without impacting performances on servers/desktops or ruin the
>> optimizations for the workloads on these platforms.
>>
>> The mobile governor is a new governor targeting embedded systems
>> running on battery where the energy saving has a higher priority than
>> servers or desktops. This governor aims to save energy as much as
>> possible but with a performance degradation tolerance.
>>
>> In this way, we can optimize the governor for specific mobile workload
>> and more generally embedded systems without impacting other platforms.
>>
>> The mobile governor is built on top of the paradigm 'separate the wake
>> up sources signals and analyze them'. Three categories of wake up
>> signals are identified:
>>  - deterministic : timers
>>  - predictable : most of the devices interrupt
>>  - unpredictable : IPI rescheduling, random signals
>>
>> The latter needs an iterative approach and the help of the scheduler
>> to give more input to the governor.
>>
>> The governor uses the irq timings where we predict the next interrupt
>> occurrences on the current CPU and the next timer. It is well suited
>> for mobile and more generally embedded systems where the interrupts
>> are usually pinned on one CPU and where the power is more important
>> than the performances.
>>
>> The multimedia applications on the embedded system spawn multiple
>> threads which are migrated across the different CPUs and waking
>> between them up. In order to catch this situation we have also to
>> track the idle task rescheduling duration with a relative degree of
>> confidence as the scheduler is involved in the task migrations. The
>> resched information is in the scope of the governor via the reflect
>> callback.
>>
>> The governor begins with a clean foundation basing the prediction on
>> the irq behavior returned by the irq timings, the timers and the idle
>> task rescheduling. The advantage of the approach is we have a full
>> view of the wakeup sources as we identify them separately and then we
>> can control the situation without relying on biased heuristics.
>>
>> This first iteration provides a basic prediction but improves on some
>> mobile platforms better energy for better performance for multimedia
>> workloads.
>>
>> The scheduling aspect will be optimized iteratively with non
>> regression testing for previous identified workloads on an Android
>> reference platform.
>>
>> Signed-off-by: Daniel Lezcano <daniel.lezcano@...aro.org>
> 
> Note that there are build issues reported by 0-day that need to be fixed.
> Also, IMO this really should be documented better in the tree, not just in the changelog.
> At least the use case to be covered by this governor should be clearly documented and
> it would be good to describe the algorithm.

Ok, I will add some documentation.

>> ---
>>  drivers/cpuidle/Kconfig            |  11 ++-
>>  drivers/cpuidle/governors/Makefile |   1 +
>>  drivers/cpuidle/governors/mobile.c | 151 +++++++++++++++++++++++++++++
>>  3 files changed, 162 insertions(+), 1 deletion(-)
>>  create mode 100644 drivers/cpuidle/governors/mobile.c
>>
>> diff --git a/drivers/cpuidle/Kconfig b/drivers/cpuidle/Kconfig
>> index a4ac31e4a58c..e2376d85e288 100644
>> --- a/drivers/cpuidle/Kconfig
>> +++ b/drivers/cpuidle/Kconfig
>> @@ -5,7 +5,7 @@ config CPU_IDLE
>>  	bool "CPU idle PM support"
>>  	default y if ACPI || PPC_PSERIES
>>  	select CPU_IDLE_GOV_LADDER if (!NO_HZ && !NO_HZ_IDLE)
>> -	select CPU_IDLE_GOV_MENU if (NO_HZ || NO_HZ_IDLE) && !CPU_IDLE_GOV_TEO
>> +	select CPU_IDLE_GOV_MENU if (NO_HZ || NO_HZ_IDLE) && !CPU_IDLE_GOV_TEO && !CPU_IDLE_GOV_MOBILE
>>  	help
>>  	  CPU idle is a generic framework for supporting software-controlled
>>  	  idle processor power management.  It includes modular cross-platform
>> @@ -33,6 +33,15 @@ config CPU_IDLE_GOV_TEO
>>  	  Some workloads benefit from using it and it generally should be safe
>>  	  to use.  Say Y here if you are not happy with the alternatives.
>>  
>> +config CPU_IDLE_GOV_MOBILE
>> +	bool "Mobile governor"
>> +	select IRQ_TIMINGS
>> +	help
>> +	  The mobile governor is based on irq timings measurements and
>> +	  pattern research combined with the next timer. This governor
>> +	  suits very well on embedded systems where the interrupts are
>> +	  grouped on a single core and the power is the priority.
>> +
>>  config DT_IDLE_STATES
>>  	bool
>>  
>> diff --git a/drivers/cpuidle/governors/Makefile b/drivers/cpuidle/governors/Makefile
>> index 42f44cc610dd..f09da7178670 100644
>> --- a/drivers/cpuidle/governors/Makefile
>> +++ b/drivers/cpuidle/governors/Makefile
>> @@ -6,3 +6,4 @@
>>  obj-$(CONFIG_CPU_IDLE_GOV_LADDER) += ladder.o
>>  obj-$(CONFIG_CPU_IDLE_GOV_MENU) += menu.o
>>  obj-$(CONFIG_CPU_IDLE_GOV_TEO) += teo.o
>> +obj-$(CONFIG_CPU_IDLE_GOV_MOBILE) += mobile.o
>> diff --git a/drivers/cpuidle/governors/mobile.c b/drivers/cpuidle/governors/mobile.c
>> new file mode 100644
>> index 000000000000..8fda0f9b960b
>> --- /dev/null
>> +++ b/drivers/cpuidle/governors/mobile.c
>> @@ -0,0 +1,151 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +/*
>> + * Copyright (C) 2019, Linaro Ltd
>> + * Author: Daniel Lezcano <daniel.lezcano@...aro.org>
>> + */
>> +#include <linux/cpuidle.h>
>> +#include <linux/kernel.h>
>> +#include <linux/sched.h>
>> +#include <linux/slab.h>
>> +#include <linux/tick.h>
>> +#include <linux/interrupt.h>
>> +#include <linux/sched/clock.h>
>> +
>> +struct mobile_device {
>> +	u64 idle_ema_avg;
>> +	u64 idle_total;
>> +	unsigned long last_jiffies;
>> +};
>> +
>> +#define EMA_ALPHA_VAL		64
>> +#define EMA_ALPHA_SHIFT		7
>> +#define MAX_RESCHED_INTERVAL_MS	100
>> +
>> +static DEFINE_PER_CPU(struct mobile_device, mobile_devices);
>> +
>> +static int mobile_ema_new(s64 value, s64 ema_old)
>> +{
>> +	if (likely(ema_old))
>> +		return ema_old + (((value - ema_old) * EMA_ALPHA_VAL) >>
>> +				  EMA_ALPHA_SHIFT);
>> +	return value;
>> +}
>> +
>> +static void mobile_reflect(struct cpuidle_device *dev, int index)
>> +{
>> +        struct mobile_device *mobile_dev = this_cpu_ptr(&mobile_devices);
>> +	struct cpuidle_driver *drv = cpuidle_get_cpu_driver(dev);
>> +	struct cpuidle_state *s = &drv->states[index];
>> +	int residency;
>> +
>> +	/*
>> +	 * The idle task was not rescheduled since
>> +	 * MAX_RESCHED_INTERVAL_MS, let's consider the duration is
>> +	 * long enough to clear our stats.
>> +	 */
>> +	if (time_after(jiffies, mobile_dev->last_jiffies +
>> +		       msecs_to_jiffies(MAX_RESCHED_INTERVAL_MS)))
>> +		mobile_dev->idle_ema_avg = 0;
> 
> Why jiffies?  Any particular reason?

I used jiffies to not use the local_clock() call for the overhead. I
agree the resolution could be too low. Perhaps it makes more sense to
move idle start and idle end variable from the cpuidle_enter function to
the cpuidle device structure, so the information can be reused by the
subsequent users.

>> +
>> +	/*
>> +	 * Sum all the residencies in order to compute the total
>> +	 * duration of the idle task.
>> +	 */
>> +	residency = dev->last_residency - s->exit_latency;
>> +	if (residency > 0)
>> +		mobile_dev->idle_total += residency;
>> +
>> +	/*
>> +	 * We exited the idle state with the need_resched() flag, the
>> +	 * idle task will be rescheduled, so store the duration the
>> +	 * idle task was scheduled in an exponential moving average and
>> +	 * reset the total of the idle duration.
>> +	 */
>> +	if (need_resched()) {
>> +		mobile_dev->idle_ema_avg = mobile_ema_new(mobile_dev->idle_total,
>> +						      mobile_dev->idle_ema_avg);
>> +		mobile_dev->idle_total = 0;
>> +		mobile_dev->last_jiffies = jiffies;
>> +	}
>> +}
>> +
>> +static int mobile_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
>> +		       bool *stop_tick)
>> +{
>> +	struct mobile_device *mobile_dev = this_cpu_ptr(&mobile_devices);
>> +	int latency_req = cpuidle_governor_latency_req(dev->cpu);
>> +	int i, index = 0;
>> +	ktime_t delta_next;
>> +	u64 now, irq_length, timer_length;
>> +	u64 idle_duration_us;
>> +
>> +	/*
>> +	 * Get the present time as reference for the next steps
>> +	 */
>> +	now = local_clock();
>> +
>> +	/*
>> +	 * Get the next interrupt event giving the 'now' as a
>> +	 * reference, if the next event appears to have already
>> +	 * expired then we get the 'now' returned which ends up with a
>> +	 * zero duration.
>> +	 */
>> +	irq_length = irq_timings_next_event(now) - now;
>> +
>> +	/*
>> +	 * Get the timer duration before expiration.
>> +	 */
> 
> This comment is rather redundant and the one below too. :-)

Right.

>> +	timer_length = ktime_to_ns(tick_nohz_get_sleep_length(&delta_next));
>> +
>> +	/*
>> +	 * Get the smallest duration between the timer and the irq next event.
>> +	 */
>> +	idle_duration_us = min_t(u64, irq_length, timer_length) / NSEC_PER_USEC;
>> +
>> +	/*
>> +	 * Get the idle task duration average if the information is
>> +	 * available.
> 
> IMO it would be good to explain this step in more detail, especially the purpose of it.

Ok.

>> +	 */
>> +	if (mobile_dev->idle_ema_avg)
>> +		idle_duration_us = min_t(u64, idle_duration_us,
>> +					 mobile_dev->idle_ema_avg);
>> +
>> +	for (i = 0; i < drv->state_count; i++) {
>> +		struct cpuidle_state *s = &drv->states[i];
>> +		struct cpuidle_state_usage *su = &dev->states_usage[i];
>> +
>> +		if (s->disabled || su->disable)
>> +			continue;
>> +
>> +		if (s->exit_latency > latency_req)
>> +			break;
>> +
>> +		if (idle_duration_us > s->exit_latency)
>> +			idle_duration_us = idle_duration_us - s->exit_latency;
> 
> Why do you want this?
> 
> It only causes you to miss an opportunity to select a deeper state sometimes,
> so what's the reason?

On the mobile platform the exit latencies are very high (with an order
of magnitude of several milliseconds) for a very limited number of idle
states. The real idle duration must be determined to compare to the
target residency. Without this test, the governor is constantly choosing
wrongly a deep idle state.

> Moreover, I don't think you should update idle_duration_us here, as the updated
> value will go to the next step if the check below doesn't trigger.

Right, I spotted it also and fixed with:

+               if (s->exit_latency >= idle_duration_us)
+                       break;

+               if (s->target_residency > (idle_duration_us -
s->exit_latency))
                        break;

>> +
>> +		if (s->target_residency > idle_duration_us)
>> +			break;
>> +
>> +		index = i;
>> +	}
>> +
>> +	if (!index)
>> +		*stop_tick = false;
> 
> Well, this means that the tick is stopped for all idle states deeper than state 0.
> 
> If there are any states between state 0 and the deepest one and they are below
> the tick boundary, you may very well suffer the "powernightmares" problem
> because of this.

What would you suggest?

if (!index || ((idle_duration_us < TICK_USEC) &&
		!tick_nohz_tick_stopped()))
	*stop_tick = false;

?

There are too few idle states to restart a selection at this point, so
preventing stopping the tick is enough at this point IMO.


>> +	return index;
>> +}
>> +
>> +static struct cpuidle_governor mobile_governor = {
>> +	.name =		"mobile",
>> +	.rating =	20,
>> +	.select =	mobile_select,
>> +	.reflect =	mobile_reflect,
>> +};
>> +
>> +static int __init init_governor(void)
>> +{
>> +	irq_timings_enable();
>> +	return cpuidle_register_governor(&mobile_governor);
>> +}
>> +
>> +postcore_initcall(init_governor);
>>
> 
> 
> 
> 


-- 
 <http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs

Follow Linaro:  <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog