linux-kernel - Re: [PATCH v3 6/7] thermal/drivers/cpu_cooling: Introduce the cpu idle cooling driver

Message-ID: <02ec23c3-37ee-4e9f-56a4-453a30a29747@puri.sm>
Date:   Mon, 5 Aug 2019 08:53:39 +0200
From:   Martin Kepplinger <martin.kepplinger@...i.sm>
To:     daniel.lezcano@...aro.org, viresh.kumar@...aro.org,
        kevin.wangtao@...aro.org, leo.yan@...aro.org, edubezval@...il.com,
        vincent.guittot@...aro.org, javi.merino@...nel.org,
        rui.zhang@...el.com, daniel.thompson@...aro.org
Cc:     linux-pm@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH v3 6/7] thermal/drivers/cpu_cooling: Introduce the cpu
 idle cooling driver

On 05.08.19 07:11, Martin Kepplinger wrote:
> ---
> 
> On 05-04-18, 18:16, Daniel Lezcano wrote:
>> The cpu idle cooling driver performs synchronized idle injection across all
>> cpus belonging to the same cluster and offers a new method to cool down a SoC.
>>
>> Each cluster has its own idle cooling device, each core has its own idle
>> injection thread, each idle injection thread uses play_idle to enter idle.  In
>> order to reach the deepest idle state, each cooling device has the idle
>> injection threads synchronized together.
>>
>> It has some similarity with the intel power clamp driver but it is actually
>> designed to work on the ARM architecture via the DT with a mathematical proof
>> with the power model which comes with the Documentation.
>>
>> The idle injection cycle is fixed while the running cycle is variable. That
>> gives control over the device reactivity for the user experience. At
>> the mitigation point the idle threads are unparked, they play idle the
>> specified amount of time and they schedule themselves. The last thread sets
>> the next idle injection deadline and when the timer expires it wakes up all
>> the threads which in turn play idle again. Meanwhile the running cycle is
>> changed by set_cur_state.  When the mitigation ends, the threads are parked.
>> The algorithm is self adaptive, so there is no need to handle hotplugging.
>>
>> If we take an example of the balanced point, we can use the DT for the hi6220.
>>
>> The sustainable power for the SoC is 3326mW to mitigate at 75°C. Eight cores
>> running at full blast at the maximum OPP consumes 5280mW. The first value is
>> given in the DT, the second is calculated from the OPP with the formula:
>>
>>    Pdyn = Cdyn x Voltage^2 x Frequency
>>
>> As the SoC vendors don't want to share the static leakage values, we assume
>> it is zero, so the Prun = Pdyn + Pstatic = Pdyn + 0 = Pdyn.
>>
>> In order to reduce the power to 3326mW, we have to apply a ratio to the
>> running time.
>>
>>    ratio = (Prun - Ptarget) / Ptarget = (5280 - 3326) / 3326 = 0.5874
>>
>> We know the idle cycle, which is fixed; let's assume 10ms. However, from this
>> duration we have to subtract the wake up latency for the cluster idle state.
>> In our case, it is 1.5ms. So for a 10ms idle injection, we are really idle
>> for 8.5ms.
>>
>> As we know the idle duration and the ratio, we can compute the running cycle.
>>
>>    running_cycle = 8.5 / 0.5874 = 14.47ms
>>
>> So for 8.5ms of idle, we have 14.47ms of running cycle, and that brings the
>> SoC to the balanced trip point of 75°C.
>>
>> The driver has been tested on the hi6220 and it appears the temperature
>> stabilizes at 75°C with an idle injection time of 10ms (8.5ms real) and
>> running cycle of 14ms as expected by the theory above.
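
For reference, the arithmetic above can be checked with a small userspace C
sketch. The function names idle_ratio() and running_cycle_ms() are made up
for illustration and are not part of the patch; static power is assumed to
be zero, as in the commit message.

```c
/*
 * Userspace sketch of the hi6220 example from the commit message.
 * Hypothetical helpers, not kernel code.
 */

/* ratio = (Prun - Ptarget) / Ptarget, with powers in mW */
static double idle_ratio(double p_run_mw, double p_target_mw)
{
	return (p_run_mw - p_target_mw) / p_target_mw;
}

/*
 * running_cycle = effective idle time / ratio, where the effective
 * idle time is the injected idle duration minus the wake up latency.
 */
static double running_cycle_ms(double idle_ms, double latency_ms, double ratio)
{
	return (idle_ms - latency_ms) / ratio;
}
```

With Prun = 5280, Ptarget = 3326, a 10ms idle cycle and 1.5ms of latency,
this gives a ratio of about 0.5875 and a running cycle of about 14.47ms.
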
>>
>> Signed-off-by: Kevin Wangtao <kevin.wangtao@...aro.org>
>> Signed-off-by: Daniel Lezcano <daniel.lezcano@...aro.org>
>> ---
>>  drivers/thermal/Kconfig       |  10 +
>>  drivers/thermal/cpu_cooling.c | 479 ++++++++++++++++++++++++++++++++++++++++++
>>  include/linux/cpu_cooling.h   |   6 +
>>  3 files changed, 495 insertions(+)
>>
>> diff --git a/drivers/thermal/Kconfig b/drivers/thermal/Kconfig
>> index 5aaae1b..6c34117 100644
>> --- a/drivers/thermal/Kconfig
>> +++ b/drivers/thermal/Kconfig
>> @@ -166,6 +166,16 @@ config CPU_FREQ_THERMAL
>>  	  This will be useful for platforms using the generic thermal interface
>>  	  and not the ACPI interface.
>>  
>> +config CPU_IDLE_THERMAL
>> +       bool "CPU idle cooling strategy"
>> +       depends on CPU_IDLE
>> +       help
>> +	 This implements the generic CPU cooling mechanism through
>> +	 idle injection.  This will throttle the CPU by injecting
>> +	 fixed idle cycle.  All CPUs belonging to the same cluster
>> +	 will enter idle synchronously to reach the deepest idle
>> +	 state.
>> +
>>  endchoice
>>  
>>  config CLOCK_THERMAL
>> diff --git a/drivers/thermal/cpu_cooling.c b/drivers/thermal/cpu_cooling.c
>> index 5c219dc..1eec8d6 100644
>> --- a/drivers/thermal/cpu_cooling.c
>> +++ b/drivers/thermal/cpu_cooling.c
>> @@ -10,18 +10,33 @@
>>   *		Viresh Kumar <viresh.kumar@...aro.org>
>>   *
>>   */
>> +#define pr_fmt(fmt) "CPU cooling: " fmt
>> +
>>  #include <linux/module.h>
>>  #include <linux/thermal.h>
>>  #include <linux/cpufreq.h>
>> +#include <linux/cpuidle.h>
>>  #include <linux/err.h>
>> +#include <linux/freezer.h>
>>  #include <linux/idr.h>
>> +#include <linux/kthread.h>
>>  #include <linux/pm_opp.h>
>>  #include <linux/slab.h>
>> +#include <linux/sched/prio.h>
>> +#include <linux/sched/rt.h>
>> +#include <linux/smpboot.h>
>>  #include <linux/cpu.h>
>>  #include <linux/cpu_cooling.h>
>>  
>> +#include <linux/ratelimit.h>
>> +
>> +#include <linux/platform_device.h>
>> +#include <linux/of_platform.h>
>> +
>>  #include <trace/events/thermal.h>
>>  
>> +#include <uapi/linux/sched/types.h>
>> +
>>  #ifdef CONFIG_CPU_FREQ_THERMAL
>>  /*
>>   * Cooling state <-> CPUFreq frequency
>> @@ -928,3 +943,467 @@ void cpufreq_cooling_unregister(struct thermal_cooling_device *cdev)
>>  }
>>  EXPORT_SYMBOL_GPL(cpufreq_cooling_unregister);
>>  #endif /* CONFIG_CPU_FREQ_THERMAL */
>> +
>> +#ifdef CONFIG_CPU_IDLE_THERMAL
>> +/**
>> + * struct cpuidle_cooling_device - data for the idle cooling device
>> + * @cdev: a pointer to a struct thermal_cooling_device
>> + * @cpumask: a cpumask containing the CPU managed by the cooling device
>> + * @timer: a hrtimer giving the tempo for the idle injection cycles
>> + * @kref: a kernel refcount on this structure
>> + * @count: an atomic to keep track of the last task exiting the idle cycle
>> + * @idle_cycle: an integer defining the duration of the idle injection
>> + * @state: a normalized integer giving the state of the cooling device
>> + */
>> +struct cpuidle_cooling_device {
>> +	struct thermal_cooling_device *cdev;
>> +	struct cpumask *cpumask;
>> +	struct hrtimer timer;
>> +	struct kref kref;
>> +	atomic_t count;
>> +	unsigned int idle_cycle;
>> +	unsigned long state;
>> +};
>> +
>> +struct cpuidle_cooling_thread {
>> +	struct task_struct *tsk;
>> +	int should_run;
>> +};
>> +
>> +static DEFINE_PER_CPU(struct cpuidle_cooling_thread, cpuidle_cooling_thread);
>> +static DEFINE_PER_CPU(struct cpuidle_cooling_device *, cpuidle_cooling_device);
>> +
>> +/**
>> + * cpuidle_cooling_wakeup - Wake up all idle injection threads
>> + * @idle_cdev: the idle cooling device
>> + *
>> + * Every idle injection task belonging to the idle cooling device and
>> + * running on an online cpu will be woken up by this call.
>> + */
>> +static void cpuidle_cooling_wakeup(struct cpuidle_cooling_device *idle_cdev)
>> +{
>> +	struct cpuidle_cooling_thread *cct;
>> +	int cpu;
>> +
>> +	for_each_cpu_and(cpu, idle_cdev->cpumask, cpu_online_mask) {
>> +		cct = per_cpu_ptr(&cpuidle_cooling_thread, cpu);
>> +		cct->should_run = 1;
>> +		wake_up_process(cct->tsk);
>> +	}
>> +}
>> +
>> +/**
>> + * cpuidle_cooling_wakeup_fn - Running cycle timer callback
>> + * @timer: a hrtimer structure
>> + *
>> + * When the mitigation is acting, the CPU is allowed to run an amount
>> + * of time, then the idle injection happens for the specified delay
>> + * and the idle task injection schedules itself until the timer event
>> + * wakes the idle injection tasks again for a new idle injection
>> + * cycle. The time between the end of the idle injection and the timer
>> + * expiration is the allocated running time for the CPU.
>> + *
>> + * Always returns HRTIMER_NORESTART
>> + */
>> +static enum hrtimer_restart cpuidle_cooling_wakeup_fn(struct hrtimer *timer)
>> +{
>> +	struct cpuidle_cooling_device *idle_cdev =
>> +		container_of(timer, struct cpuidle_cooling_device, timer);
>> +
>> +	cpuidle_cooling_wakeup(idle_cdev);
>> +
>> +	return HRTIMER_NORESTART;
>> +}
>> +
>> +/**
>> + * cpuidle_cooling_runtime - Running time computation
>> + * @idle_cdev: the idle cooling device
>> + *
>> + * The running duration is computed from the idle injection duration
>> + * which is fixed. If we reach 100% of idle injection ratio, that
>> + * means the running duration is zero. If we have a 50% ratio
>> + * injection, that means we have equal duration for idle and for
>> + * running duration.
>> + *
>> + * The formula is deduced as the following:
>> + *
>> + *  running = idle x ((100 / ratio) - 1)
>> + *
>> + * For precision purpose for integer math, we use the following:
>> + *
>> + *  running = (idle x 100) / ratio - idle
>> + *
>> + * For example, if we have an injected duration of 50%, then we end up
>> + * with 10ms of idle injection and 10ms of running duration.
>> + *
>> + * Returns the running duration in nanoseconds, as an s64
>> + */
>> +static s64 cpuidle_cooling_runtime(struct cpuidle_cooling_device *idle_cdev)
>> +{
>> +	s64 next_wakeup;
>> +	unsigned long state = idle_cdev->state;
>> +
>> +	/*
>> +	 * The function should not be called when there is no
>> +	 * mitigation because:
>> +	 * - that does not make sense
>> +	 * - we end up with a division by zero
>> +	 */
>> +	if (!state)
>> +		return 0;
>> +
>> +	next_wakeup = (s64)((idle_cdev->idle_cycle * 100) / state) -
>> +		idle_cdev->idle_cycle;
>> +
>> +	return next_wakeup * NSEC_PER_USEC;
>> +}
>> +
> 
> There is a bug in your calculation here when "state" becomes 100: the
> function returns 0 for the running time, which is the same value it returns
> when "state" is 0. That is dangerous, because you stop cooling when it's
> most necessary :)
> 
> I'm not sure how much sense being 100% idle really makes, so when testing
> this I just clamp it: if (state == 100) { state = 99 }. Anyway, just don't
> return 0.
> 
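
To make the corner case concrete, here is a minimal userspace sketch of the
same integer formula with the clamp described above. running_time_us() is a
hypothetical stand-in for cpuidle_cooling_runtime(), it takes the idle cycle
in microseconds, and clamping to 99 is only my testing workaround, not
necessarily the right fix.

```c
#include <stdint.h>

/*
 * Hypothetical userspace model of the running time computation:
 *   running = (idle x 100) / state - idle
 * "state" is the idle injection ratio in percent.
 */
static int64_t running_time_us(unsigned int idle_cycle_us, unsigned long state)
{
	/* No mitigation: mirrors the early return in the patch. */
	if (state == 0)
		return 0;

	/*
	 * Assumption: clamp to 99 so that the highest cooling state
	 * does not also yield 0, which is indistinguishable from
	 * "no mitigation".
	 */
	if (state > 99)
		state = 99;

	return (int64_t)(((uint64_t)idle_cycle_us * 100) / state) -
		idle_cycle_us;
}
```

With a 10000us idle cycle, state 50 gives 10000us of running time, and
state 100 now gives 101us instead of 0.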

Oh, and this also breaks S3 suspend:

Aug  5 06:09:20 pureos kernel: [  807.487887] PM: suspend entry (deep)
Aug  5 06:09:40 pureos kernel: [  807.501148] Filesystems sync: 0.013 seconds
Aug  5 06:09:40 pureos kernel: [  807.501591] Freezing user space processes ... (elapsed 0.003 seconds) done.
Aug  5 06:09:40 pureos kernel: [  807.504741] OOM killer disabled.
Aug  5 06:09:40 pureos kernel: [  807.504744] Freezing remaining freezable tasks ...
Aug  5 06:09:40 pureos kernel: [  827.517712] Freezing of tasks failed after 20.002 seconds (4 tasks refusing to freeze, wq_busy=0):
Aug  5 06:09:40 pureos kernel: [  827.527122] thermal-idle/0  S    0 161      2 0x00000028
Aug  5 06:09:40 pureos kernel: [  827.527131] Call trace:
Aug  5 06:09:40 pureos kernel: [  827.527148]  __switch_to+0xb4/0x200
Aug  5 06:09:40 pureos kernel: [  827.527156]  __schedule+0x1e0/0x488
Aug  5 06:09:40 pureos kernel: [  827.527162]  schedule+0x38/0xc8
Aug  5 06:09:40 pureos kernel: [  827.527169]  smpboot_thread_fn+0x250/0x2a8
Aug  5 06:09:40 pureos kernel: [  827.527176]  kthread+0xf4/0x120
Aug  5 06:09:40 pureos kernel: [  827.527182]  ret_from_fork+0x10/0x18
Aug  5 06:09:40 pureos kernel: [  827.527186] thermal-idle/1  S    0 162      2 0x00000028
Aug  5 06:09:40 pureos kernel: [  827.527192] Call trace:
Aug  5 06:09:40 pureos kernel: [  827.527197]  __switch_to+0x188/0x200
Aug  5 06:09:40 pureos kernel: [  827.527203]  __schedule+0x1e0/0x488
Aug  5 06:09:40 pureos kernel: [  827.527208]  schedule+0x38/0xc8
Aug  5 06:09:40 pureos kernel: [  827.527213]  smpboot_thread_fn+0x250/0x2a8
Aug  5 06:09:40 pureos kernel: [  827.527218]  kthread+0xf4/0x120
Aug  5 06:09:40 pureos kernel: [  827.527222]  ret_from_fork+0x10/0x18
Aug  5 06:09:40 pureos kernel: [  827.527226] thermal-idle/2  S    0 163      2 0x00000028
Aug  5 06:09:40 pureos kernel: [  827.527231] Call trace:
Aug  5 06:09:40 pureos kernel: [  827.527237]  __switch_to+0xb4/0x200
Aug  5 06:09:40 pureos kernel: [  827.527242]  __schedule+0x1e0/0x488
Aug  5 06:09:40 pureos kernel: [  827.527247]  schedule+0x38/0xc8
Aug  5 06:09:40 pureos kernel: [  827.527259]  smpboot_thread_fn+0x250/0x2a8
Aug  5 06:09:40 pureos kernel: [  827.527264]  kthread+0xf4/0x120
Aug  5 06:09:40 pureos kernel: [  827.527268]  ret_from_fork+0x10/0x18
Aug  5 06:09:40 pureos kernel: [  827.527272] thermal-idle/3  S    0 164      2 0x00000028
Aug  5 06:09:40 pureos kernel: [  827.527278] Call trace:
Aug  5 06:09:40 pureos kernel: [  827.527283]  __switch_to+0xb4/0x200
Aug  5 06:09:40 pureos kernel: [  827.527288]  __schedule+0x1e0/0x488
Aug  5 06:09:40 pureos kernel: [  827.527293]  schedule+0x38/0xc8
Aug  5 06:09:40 pureos kernel: [  827.527298]  smpboot_thread_fn+0x250/0x2a8
Aug  5 06:09:40 pureos kernel: [  827.527303]  kthread+0xf4/0x120
Aug  5 06:09:40 pureos kernel: [  827.527308]  ret_from_fork+0x10/0x18
Aug  5 06:09:40 pureos kernel: [  827.527375] Restarting kernel threads ... done.
Aug  5 06:09:40 pureos kernel: [  827.527771] OOM killer enabled.
Aug  5 06:09:40 pureos kernel: [  827.527772] Restarting tasks ... done.
Aug  5 06:09:40 pureos kernel: [  827.528926] PM: suspend exit


Do you know where things might go wrong here?

thanks,

                            martin