linux-kernel - Re: [PATCH V2] idle/intel_powerclamp: Redesign idle injection to use bandwidth control mechanism

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:	Mon, 09 Feb 2015 17:21:34 +0530
From:	Preeti U Murthy <preeti@...ux.vnet.ibm.com>
To:	Steven Noonan <steven@...inklabs.net>,
	Peter Zijlstra <peterz@...radead.org>,
	Thomas Gleixner <tglx@...utronix.de>, jacob.jun.pan@...el.com
CC:	arjan@...ux.intel.com,
	Linux Kernel mailing List <linux-kernel@...r.kernel.org>,
	fweisbec@...il.com, frederic@...nel.org, daniel.lezcano@...aro.org,
	amit.kucheria@...aro.org, edubezval@...il.com,
	viresh.kumar@...aro.org, rui.zhang@...el.com
Subject: Re: [PATCH V2] idle/intel_powerclamp: Redesign idle injection to
 use bandwidth control mechanism

Hi Steven,

On 02/09/2015 01:02 PM, Steven Noonan wrote:
> On Sun, Feb 8, 2015 at 8:49 PM, Preeti U Murthy
> <preeti@...ux.vnet.ibm.com> wrote:
>> The powerclamp driver injects idle periods to stay within the thermal constraints.
>> The driver does a fake idle by spawning per-cpu threads that call the mwait
>> instruction. This behavior of fake idle can confuse the other kernel subsystems.
>> For instance it calls into the nohz tick handlers, which are meant to be called
>> only by the idle thread. It sets the state of the tick in each cpu to idle and
>> stops the tick, when there are tasks on the runqueue. As a result the callers of
>> idle_cpu()/ tick_nohz_tick_stopped() see different states of the cpu; while the
>> former thinks that the cpu is busy, the latter thinks that it is idle. The outcome
>> may be  inconsistency in the scheduler/nohz states which can lead to serious
>> consequences. One of them was reported on this thread:
>> https://lkml.org/lkml/2014/12/11/365.
>>
>> Thomas posted out a patch to disable the powerclamp driver from calling into the
>> tick nohz code which has taken care of the above regression for the moment. However
>> powerclamp driver as a result, will not be able to inject idle periods due to the
>> presence of periodic ticks. With the current design of fake idle, we cannot move
>> towards a better solution.
>> https://lkml.org/lkml/2014/12/18/169
>>
>> This patch aims at removing the concept of fake idle and instead makes the cpus
>> truly idle by throttling the runqueues during the idle injection periods. The situation
>> is in fact very similar to throttling of cfs_rqs when they exceed their bandwidths.
>> The idle injection metrics can be mapped to the bandwidth control metrics 'quota' and
>> 'period' to achieve the same result. When the powerclamping is begun or when the
>> clamping controls have been modified, the bandwidth for the root task group is set.
>> The 'quota' will be the amount of time that the system needs to be busy and 'period'
>> will be the sum of this busy duration and the idle duration as calculated by the driver.
>> This gets rid of per-cpu kthreads, control cpu, hotplug notifiers and clamping mask since
>> the thread starting powerclamping will set the bandwidth and throttling of all cpus will
>> automatically fall in place. None of the other cpus need be bothered about this. This
>> simplifies the design of the driver.
>>
>> Of course this is only if the idle injection metrics can be conveniently transformed
>> into bandwidth control metrics. There are a couple of other primary concerns around if
>> doing the below two in this patch is valid.
>> a. This patch exports the functions to set the quota and period of task groups.
>> b. This patch removes the constraint of not being able to set the root task grp's bandwidth.
>>
>> Signed-off-by: Preeti U Murthy <preeti@...ux.vnet.ibm.com>
> 
> This doesn't compile.

Thanks for reporting this! I realized that I had not compiled in the powerclamp driver
as a module while compile testing it. I was focusing on the issues with the design and
failed to cross verify this. Apologies for the inconvenience.

Find the diff compile tested below. 

I also realized that clamp_cpus() that sets the bandwidth cannot be called from
multiple places. Currently I am calling it from end_powerclamp(), when the user changes the
idle clamping duration and from a queued timer. This will require synchronization between
callers which is not really called for. The queued wakeup_timer alone can re-evaluate the
clamping metrics after every throttle-unthrottle period and this should suffice as far
as I can see. Thoughts ?

Regards
Preeti U Murthy

----------------------------------------------------------------------------------

V2 of intel_powerclamp driver

From: Preeti U Murthy <preeti@...ux.vnet.ibm.com>


---
 drivers/thermal/Kconfig            |    1 
 drivers/thermal/intel_powerclamp.c |  301 ++++++++++--------------------------
 include/linux/sched.h              |    9 +
 kernel/sched/core.c                |    6 -
 kernel/sched/sched.h               |    5 -
 5 files changed, 95 insertions(+), 227 deletions(-)

diff --git a/drivers/thermal/Kconfig b/drivers/thermal/Kconfig
index af40db0..4b7cd02 100644
--- a/drivers/thermal/Kconfig
+++ b/drivers/thermal/Kconfig
@@ -233,6 +233,7 @@ config INTEL_POWERCLAMP
 	depends on THERMAL
 	depends on X86
 	depends on CPU_SUP_INTEL
+	select CFS_BANDWIDTH
 	help
 	  Enable this to enable Intel PowerClamp idle injection driver. This
 	  enforce idle time which results in more package C-state residency. The
diff --git a/drivers/thermal/intel_powerclamp.c b/drivers/thermal/intel_powerclamp.c
index 6ceebd6..4bd07bb 100644
--- a/drivers/thermal/intel_powerclamp.c
+++ b/drivers/thermal/intel_powerclamp.c
@@ -51,6 +51,7 @@
 #include <linux/debugfs.h>
 #include <linux/seq_file.h>
 #include <linux/sched/rt.h>
+#include <linux/sched.h>
 
 #include <asm/nmi.h>
 #include <asm/msr.h>
@@ -78,20 +79,18 @@ static struct dentry *debug_dir;
 static unsigned int set_target_ratio;
 static unsigned int current_ratio;
 static bool should_skip;
+static unsigned int count = 0;
 static bool reduce_irq;
 static atomic_t idle_wakeup_counter;
-static unsigned int control_cpu; /* The cpu assigned to collect stat and update
-				  * control parameters. default to BSP but BSP
-				  * can be offlined.
-				  */
 static bool clamping;
 
+/* Timer that evaluates bandwidth reset if clamping knobs have changed */
+static void clamp_timer_fn(unsigned long foo);
+DEFINE_TIMER(wakeup_timer, clamp_timer_fn, 0, 0);
+
+static void clamp_cpus(void);
 
-static struct task_struct * __percpu *powerclamp_thread;
 static struct thermal_cooling_device *cooling_dev;
-static unsigned long *cpu_clamping_mask;  /* bit map for tracking per cpu
-					   * clamping thread
-					   */
 
 static unsigned int duration;
 static unsigned int pkg_cstate_ratio_cur;
@@ -163,7 +162,7 @@ static int window_size_set(const char *arg, const struct kernel_param *kp)
 	smp_mb();
 
 exit_win:
-
+	clamp_cpus();
 	return ret;
 }
 
@@ -256,10 +255,6 @@ static u64 pkg_state_counter(void)
 	return count;
 }
 
-static void noop_timer(unsigned long foo)
-{
-	/* empty... just the fact that we get the interrupt wakes us up */
-}
 
 static unsigned int get_compensation(int ratio)
 {
@@ -362,100 +357,77 @@ static bool powerclamp_adjust_controls(unsigned int target_ratio,
 	return set_target_ratio + guard <= current_ratio;
 }
 
-static int clamp_thread(void *arg)
+static void clamp_cpus(void)
 {
-	int cpunr = (unsigned long)arg;
-	DEFINE_TIMER(wakeup_timer, noop_timer, 0, 0);
-	static const struct sched_param param = {
-		.sched_priority = MAX_USER_RT_PRIO/2,
-	};
-	unsigned int count = 0;
 	unsigned int target_ratio;
+	u64 quota, period;
+	int sleeptime;
+	unsigned long target_jiffies;
+	unsigned int guard;
+	unsigned int compensation;
+	int interval; /* jiffies to sleep for each attempt */
+	unsigned int duration_jiffies;
+	unsigned int window_size_now;
+
+again:
+	if (clamping == false)
+		goto out;
+
+	quota = RUNTIME_INF;
+	period = NSEC_PER_SEC;
+	compensation = 0;
+	duration_jiffies = msecs_to_jiffies(duration);
+	/*
+	 * make sure user selected ratio does not take effect until
+	 * the next round. adjust target_ratio if user has changed
+	 * target such that we can converge quickly.
+	 */
+	target_ratio = set_target_ratio;
+	guard = 1 + target_ratio/20;
+	window_size_now = window_size;
 
-	set_bit(cpunr, cpu_clamping_mask);
-	set_freezable();
-	init_timer_on_stack(&wakeup_timer);
-	sched_setscheduler(current, SCHED_FIFO, &param);
-
-	while (true == clamping && !kthread_should_stop() &&
-		cpu_online(cpunr)) {
-		int sleeptime;
-		unsigned long target_jiffies;
-		unsigned int guard;
-		unsigned int compensation = 0;
-		int interval; /* jiffies to sleep for each attempt */
-		unsigned int duration_jiffies = msecs_to_jiffies(duration);
-		unsigned int window_size_now;
-
-		try_to_freeze();
-		/*
-		 * make sure user selected ratio does not take effect until
-		 * the next round. adjust target_ratio if user has changed
-		 * target such that we can converge quickly.
-		 */
-		target_ratio = set_target_ratio;
-		guard = 1 + target_ratio/20;
-		window_size_now = window_size;
-		count++;
-
-		/*
-		 * systems may have different ability to enter package level
-		 * c-states, thus we need to compensate the injected idle ratio
-		 * to achieve the actual target reported by the HW.
-		 */
-		compensation = get_compensation(target_ratio);
-		interval = duration_jiffies*100/(target_ratio+compensation);
-
-		/* align idle time */
-		target_jiffies = roundup(jiffies, interval);
-		sleeptime = target_jiffies - jiffies;
-		if (sleeptime <= 0)
-			sleeptime = 1;
-		schedule_timeout_interruptible(sleeptime);
-		/*
-		 * only elected controlling cpu can collect stats and update
-		 * control parameters.
-		 */
-		if (cpunr == control_cpu && !(count%window_size_now)) {
-			should_skip =
-				powerclamp_adjust_controls(target_ratio,
-							guard, window_size_now);
-			smp_mb();
-		}
-
-		if (should_skip)
-			continue;
-
-		target_jiffies = jiffies + duration_jiffies;
-		mod_timer(&wakeup_timer, target_jiffies);
-		if (unlikely(local_softirq_pending()))
-			continue;
-		/*
-		 * stop tick sched during idle time, interrupts are still
-		 * allowed. thus jiffies are updated properly.
-		 */
-		preempt_disable();
-		/* mwait until target jiffies is reached */
-		while (time_before(jiffies, target_jiffies)) {
-			unsigned long ecx = 1;
-			unsigned long eax = target_mwait;
-
-			/*
-			 * REVISIT: may call enter_idle() to notify drivers who
-			 * can save power during cpu idle. same for exit_idle()
-			 */
-			local_touch_nmi();
-			stop_critical_timings();
-			mwait_idle_with_hints(eax, ecx);
-			start_critical_timings();
-			atomic_inc(&idle_wakeup_counter);
-		}
-		preempt_enable();
+	/*
+	 * systems may have different ability to enter package level
+	 * c-states, thus we need to compensate the injected idle ratio
+	 * to achieve the actual target reported by the HW.
+	 */
+	compensation = get_compensation(target_ratio);
+	interval = duration_jiffies*100/(target_ratio+compensation);
+
+	/* align idle time */
+	target_jiffies = roundup(jiffies, interval);
+	sleeptime = target_jiffies - jiffies;
+	if (sleeptime <= 0)
+		sleeptime = 1;
+
+	if (!(count%window_size_now)) {
+		should_skip =
+			powerclamp_adjust_controls(target_ratio,
+						guard, window_size_now);
+		smp_mb();
 	}
-	del_timer_sync(&wakeup_timer);
-	clear_bit(cpunr, cpu_clamping_mask);
 
-	return 0;
+	if (should_skip)
+		goto again;
+
+	target_jiffies = jiffies + sleeptime + duration_jiffies;
+	mod_timer(&wakeup_timer, target_jiffies);
+	if (unlikely(local_softirq_pending()))
+		goto again;
+
+	quota = jiffies_to_usecs(sleeptime);
+	period = jiffies_to_usecs(sleeptime + duration_jiffies);
+
+out:
+	tg_set_cfs_quota(&root_task_group, quota);
+	tg_set_cfs_period(&root_task_group, period);
+}
+
+static void clamp_timer_fn(unsigned long foo)
+{
+	/* Evaluate to see if clamping controls need to be adjusted */
+	count++;
+	clamp_cpus();
 }
 
 /*
@@ -501,8 +473,7 @@ static void poll_pkg_cstate(struct work_struct *dummy)
 
 static int start_power_clamp(void)
 {
-	unsigned long cpu;
-	struct task_struct *thread;
+	clamping = true;
 
 	/* check if pkg cstate counter is completely 0, abort in this case */
 	if (!has_pkg_state_counter()) {
@@ -511,108 +482,21 @@ static int start_power_clamp(void)
 	}
 
 	set_target_ratio = clamp(set_target_ratio, 0U, MAX_TARGET_RATIO - 1);
-	/* prevent cpu hotplug */
-	get_online_cpus();
-
-	/* prefer BSP */
-	control_cpu = 0;
-	if (!cpu_online(control_cpu))
-		control_cpu = smp_processor_id();
 
-	clamping = true;
 	schedule_delayed_work(&poll_pkg_cstate_work, 0);
-
-	/* start one thread per online cpu */
-	for_each_online_cpu(cpu) {
-		struct task_struct **p =
-			per_cpu_ptr(powerclamp_thread, cpu);
-
-		thread = kthread_create_on_node(clamp_thread,
-						(void *) cpu,
-						cpu_to_node(cpu),
-						"kidle_inject/%ld", cpu);
-		/* bind to cpu here */
-		if (likely(!IS_ERR(thread))) {
-			kthread_bind(thread, cpu);
-			wake_up_process(thread);
-			*p = thread;
-		}
-
-	}
-	put_online_cpus();
+	clamp_cpus();
 
 	return 0;
 }
 
 static void end_power_clamp(void)
 {
-	int i;
-	struct task_struct *thread;
-
 	clamping = false;
-	/*
-	 * make clamping visible to other cpus and give per cpu clamping threads
-	 * sometime to exit, or gets killed later.
-	 */
-	smp_mb();
-	msleep(20);
-	if (bitmap_weight(cpu_clamping_mask, num_possible_cpus())) {
-		for_each_set_bit(i, cpu_clamping_mask, num_possible_cpus()) {
-			pr_debug("clamping thread for cpu %d alive, kill\n", i);
-			thread = *per_cpu_ptr(powerclamp_thread, i);
-			kthread_stop(thread);
-		}
-	}
-}
 
-static int powerclamp_cpu_callback(struct notifier_block *nfb,
-				unsigned long action, void *hcpu)
-{
-	unsigned long cpu = (unsigned long)hcpu;
-	struct task_struct *thread;
-	struct task_struct **percpu_thread =
-		per_cpu_ptr(powerclamp_thread, cpu);
-
-	if (false == clamping)
-		goto exit_ok;
-
-	switch (action) {
-	case CPU_ONLINE:
-		thread = kthread_create_on_node(clamp_thread,
-						(void *) cpu,
-						cpu_to_node(cpu),
-						"kidle_inject/%lu", cpu);
-		if (likely(!IS_ERR(thread))) {
-			kthread_bind(thread, cpu);
-			wake_up_process(thread);
-			*percpu_thread = thread;
-		}
-		/* prefer BSP as controlling CPU */
-		if (cpu == 0) {
-			control_cpu = 0;
-			smp_mb();
-		}
-		break;
-	case CPU_DEAD:
-		if (test_bit(cpu, cpu_clamping_mask)) {
-			pr_err("cpu %lu dead but powerclamping thread is not\n",
-				cpu);
-			kthread_stop(*percpu_thread);
-		}
-		if (cpu == control_cpu) {
-			control_cpu = smp_processor_id();
-			smp_mb();
-		}
-	}
-
-exit_ok:
-	return NOTIFY_OK;
+	clamp_cpus();
+	del_timer_sync(&wakeup_timer);
 }
 
-static struct notifier_block powerclamp_cpu_notifier = {
-	.notifier_call = powerclamp_cpu_callback,
-};
-
 static int powerclamp_get_max_state(struct thermal_cooling_device *cdev,
 				 unsigned long *state)
 {
@@ -656,6 +540,7 @@ static int powerclamp_set_cur_state(struct thermal_cooling_device *cdev,
 	}
 
 exit_set:
+	clamp_cpus();
 	return ret;
 }
 
@@ -716,7 +601,6 @@ static int powerclamp_debug_show(struct seq_file *m, void *unused)
 {
 	int i = 0;
 
-	seq_printf(m, "controlling cpu: %d\n", control_cpu);
 	seq_printf(m, "pct confidence steady dynamic (compensation)\n");
 	for (i = 0; i < MAX_TARGET_RATIO; i++) {
 		seq_printf(m, "%d\t%lu\t%lu\t%lu\n",
@@ -762,33 +646,20 @@ file_error:
 static int powerclamp_init(void)
 {
 	int retval;
-	int bitmap_size;
-
-	bitmap_size = BITS_TO_LONGS(num_possible_cpus()) * sizeof(long);
-	cpu_clamping_mask = kzalloc(bitmap_size, GFP_KERNEL);
-	if (!cpu_clamping_mask)
-		return -ENOMEM;
 
 	/* probe cpu features and ids here */
 	retval = powerclamp_probe();
 	if (retval)
-		goto exit_free;
+		goto exit;
 
 	/* set default limit, maybe adjusted during runtime based on feedback */
 	window_size = 2;
-	register_hotcpu_notifier(&powerclamp_cpu_notifier);
-
-	powerclamp_thread = alloc_percpu(struct task_struct *);
-	if (!powerclamp_thread) {
-		retval = -ENOMEM;
-		goto exit_unregister;
-	}
 
 	cooling_dev = thermal_cooling_device_register("intel_powerclamp", NULL,
 						&powerclamp_cooling_ops);
 	if (IS_ERR(cooling_dev)) {
 		retval = -ENODEV;
-		goto exit_free_thread;
+		goto exit;
 	}
 
 	if (!duration)
@@ -798,23 +669,15 @@ static int powerclamp_init(void)
 
 	return 0;
 
-exit_free_thread:
-	free_percpu(powerclamp_thread);
-exit_unregister:
-	unregister_hotcpu_notifier(&powerclamp_cpu_notifier);
-exit_free:
-	kfree(cpu_clamping_mask);
+exit:
 	return retval;
 }
 module_init(powerclamp_init);
 
 static void powerclamp_exit(void)
 {
-	unregister_hotcpu_notifier(&powerclamp_cpu_notifier);
 	end_power_clamp();
-	free_percpu(powerclamp_thread);
 	thermal_cooling_device_unregister(cooling_dev);
-	kfree(cpu_clamping_mask);
 
 	cancel_delayed_work_sync(&poll_pkg_cstate_work);
 	debugfs_remove_recursive(debug_dir);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 8db31ef..2493942 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -163,6 +163,11 @@ extern void get_avenrun(unsigned long *loads, unsigned long offset, int shift);
 	load += n*(FIXED_1-exp); \
 	load >>= FSHIFT;
 
+/*
+ * single value that denotes runtime == period, ie unlimited time.
+ */
+#define RUNTIME_INF	((u64)~0ULL)
+
 extern unsigned long total_forks;
 extern int nr_threads;
 DECLARE_PER_CPU(unsigned long, process_counts);
@@ -3002,6 +3007,10 @@ extern long sched_getaffinity(pid_t pid, struct cpumask *mask);
 
 #ifdef CONFIG_CGROUP_SCHED
 extern struct task_group root_task_group;
+#ifdef CONFIG_CFS_BANDWIDTH
+extern int tg_set_cfs_quota(struct task_group *tg, long cfs_quota_us);
+extern int tg_set_cfs_period(struct task_group *tg, long cfs_period_us);
+#endif /* CONFIG_CFS_BANDWIDTH */
 #endif /* CONFIG_CGROUP_SCHED */
 
 extern int task_can_switch_user(struct user_struct *up,
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e628cb1..7471b06 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7097,6 +7097,7 @@ int in_sched_functions(unsigned long addr)
  * Every task in system belongs to this group at bootup.
  */
 struct task_group root_task_group;
+EXPORT_SYMBOL_GPL(root_task_group);
 LIST_HEAD(task_groups);
 #endif
 
@@ -8059,9 +8060,6 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
 	int i, ret = 0, runtime_enabled, runtime_was_enabled;
 	struct cfs_bandwidth *cfs_b = &tg->cfs_bandwidth;
 
-	if (tg == &root_task_group)
-		return -EINVAL;
-
 	/*
 	 * Ensure we have at some amount of bandwidth every period.  This is
 	 * to prevent reaching a state of large arrears when throttled via
@@ -8141,6 +8139,7 @@ int tg_set_cfs_quota(struct task_group *tg, long cfs_quota_us)
 
 	return tg_set_cfs_bandwidth(tg, period, quota);
 }
+EXPORT_SYMBOL_GPL(tg_set_cfs_quota);
 
 long tg_get_cfs_quota(struct task_group *tg)
 {
@@ -8164,6 +8163,7 @@ int tg_set_cfs_period(struct task_group *tg, long cfs_period_us)
 
 	return tg_set_cfs_bandwidth(tg, period, quota);
 }
+EXPORT_SYMBOL_GPL(tg_set_cfs_period);
 
 long tg_get_cfs_period(struct task_group *tg)
 {
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 9a2a45c..20493e4 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -72,11 +72,6 @@ extern void update_cpu_load_active(struct rq *this_rq);
  * These are the 'tuning knobs' of the scheduler:
  */
 
-/*
- * single value that denotes runtime == period, ie unlimited time.
- */
-#define RUNTIME_INF	((u64)~0ULL)
-
 static inline int fair_policy(int policy)
 {
 	return policy == SCHED_NORMAL || policy == SCHED_BATCH;

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/