lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:	Sun, 8 Feb 2015 23:32:31 -0800
From:	Steven Noonan <steven@...inklabs.net>
To:	Preeti U Murthy <preeti@...ux.vnet.ibm.com>
Cc:	Peter Zijlstra <peterz@...radead.org>,
	Thomas Gleixner <tglx@...utronix.de>, arjan@...ux.intel.com,
	Linux Kernel mailing List <linux-kernel@...r.kernel.org>,
	jacob.jun.pan@...el.com, fweisbec@...il.com, frederic@...nel.org,
	daniel.lezcano@...aro.org, amit.kucheria@...aro.org,
	edubezval@...il.com, viresh.kumar@...aro.org, rui.zhang@...el.com
Subject: Re: [PATCH V2] idle/intel_powerclamp: Redesign idle injection to use
 bandwidth control mechanism

On Sun, Feb 8, 2015 at 8:49 PM, Preeti U Murthy
<preeti@...ux.vnet.ibm.com> wrote:
> The powerclamp driver injects idle periods to stay within the thermal constraints.
> The driver does a fake idle by spawning per-cpu threads that call the mwait
> instruction. This behavior of fake idle can confuse the other kernel subsystems.
> For instance it calls into the nohz tick handlers, which are meant to be called
> only by the idle thread. It sets the state of the tick in each cpu to idle and
> stops the tick, when there are tasks on the runqueue. As a result the callers of
> idle_cpu()/ tick_nohz_tick_stopped() see different states of the cpu; while the
> former thinks that the cpu is busy, the latter thinks that it is idle. The outcome
> may be  inconsistency in the scheduler/nohz states which can lead to serious
> consequences. One of them was reported on this thread:
> https://lkml.org/lkml/2014/12/11/365.
>
> Thomas posted out a patch to disable the powerclamp driver from calling into the
> tick nohz code which has taken care of the above regression for the moment. However
> powerclamp driver as a result, will not be able to inject idle periods due to the
> presence of periodic ticks. With the current design of fake idle, we cannot move
> towards a better solution.
> https://lkml.org/lkml/2014/12/18/169
>
> This patch aims at removing the concept of fake idle and instead makes the cpus
> truly idle by throttling the runqueues during the idle injection periods. The situation
> is in fact very similar to throttling of cfs_rqs when they exceed their bandwidths.
> The idle injection metrics can be mapped to the bandwidth control metrics 'quota' and
> 'period' to achieve the same result. When the powerclamping is begun or when the
> clamping controls have been modified, the bandwidth for the root task group is set.
> The 'quota' will be the amount of time that the system needs to be busy and 'period'
> will be the sum of this busy duration and the idle duration as calculated by the driver.
> This gets rid of per-cpu kthreads, control cpu, hotplug notifiers and clamping mask since
> the thread starting powerclamping will set the bandwidth and throttling of all cpus will
> automatically fall in place. None of the other cpus need be bothered about this. This
> simplifies the design of the driver.
>
> Of course this is only if the idle injection metrics can be conveniently transformed
> into bandwidth control metrics. There are a couple of other primary concerns around if
> doing the below two in this patch is valid.
> a. This patch exports the functions to set the quota and period of task groups.
> b. This patch removes the constraint of not being able to set the root task grp's bandwidth.
>
> Signed-off-by: Preeti U Murthy <preeti@...ux.vnet.ibm.com>

This doesn't compile.

Missing forward declaration:

drivers/thermal/intel_powerclamp.c: In function ‘window_size_set’:
drivers/thermal/intel_powerclamp.c:160:2: error: implicit declaration
of function ‘clamp_cpus’ [-Werror=implicit-function-declaration]
  clamp_cpus();
  ^
drivers/thermal/intel_powerclamp.c: At top level:
drivers/thermal/intel_powerclamp.c:355:12: error: static declaration
of ‘clamp_cpus’ follows non-static declaration
 static int clamp_cpus(void)
            ^
drivers/thermal/intel_powerclamp.c:160:2: note: previous implicit
declaration of ‘clamp_cpus’ was here
  clamp_cpus();
  ^


Missing RUNTIME_INF definition (only appears in kernel/sched/sched.h
from what I can see?):

drivers/thermal/intel_powerclamp.c: In function ‘clamp_cpus’:
drivers/thermal/intel_powerclamp.c:358:14: error: ‘RUNTIME_INF’
undeclared (first use in this function)
  u64 quota = RUNTIME_INF, period;
              ^
drivers/thermal/intel_powerclamp.c:358:14: note: each undeclared
identifier is reported only once for each function it appears in


Weird label placement that doesn't make sense:

drivers/thermal/intel_powerclamp.c:364:2: error: a label can only be
part of a statement and a declaration is not a statement
  int sleeptime;
  ^
drivers/thermal/intel_powerclamp.c:365:2: error: expected expression
before ‘unsigned’
  unsigned long target_jiffies;
  ^
drivers/thermal/intel_powerclamp.c:366:2: warning: ISO C90 forbids
mixed declarations and code [-Wdeclaration-after-statement]
  unsigned int guard;
  ^
drivers/thermal/intel_powerclamp.c:390:2: error: ‘target_jiffies’
undeclared (first use in this function)
  target_jiffies = roundup(jiffies, interval);
  ^


Missing debug variable:

drivers/thermal/intel_powerclamp.c: In function ‘powerclamp_debug_show’:
drivers/thermal/intel_powerclamp.c:598:41: error: ‘control_cpu’
undeclared (first use in this function)
  seq_printf(m, "controlling cpu: %d\n", control_cpu);
                                         ^

Missing label:

drivers/thermal/intel_powerclamp.c: In function ‘powerclamp_init’:
drivers/thermal/intel_powerclamp.c:649:3: error: label ‘exit_free’
used but not defined
   goto exit_free;
   ^
drivers/thermal/intel_powerclamp.c:644:6: warning: unused variable
‘bitmap_size’ [-Wunused-variable]
  int bitmap_size;
      ^

> ---
> V1: https://lkml.org/lkml/2014/12/21/23
>
> V1 tried to throttle cfs_rqs without modifying available bandwidth. This will not
> be an appropriate solution since the scheduler only considers a runqueue as
> throttled when its bandwidth has been exceeded. Going around this path can
> lead to unintended consequences. Hence V2 was designed to throttle runqueues
> through bandwidth control.
>
>  drivers/thermal/intel_powerclamp.c |  289 +++++++++---------------------------
>  include/linux/sched.h              |    6 +
>  kernel/sched/core.c                |    5 -
>  3 files changed, 82 insertions(+), 218 deletions(-)
>
> diff --git a/drivers/thermal/intel_powerclamp.c b/drivers/thermal/intel_powerclamp.c
> index 6ceebd6..f43903c 100644
> --- a/drivers/thermal/intel_powerclamp.c
> +++ b/drivers/thermal/intel_powerclamp.c
> @@ -78,20 +78,14 @@ static struct dentry *debug_dir;
>  static unsigned int set_target_ratio;
>  static unsigned int current_ratio;
>  static bool should_skip;
> +static unsigned int count = 0;
>  static bool reduce_irq;
>  static atomic_t idle_wakeup_counter;
> -static unsigned int control_cpu; /* The cpu assigned to collect stat and update
> -                                 * control parameters. default to BSP but BSP
> -                                 * can be offlined.
> -                                 */
>  static bool clamping;
> +static void clamp_timer_fn(unsigned long foo);
> +DEFINE_TIMER(wakeup_timer, clamp_timer_fn, 0, 0);
>
> -
> -static struct task_struct * __percpu *powerclamp_thread;
>  static struct thermal_cooling_device *cooling_dev;
> -static unsigned long *cpu_clamping_mask;  /* bit map for tracking per cpu
> -                                          * clamping thread
> -                                          */
>
>  static unsigned int duration;
>  static unsigned int pkg_cstate_ratio_cur;
> @@ -163,7 +157,7 @@ static int window_size_set(const char *arg, const struct kernel_param *kp)
>         smp_mb();
>
>  exit_win:
> -
> +       clamp_cpus();
>         return ret;
>  }
>
> @@ -256,10 +250,6 @@ static u64 pkg_state_counter(void)
>         return count;
>  }
>
> -static void noop_timer(unsigned long foo)
> -{
> -       /* empty... just the fact that we get the interrupt wakes us up */
> -}
>
>  static unsigned int get_compensation(int ratio)
>  {
> @@ -362,102 +352,78 @@ static bool powerclamp_adjust_controls(unsigned int target_ratio,
>         return set_target_ratio + guard <= current_ratio;
>  }
>
> -static int clamp_thread(void *arg)
> +static int clamp_cpus(void)
>  {
> -       int cpunr = (unsigned long)arg;
> -       DEFINE_TIMER(wakeup_timer, noop_timer, 0, 0);
> -       static const struct sched_param param = {
> -               .sched_priority = MAX_USER_RT_PRIO/2,
> -       };
> -       unsigned int count = 0;
>         unsigned int target_ratio;
> +       u64 quota = RUNTIME_INF, period;
>
> -       set_bit(cpunr, cpu_clamping_mask);
> -       set_freezable();
> -       init_timer_on_stack(&wakeup_timer);
> -       sched_setscheduler(current, SCHED_FIFO, &param);
> -
> -       while (true == clamping && !kthread_should_stop() &&
> -               cpu_online(cpunr)) {
> -               int sleeptime;
> -               unsigned long target_jiffies;
> -               unsigned int guard;
> -               unsigned int compensation = 0;
> -               int interval; /* jiffies to sleep for each attempt */
> -               unsigned int duration_jiffies = msecs_to_jiffies(duration);
> -               unsigned int window_size_now;
> -
> -               try_to_freeze();
> -               /*
> -                * make sure user selected ratio does not take effect until
> -                * the next round. adjust target_ratio if user has changed
> -                * target such that we can converge quickly.
> -                */
> -               target_ratio = set_target_ratio;
> -               guard = 1 + target_ratio/20;
> -               window_size_now = window_size;
> -               count++;
> -
> -               /*
> -                * systems may have different ability to enter package level
> -                * c-states, thus we need to compensate the injected idle ratio
> -                * to achieve the actual target reported by the HW.
> -                */
> -               compensation = get_compensation(target_ratio);
> -               interval = duration_jiffies*100/(target_ratio+compensation);
> -
> -               /* align idle time */
> -               target_jiffies = roundup(jiffies, interval);
> -               sleeptime = target_jiffies - jiffies;
> -               if (sleeptime <= 0)
> -                       sleeptime = 1;
> -               schedule_timeout_interruptible(sleeptime);
> -               /*
> -                * only elected controlling cpu can collect stats and update
> -                * control parameters.
> -                */
> -               if (cpunr == control_cpu && !(count%window_size_now)) {
> -                       should_skip =
> -                               powerclamp_adjust_controls(target_ratio,
> -                                                       guard, window_size_now);
> -                       smp_mb();
> -               }
> +       if (clamping == false)
> +               goto out;
>
> -               if (should_skip)
> -                       continue;
> -
> -               target_jiffies = jiffies + duration_jiffies;
> -               mod_timer(&wakeup_timer, target_jiffies);
> -               if (unlikely(local_softirq_pending()))
> -                       continue;
> -               /*
> -                * stop tick sched during idle time, interrupts are still
> -                * allowed. thus jiffies are updated properly.
> -                */
> -               preempt_disable();
> -               /* mwait until target jiffies is reached */
> -               while (time_before(jiffies, target_jiffies)) {
> -                       unsigned long ecx = 1;
> -                       unsigned long eax = target_mwait;
> -
> -                       /*
> -                        * REVISIT: may call enter_idle() to notify drivers who
> -                        * can save power during cpu idle. same for exit_idle()
> -                        */
> -                       local_touch_nmi();
> -                       stop_critical_timings();
> -                       mwait_idle_with_hints(eax, ecx);
> -                       start_critical_timings();
> -                       atomic_inc(&idle_wakeup_counter);
> -               }
> -               preempt_enable();
> +again:
> +       int sleeptime;
> +       unsigned long target_jiffies;
> +       unsigned int guard;
> +       unsigned int compensation = 0;
> +       int interval; /* jiffies to sleep for each attempt */
> +       unsigned int duration_jiffies = msecs_to_jiffies(duration);
> +       unsigned int window_size_now;
> +
> +       /*
> +        * make sure user selected ratio does not take effect until
> +        * the next round. adjust target_ratio if user has changed
> +        * target such that we can converge quickly.
> +        */
> +       target_ratio = set_target_ratio;
> +       guard = 1 + target_ratio/20;
> +       window_size_now = window_size;
> +
> +       /*
> +        * systems may have different ability to enter package level
> +        * c-states, thus we need to compensate the injected idle ratio
> +        * to achieve the actual target reported by the HW.
> +        */
> +       compensation = get_compensation(target_ratio);
> +       interval = duration_jiffies*100/(target_ratio+compensation);
> +
> +       /* align idle time */
> +       target_jiffies = roundup(jiffies, interval);
> +       sleeptime = target_jiffies - jiffies;
> +       if (sleeptime <= 0)
> +               sleeptime = 1;
> +
> +       if (!(count%window_size_now)) {
> +               should_skip =
> +                       powerclamp_adjust_controls(target_ratio,
> +                                               guard, window_size_now);
> +               smp_mb();
>         }
> -       del_timer_sync(&wakeup_timer);
> -       clear_bit(cpunr, cpu_clamping_mask);
> +
> +       if (should_skip)
> +               goto again;
> +
> +       target_jiffies = jiffies + sleeptime + duration_jiffies;
> +       mod_timer(&wakeup_timer, target_jiffies);
> +       if (unlikely(local_softirq_pending()))
> +               goto again;
> +
> +       quota = jiffies_to_usecs(sleeptime);
> +       period = jiffies_to_usecs(sleeptime + duration_jiffies);
> +
> +out:
> +       tg_set_cfs_quota(&root_task_group, quota);
> +       tg_set_cfs_period(&root_task_group, period);
>
>         return 0;
>  }
>
> +static void clamp_timer_fn(unsigned long foo)
> +{
> +       /* Evaluate to see if clamping controls need to be adjusted */
> +       count++;
> +       clamp_cpus();
> +}
> +
>  /*
>   * 1 HZ polling while clamping is active, useful for userspace
>   * to monitor actual idle ratio.
> @@ -501,8 +467,7 @@ static void poll_pkg_cstate(struct work_struct *dummy)
>
>  static int start_power_clamp(void)
>  {
> -       unsigned long cpu;
> -       struct task_struct *thread;
> +       clamping = true;
>
>         /* check if pkg cstate counter is completely 0, abort in this case */
>         if (!has_pkg_state_counter()) {
> @@ -511,108 +476,21 @@ static int start_power_clamp(void)
>         }
>
>         set_target_ratio = clamp(set_target_ratio, 0U, MAX_TARGET_RATIO - 1);
> -       /* prevent cpu hotplug */
> -       get_online_cpus();
> -
> -       /* prefer BSP */
> -       control_cpu = 0;
> -       if (!cpu_online(control_cpu))
> -               control_cpu = smp_processor_id();
>
> -       clamping = true;
>         schedule_delayed_work(&poll_pkg_cstate_work, 0);
> -
> -       /* start one thread per online cpu */
> -       for_each_online_cpu(cpu) {
> -               struct task_struct **p =
> -                       per_cpu_ptr(powerclamp_thread, cpu);
> -
> -               thread = kthread_create_on_node(clamp_thread,
> -                                               (void *) cpu,
> -                                               cpu_to_node(cpu),
> -                                               "kidle_inject/%ld", cpu);
> -               /* bind to cpu here */
> -               if (likely(!IS_ERR(thread))) {
> -                       kthread_bind(thread, cpu);
> -                       wake_up_process(thread);
> -                       *p = thread;
> -               }
> -
> -       }
> -       put_online_cpus();
> +       clamp_cpus();
>
>         return 0;
>  }
>
>  static void end_power_clamp(void)
>  {
> -       int i;
> -       struct task_struct *thread;
> -
>         clamping = false;
> -       /*
> -        * make clamping visible to other cpus and give per cpu clamping threads
> -        * sometime to exit, or gets killed later.
> -        */
> -       smp_mb();
> -       msleep(20);
> -       if (bitmap_weight(cpu_clamping_mask, num_possible_cpus())) {
> -               for_each_set_bit(i, cpu_clamping_mask, num_possible_cpus()) {
> -                       pr_debug("clamping thread for cpu %d alive, kill\n", i);
> -                       thread = *per_cpu_ptr(powerclamp_thread, i);
> -                       kthread_stop(thread);
> -               }
> -       }
> -}
>
> -static int powerclamp_cpu_callback(struct notifier_block *nfb,
> -                               unsigned long action, void *hcpu)
> -{
> -       unsigned long cpu = (unsigned long)hcpu;
> -       struct task_struct *thread;
> -       struct task_struct **percpu_thread =
> -               per_cpu_ptr(powerclamp_thread, cpu);
> -
> -       if (false == clamping)
> -               goto exit_ok;
> -
> -       switch (action) {
> -       case CPU_ONLINE:
> -               thread = kthread_create_on_node(clamp_thread,
> -                                               (void *) cpu,
> -                                               cpu_to_node(cpu),
> -                                               "kidle_inject/%lu", cpu);
> -               if (likely(!IS_ERR(thread))) {
> -                       kthread_bind(thread, cpu);
> -                       wake_up_process(thread);
> -                       *percpu_thread = thread;
> -               }
> -               /* prefer BSP as controlling CPU */
> -               if (cpu == 0) {
> -                       control_cpu = 0;
> -                       smp_mb();
> -               }
> -               break;
> -       case CPU_DEAD:
> -               if (test_bit(cpu, cpu_clamping_mask)) {
> -                       pr_err("cpu %lu dead but powerclamping thread is not\n",
> -                               cpu);
> -                       kthread_stop(*percpu_thread);
> -               }
> -               if (cpu == control_cpu) {
> -                       control_cpu = smp_processor_id();
> -                       smp_mb();
> -               }
> -       }
> -
> -exit_ok:
> -       return NOTIFY_OK;
> +       clamp_cpus();
> +       del_timer_sync(&wakeup_timer);
>  }
>
> -static struct notifier_block powerclamp_cpu_notifier = {
> -       .notifier_call = powerclamp_cpu_callback,
> -};
> -
>  static int powerclamp_get_max_state(struct thermal_cooling_device *cdev,
>                                  unsigned long *state)
>  {
> @@ -656,6 +534,7 @@ static int powerclamp_set_cur_state(struct thermal_cooling_device *cdev,
>         }
>
>  exit_set:
> +       clamp_cpus();
>         return ret;
>  }
>
> @@ -764,11 +643,6 @@ static int powerclamp_init(void)
>         int retval;
>         int bitmap_size;
>
> -       bitmap_size = BITS_TO_LONGS(num_possible_cpus()) * sizeof(long);
> -       cpu_clamping_mask = kzalloc(bitmap_size, GFP_KERNEL);
> -       if (!cpu_clamping_mask)
> -               return -ENOMEM;
> -
>         /* probe cpu features and ids here */
>         retval = powerclamp_probe();
>         if (retval)
> @@ -776,19 +650,12 @@ static int powerclamp_init(void)
>
>         /* set default limit, maybe adjusted during runtime based on feedback */
>         window_size = 2;
> -       register_hotcpu_notifier(&powerclamp_cpu_notifier);
> -
> -       powerclamp_thread = alloc_percpu(struct task_struct *);
> -       if (!powerclamp_thread) {
> -               retval = -ENOMEM;
> -               goto exit_unregister;
> -       }
>
>         cooling_dev = thermal_cooling_device_register("intel_powerclamp", NULL,
>                                                 &powerclamp_cooling_ops);
>         if (IS_ERR(cooling_dev)) {
>                 retval = -ENODEV;
> -               goto exit_free_thread;
> +               goto exit;
>         }
>
>         if (!duration)
> @@ -798,23 +665,15 @@ static int powerclamp_init(void)
>
>         return 0;
>
> -exit_free_thread:
> -       free_percpu(powerclamp_thread);
> -exit_unregister:
> -       unregister_hotcpu_notifier(&powerclamp_cpu_notifier);
> -exit_free:
> -       kfree(cpu_clamping_mask);
> +exit:
>         return retval;
>  }
>  module_init(powerclamp_init);
>
>  static void powerclamp_exit(void)
>  {
> -       unregister_hotcpu_notifier(&powerclamp_cpu_notifier);
>         end_power_clamp();
> -       free_percpu(powerclamp_thread);
>         thermal_cooling_device_unregister(cooling_dev);
> -       kfree(cpu_clamping_mask);
>
>         cancel_delayed_work_sync(&poll_pkg_cstate_work);
>         debugfs_remove_recursive(debug_dir);
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 8db31ef..6a7ccb2 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -3002,6 +3002,12 @@ extern long sched_getaffinity(pid_t pid, struct cpumask *mask);
>
>  #ifdef CONFIG_CGROUP_SCHED
>  extern struct task_group root_task_group;
> +extern int tg_set_cfs_quota(struct task_group *tg, long cfs_quota_us);
> +extern int tg_set_cfs_period(struct task_group *tg, long cfs_period_us);
> +#else
> +
> +static inline int tg_set_cfs_quota(struct task_group *tg, long cfs_quota_us);
> +static inline int tg_set_cfs_period(struct task_group *tg, long cfs_period_us);
>  #endif /* CONFIG_CGROUP_SCHED */
>
>  extern int task_can_switch_user(struct user_struct *up,
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index e628cb1..f47621a 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -8059,9 +8059,6 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
>         int i, ret = 0, runtime_enabled, runtime_was_enabled;
>         struct cfs_bandwidth *cfs_b = &tg->cfs_bandwidth;
>
> -       if (tg == &root_task_group)
> -               return -EINVAL;
> -
>         /*
>          * Ensure we have at some amount of bandwidth every period.  This is
>          * to prevent reaching a state of large arrears when throttled via
> @@ -8141,6 +8138,7 @@ int tg_set_cfs_quota(struct task_group *tg, long cfs_quota_us)
>
>         return tg_set_cfs_bandwidth(tg, period, quota);
>  }
> +EXPORT_SYMBOL_GPL(tg_set_cfs_quota);
>
>  long tg_get_cfs_quota(struct task_group *tg)
>  {
> @@ -8164,6 +8162,7 @@ int tg_set_cfs_period(struct task_group *tg, long cfs_period_us)
>
>         return tg_set_cfs_bandwidth(tg, period, quota);
>  }
> +EXPORT_SYMBOL_GPL(tg_set_cfs_period);
>
>  long tg_get_cfs_period(struct task_group *tg)
>  {
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@...r.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ