linux-kernel - Re: [RFCv4 PATCH 31/34] sched: Energy-aware wake-up task placement

Date:	Thu, 14 May 2015 16:10:11 +0100
From:	Morten Rasmussen <morten.rasmussen@....com>
To:	"pang.xunlei@....com.cn" <pang.xunlei@....com.cn>
Cc:	Dietmar Eggemann <Dietmar.Eggemann@....com>,
	Juri Lelli <Juri.Lelli@....com>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	"linux-pm@...r.kernel.org" <linux-pm@...r.kernel.org>,
	"mingo@...hat.com" <mingo@...hat.com>,
	"mturquette@...aro.org" <mturquette@...aro.org>,
	"peterz@...radead.org" <peterz@...radead.org>,
	"preeti@...ux.vnet.ibm.com" <preeti@...ux.vnet.ibm.com>,
	"rjw@...ysocki.net" <rjw@...ysocki.net>,
	"sgurrappadi@...dia.com" <sgurrappadi@...dia.com>,
	"vincent.guittot@...aro.org" <vincent.guittot@...aro.org>,
	"yuyang.du@...el.com" <yuyang.du@...el.com>
Subject: Re: [RFCv4 PATCH 31/34] sched: Energy-aware wake-up task placement

On Thu, May 14, 2015 at 10:34:20AM +0100, pang.xunlei@....com.cn wrote:
> Morten Rasmussen <morten.rasmussen@....com> wrote 2015-05-13 AM 03:39:06:
> > [RFCv4 PATCH 31/34] sched: Energy-aware wake-up task placement
> >
> > Let available compute capacity and estimated energy impact select
> > wake-up target cpu when energy-aware scheduling is enabled and the
> > system is not over-utilized (above the tipping point).
> >
> > energy_aware_wake_cpu() attempts to find a group of cpus with sufficient
> > compute capacity to accommodate the task and find a cpu with enough spare
> > capacity to handle the task within that group. Preference is given to
> > cpus with enough spare capacity at the current OPP. Finally, the energy
> > impact of the new target and the previous task cpu is compared to select
> > the wake-up target cpu.
> >
> > cc: Ingo Molnar <mingo@...hat.com>
> > cc: Peter Zijlstra <peterz@...radead.org>
> >
> > Signed-off-by: Morten Rasmussen <morten.rasmussen@....com>
> > ---
> >  kernel/sched/fair.c | 85 ++++++++++++++++++++++++++++++++++++++++++++++++++++-
> >  1 file changed, 84 insertions(+), 1 deletion(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index bb44646..fe41e1e 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -5394,6 +5394,86 @@ static int select_idle_sibling(struct task_struct *p, int target)
> >     return target;
> >  }
> >
> > +static int energy_aware_wake_cpu(struct task_struct *p)
> > +{
> > +   struct sched_domain *sd;
> > +   struct sched_group *sg, *sg_target;
> > +   int target_max_cap = INT_MAX;
> > +   int target_cpu = task_cpu(p);
> > +   int i;
> > +
> > +   sd = rcu_dereference(per_cpu(sd_ea, task_cpu(p)));
> > +
> > +   if (!sd)
> > +      return -1;
> > +
> > +   sg = sd->groups;
> > +   sg_target = sg;
> > +
> > +   /*
> > +    * Find group with sufficient capacity. We only get here if no cpu is
> > +    * overutilized. We may end up overutilizing a cpu by adding the task,
> > +    * but that should not be any worse than select_idle_sibling().
> > +    * load_balance() should sort it out later as we get above the tipping
> > +    * point.
> > +    */
> > +   do {
> > +      /* Assuming all cpus are the same in group */
> > +      int max_cap_cpu = group_first_cpu(sg);
> > +
> > +      /*
> > +       * Assume smaller max capacity means more energy-efficient.
> > +       * Ideally we should query the energy model for the right
> > +       * answer but it easily ends up in an exhaustive search.
> > +       */
> > +      if (capacity_of(max_cap_cpu) < target_max_cap &&
> > +          task_fits_capacity(p, max_cap_cpu)) {
> > +         sg_target = sg;
> > +         target_max_cap = capacity_of(max_cap_cpu);
> > +      }
> > +   } while (sg = sg->next, sg != sd->groups);
> > +
> > +   /* Find cpu with sufficient capacity */
> > +   for_each_cpu_and(i, tsk_cpus_allowed(p), sched_group_cpus(sg_target)) {
> > +      /*
> > +       * p's blocked utilization is still accounted for on prev_cpu
> > +       * so prev_cpu will receive a negative bias due to the double
> > +       * accounting. However, the blocked utilization may be zero.
> > +       */
> > +      int new_usage = get_cpu_usage(i) + task_utilization(p);
> > +
> > +      if (new_usage > capacity_orig_of(i))
> > +         continue;
> > +
> > +      if (new_usage < capacity_curr_of(i)) {
> > +         target_cpu = i;
> > +         if (cpu_rq(i)->nr_running)
> > +            break;
> > +      }
> > +
> > +      /* cpu has capacity at higher OPP, keep it as fallback */
> > +      if (target_cpu == task_cpu(p))
> > +         target_cpu = i;
> > +   }
> > +
> > +   if (target_cpu != task_cpu(p)) {
> > +      struct energy_env eenv = {
> > +         .usage_delta   = task_utilization(p),
> > +         .src_cpu   = task_cpu(p),
> > +         .dst_cpu   = target_cpu,
> > +      };
> 
> At this point, p hasn't been queued on src_cpu, but energy_diff() below will
> still subtract its utilization from src_cpu, is that right?

energy_aware_wake_cpu() should only be called for existing tasks, i.e.
SD_BALANCE_WAKE, so p should have been queued on src_cpu in the past.
New tasks (SD_BALANCE_FORK) take the find_idlest_{group, cpu}() route.

Or did I miss something?
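
To illustrate the routing described above, here is a minimal sketch. The
enum values and select_path() are hypothetical stand-ins, not the kernel's
SD_* flags or any function in fair.c, and the "default wake" fallback is an
assumption about what happens when energy-aware placement does not apply.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for SD_BALANCE_WAKE / SD_BALANCE_FORK. */
enum balance_reason { BALANCE_WAKE, BALANCE_FORK };

/* Hypothetical labels for the placement routes discussed in this thread. */
enum placement_path {
	PATH_ENERGY_AWARE_WAKE,	/* energy_aware_wake_cpu() */
	PATH_FIND_IDLEST,	/* find_idlest_{group, cpu}() */
	PATH_DEFAULT_WAKE,	/* assumed fallback for non-energy-aware wake-ups */
};

/*
 * New tasks never take the energy-aware route because they have no
 * utilization history on any cpu. Existing tasks take it only when
 * energy-aware scheduling is enabled and the system is not over-utilized.
 */
static enum placement_path select_path(enum balance_reason reason,
				       bool energy_aware, bool overutilized)
{
	if (reason == BALANCE_FORK)
		return PATH_FIND_IDLEST;

	if (energy_aware && !overutilized)
		return PATH_ENERGY_AWARE_WAKE;

	return PATH_DEFAULT_WAKE;
}
```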

Since p was last scheduled on src_cpu its usage should still be
accounted for in the blocked utilization of that cpu. At wake-up we are
effectively turning blocked utilization into runnable utilization. The
cpu usage (get_cpu_usage()) is the sum of the two and this is the basis for
the energy calculations. So if we migrate the task at wake-up we should
remove the task utilization from the previous cpu and add it to dst_cpu.
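
The accounting above can be sketched roughly as follows. struct cpu_acct and
the helpers are hypothetical simplifications for illustration, not the
kernel's rq/cfs_rq bookkeeping, and the sketch assumes the task's blocked
contribution has not decayed.

```c
#include <assert.h>

/* Hypothetical, simplified per-cpu accounting (not struct rq). */
struct cpu_acct {
	unsigned long runnable_util;	/* utilization of queued tasks */
	unsigned long blocked_util;	/* residual utilization of sleeping tasks */
};

/* Analogous to get_cpu_usage(): sum of runnable and blocked utilization. */
static unsigned long cpu_usage(const struct cpu_acct *c)
{
	return c->runnable_util + c->blocked_util;
}

/* Wake-up on the previous cpu: blocked becomes runnable, usage is unchanged. */
static void wake_in_place(struct cpu_acct *prev, unsigned long task_util)
{
	assert(prev->blocked_util >= task_util);
	prev->blocked_util -= task_util;
	prev->runnable_util += task_util;
}

/* Wake-up with migration: remove from src_cpu's blocked sum, add to dst_cpu. */
static void wake_and_migrate(struct cpu_acct *src, struct cpu_acct *dst,
			     unsigned long task_util)
{
	assert(src != dst);
	assert(src->blocked_util >= task_util);
	src->blocked_util -= task_util;
	dst->runnable_util += task_util;
}
```

In this model the usage delta seen by the energy calculation is -task_util on
src_cpu and +task_util on dst_cpu, which is what the src_cpu/dst_cpu/
usage_delta fields of energy_env in the patch describe.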

As Sai has raised previously, it is not the full story. The blocked
utilization contribution of p on the previous cpu may have decayed while
the task utilization stored in p->se.avg has not. It is therefore
misleading to subtract the non-decayed utilization from src_cpu blocked
utilization. It is on the todo-list to fix that issue.
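
A rough sketch of why this is misleading, under loud assumptions: the decay
is modelled coarsely as one halving per full 32-period half-life (the real
geometric series decays continuously), and all function names are
hypothetical. The clamped subtraction at the end is only one conceivable
guard against wrap-around, not the fix referred to above.

```c
/* Coarse decay model: halve once per full 32-period half-life (assumption). */
static unsigned long decay_blocked(unsigned long util, unsigned int periods)
{
	unsigned int half_lives = periods / 32;

	if (half_lives >= 8 * sizeof(unsigned long))
		return 0;
	return util >> half_lives;
}

/*
 * The task's stored utilization is unchanged while it sleeps, but its
 * share of the cpu's blocked sum has decayed. Subtracting the stored
 * value therefore removes this much more than is actually there.
 */
static unsigned long over_subtraction(unsigned long task_util,
				      unsigned int periods_blocked)
{
	return task_util - decay_blocked(task_util, periods_blocked);
}

/* Subtract without letting the blocked sum wrap below zero. */
static unsigned long sub_clamped(unsigned long blocked_sum,
				 unsigned long task_util)
{
	return blocked_sum > task_util ? blocked_sum - task_util : 0;
}
```

For example, a task with utilization 400 that has been blocked for 64 periods
contributes only 100 to the blocked sum in this model, so subtracting 400
underestimates src_cpu usage by 300.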

Does that make any sense?

Morten