linux-kernel - Re: [PATCH 4/4] sched: bias to target cpu load to reduce task moving

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20140107125518.GE2936@e103034-lin>
Date:	Tue, 7 Jan 2014 12:55:18 +0000
From:	Morten Rasmussen <morten.rasmussen@....com>
To:	Alex Shi <alex.shi@...aro.org>
Cc:	Peter Zijlstra <peterz@...radead.org>,
	"mingo@...hat.com" <mingo@...hat.com>,
	"vincent.guittot@...aro.org" <vincent.guittot@...aro.org>,
	"daniel.lezcano@...aro.org" <daniel.lezcano@...aro.org>,
	"fweisbec@...il.com" <fweisbec@...il.com>,
	"linux@....linux.org.uk" <linux@....linux.org.uk>,
	"tony.luck@...el.com" <tony.luck@...el.com>,
	"fenghua.yu@...el.com" <fenghua.yu@...el.com>,
	"tglx@...utronix.de" <tglx@...utronix.de>,
	"akpm@...ux-foundation.org" <akpm@...ux-foundation.org>,
	"arjan@...ux.intel.com" <arjan@...ux.intel.com>,
	"pjt@...gle.com" <pjt@...gle.com>,
	"fengguang.wu@...el.com" <fengguang.wu@...el.com>,
	"james.hogan@...tec.com" <james.hogan@...tec.com>,
	"jason.low2@...com" <jason.low2@...com>,
	"gregkh@...uxfoundation.org" <gregkh@...uxfoundation.org>,
	"hanjun.guo@...aro.org" <hanjun.guo@...aro.org>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH 4/4] sched: bias to target cpu load to reduce task moving

On Mon, Jan 06, 2014 at 01:35:39PM +0000, Alex Shi wrote:
> On 01/03/2014 12:04 AM, Morten Rasmussen wrote:
> > On Wed, Dec 25, 2013 at 02:58:26PM +0000, Alex Shi wrote:
> >>
> >>>> From 5cd67d975001edafe2ee820e0be5d86881a23bd6 Mon Sep 17 00:00:00 2001
> >>>> From: Alex Shi <alex.shi@...aro.org>
> >>>> Date: Sat, 23 Nov 2013 23:18:09 +0800
> >>>> Subject: [PATCH 4/4] sched: bias to target cpu load to reduce task moving
> >>>>
> >>>> Task migration happens when target just a bit less then source cpu load.
> >>>> To reduce such situation happens, aggravate the target cpu load with
> >>>> sd->imbalance_pct/100 in wake_affine.
> >>>>
> >>>> In find_idlest/busiest_group, change the aggravate to local cpu only
> >>>> from old group aggravation.
> >>>>
> >>>> on my pandaboard ES.
> >>>>
> >>>>    latest kernel 527d1511310a89            + whole patchset
> >>>> hackbench -T -g 10 -f 40
> >>>>    23.25"                                  21.99"
> >>>>    23.16"                                  21.20"
> >>>>    24.24"                                  21.89"
> >>>> hackbench -p -g 10 -f 40
> >>>>    26.52"                                  21.46"
> >>>>    23.89"                                  22.96"
> >>>>    25.65"                                  22.73"
> >>>> hackbench -P -g 10 -f 40
> >>>>    20.14"                                  19.72"
> >>>>    19.96"                                  19.10"
> >>>>    21.76"                                  20.03"
> >>>>
> >>>> Signed-off-by: Alex Shi <alex.shi@...aro.org>
> >>>> ---
> >>>>  kernel/sched/fair.c | 35 ++++++++++++++++-------------------
> >>>>  1 file changed, 16 insertions(+), 19 deletions(-)
> >>>>
> >>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >>>> index bccdd89..3623ba4 100644
> >>>> --- a/kernel/sched/fair.c
> >>>> +++ b/kernel/sched/fair.c
> >>>> @@ -978,7 +978,7 @@ static inline unsigned long group_weight(struct task_struct *p, int nid)
> >>>>
> >>>>  static unsigned long weighted_cpuload(const int cpu);
> >>>>  static unsigned long source_load(int cpu);
> >>>> -static unsigned long target_load(int cpu);
> >>>> +static unsigned long target_load(int cpu, int imbalance_pct);
> >>>>  static unsigned long power_of(int cpu);
> >>>>  static long effective_load(struct task_group *tg, int cpu, long wl, long wg);
> >>>>
> >>>> @@ -3809,11 +3809,17 @@ static unsigned long source_load(int cpu)
> >>>>   * Return a high guess at the load of a migration-target cpu weighted
> >>>>   * according to the scheduling class and "nice" value.
> >>>>   */
> >>>> -static unsigned long target_load(int cpu)
> >>>> +static unsigned long target_load(int cpu, int imbalance_pct)
> >>>>  {
> >>>>    struct rq *rq = cpu_rq(cpu);
> >>>>    unsigned long total = weighted_cpuload(cpu);
> >>>>
> >>>> +  /*
> >>>> +   * without cpu_load decay, in most of time cpu_load is same as total
> >>>> +   * so we need to make target a bit heavier to reduce task migration
> >>>> +   */
> >>>> +  total = total * imbalance_pct / 100;
> >>>> +
> >>>>    if (!sched_feat(LB_BIAS))
> >>>>            return total;
> >>>>
> >>>> @@ -4033,7 +4039,7 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
> >>>>    this_cpu  = smp_processor_id();
> >>>>    prev_cpu  = task_cpu(p);
> >>>>    load      = source_load(prev_cpu);
> >>>> -  this_load = target_load(this_cpu);
> >>>> +  this_load = target_load(this_cpu, 100);
> >>>>
> >>>>    /*
> >>>>     * If sync wakeup then subtract the (maximum possible)
> >>>> @@ -4089,7 +4095,7 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
> >>>>
> >>>>    if (balanced ||
> >>>>        (this_load <= load &&
> >>>> -       this_load + target_load(prev_cpu) <= tl_per_task)) {
> >>>> +       this_load + target_load(prev_cpu, 100) <= tl_per_task)) {
> >>>>            /*
> >>>>             * This domain has SD_WAKE_AFFINE and
> >>>>             * p is cache cold in this domain, and
> >>>> @@ -4112,7 +4118,6 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
> >>>>  {
> >>>>    struct sched_group *idlest = NULL, *group = sd->groups;
> >>>>    unsigned long min_load = ULONG_MAX, this_load = 0;
> >>>> -  int imbalance = 100 + (sd->imbalance_pct-100)/2;
> >>>>
> >>>>    do {
> >>>>            unsigned long load, avg_load;
> >>>> @@ -4132,10 +4137,10 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
> >>>>
> >>>>            for_each_cpu(i, sched_group_cpus(group)) {
> >>>>                    /* Bias balancing toward cpus of our domain */
> >>>> -                  if (local_group)
> >>>> +                  if (i == this_cpu)
> >>>
> >>> What is the motivation for changing the local_group load calculation?
> >>> Now the load contributions of all cpus in the local group, except
> >>> this_cpu, will contribute more as their contribution (this_load) is
> >>> determined using target_load() instead.
> >>
> >> This part code 147cbb4bbe99, written in 2005 for x86, at that time, only
> >> 2 cores(guess no HT at that time) in cpu socket. With the cores number
> >
> > NUMA support was already present. I guess that means support for systems
> > with significantly more than two cpus.
> 
> Thanks a lot for comments, Morten!
> the find_idlest_group used in select_task_rq_fair, that designed to
> select cpu near by old cpu whenever for wake or fork tasks. So in common
> case, the selected cpu is rarely on another NUMA.

Yes. My point was just that there was support for more than two cpus
when this code was introduce, and group could therefore have more than
two cpus in them.

> >
> >> increasing trend, the sched_group become large and large, to give whole
> >> group this bias value is becoming non-sense. So it looks reasonable to
> >> just bias this cpu only.
> >
> > I think that is question whether you want to bias the whole group or
> > not at wake-up balancing. If you don't change the bias weight and only
> > apply it to this_cpu, you will be more prone to send the waking task
> > somewhere else.
> 
> We don't need to follow the exactly old logical. Guess people will be
> more happier to see the code/logical simple with better benchmark
> performance. :)

I'm not against changes, especially if they make things simpler or make
more sense. But IMHO, that is not the same as just removing code and
showing benchmark improvements. There must be an explaination of why the
new code works better and the resulting code needs to make sense. 

> >
> >>>
> >>> If I'm not mistaken, that will lead to more frequent load balancing as
> >>> the local_group bias has been reduced. That is the opposite of your
> >>> intentions based on your comment in target_load().
> >>>
> >>>>                            load = source_load(i);
> >>>>                    else
> >>>> -                          load = target_load(i);
> >>>> +                          load = target_load(i, sd->imbalance_pct);
> >>>
> >>> You scale by sd->imbalance_pct instead of 100+(sd->imbalance_pct-100)/2
> >>> that you removed above. sd->imbalance_pct may have been arbitrarily
> >>> chosen in the past, but changing it may affect behavior.
> >>
> >> the first commit with this line:
> >> int imbalance = 100 + (sd->imbalance_pct-100)/2;
> >> is 147cbb4bbe99, that also has no explanation for this imbalance value.
> >
> > 100 + (sd->imbalance_pct-100)/2 is used several places in sched.c before
> > 147cbb4bbe99. It seems that it was used to encourage balancing in
> > certain situations such as wake-up, while sd->imbalance_pct was used
> > unmodified in other situations.
> >
> > That seems to still be the case in v3.13-rc6. find_busiest_group() and
> > task_numa_compare() use sd->imbalance_pct, while task_numa_migrate(),
> > wake_affine(), and find_idlest_group() use 100 + (sd->imbalance_pct -
> > 100)/2. The pattern seems to be to use the reduced imbalance_pct for
> > wake-up and the full imbalance_pct for balancing of runnable tasks.
> >
> > AFAICT, using sd->imbalance_pct in find_idlest_group() will increase the
> > bias towards the local group. However, I would not be comfortable
> > changing it without understanding the full implications.
> >
> >> AFAIK, at that time, maybe only 2 LCPU in one socket, so, guess it is
> >> reason to use half of imbalance_pct.
> >
> > Based on the comment removed in 147cbb4bbe99 it seems that the reason is
> > to encourage balancing at wake-up. It doesn't seem to have any relation
> > to the number of cpus.
> >
> >> but sched_domain is used widely for many kinds of platform/cpu type, the
> >> value is arbitrary too.
> >> Another reason to use it, I scaled any cpu load which is not this_cpu,
> >> so a bit conservation is to use origin imbalance_pct, not half of it.
> >
> > The net result is almost the same if the groups have two cpus each.
> > Stronger for groups with one cpu, and weaker for groups with more than
> > two cpus.
> >
> >>>
> >>>>
> >>>>                    avg_load += load;
> >>>>            }
> >>>> @@ -4151,7 +4156,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
> >>>>            }
> >>>>    } while (group = group->next, group != sd->groups);
> >>>>
> >>>> -  if (!idlest || 100*this_load < imbalance*min_load)
> >>>> +  if (!idlest || this_load < min_load)
> >>>>            return NULL;
> >>>>    return idlest;
> >>>>  }
> >>>> @@ -5476,9 +5481,9 @@ static inline void update_sg_lb_stats(struct lb_env *env,
> >>>>
> >>>>            nr_running = rq->nr_running;
> >>>>
> >>>> -          /* Bias balancing toward cpus of our domain */
> >>>> -          if (local_group)
> >>>> -                  load = target_load(i);
> >>>> +          /* Bias balancing toward dst cpu */
> >>>> +          if (env->dst_cpu == i)
> >>>> +                  load = target_load(i, env->sd->imbalance_pct);
> >>>
> >>> Here you do the same group load bias change as above.
> >>>
> >
> > I don't think it is right to change the bias here to only apply to the
> > dst_cpu. rebalance_domains()/load_balance() is balancing groups not just
> > pulling task to dst_cpu. It must pull enough tasks to balance the
> > groups, even if it means temporarily overloading dst_cpu. The bias
> > should therefore apply to the whole group.
> 
> I have different understanding for load_balance(). It the dst_cpu
> overloaded, other cpu need to move back some load. That cause more
> expensive task CS.

My understanding is that should_we_balance() decides which cpu is
eligible for doing the load balancing for a given domain (and the
domains above). That is, only one cpu in a group is allowed to load
balance between the local group and other groups. That cpu would
therefore be reponsible for pulling enough load that the groups are
balanced even if it means temporarily overloading itself. The other cpus
in the group will take care of load balancing the extra load within the
local group later.

If the balance cpu didn't overload itself it wouldn't always be able to
balance the groups, especially for large groups.

If my understanding is right, I think it is wrong to not bias the entire
target group.

> >
> >>>>            else
> >>>>                    load = source_load(i);
> >>>>
> >>>> @@ -5918,14 +5923,6 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
> >>>>            if ((local->idle_cpus < busiest->idle_cpus) &&
> >>>>                busiest->sum_nr_running <= busiest->group_weight)
> >>>>                    goto out_balanced;
> >>>> -  } else {
> >>>> -          /*
> >>>> -           * In the CPU_NEWLY_IDLE, CPU_NOT_IDLE cases, use
> >>>> -           * imbalance_pct to be conservative.
> >>>> -           */
> >>>> -          if (100 * busiest->avg_load <=
> >>>> -                          env->sd->imbalance_pct * local->avg_load)
> >>>> -                  goto out_balanced;
> >>>>    }
> >>>>
> >>>>  force_balance:
> >>>
> >>> As said my previous replies to this series, I think this problem should
> >>> be solved by fixing the cause of the problem, that is the cpu_load
> >>> calculation, instead of biasing the cpu_load where-ever it is used to
> >>> hide the problem.
> >>>
> >>> Doing a bit of git archaeology reveals that the cpu_load code goes back
> >>> to 7897986bad8f6cd50d6149345aca7f6480f49464 and that in the original
> >>> patch x86 was using *_idx > 0 for all sched_domain levels except SMT. In
> >>> my opinion that made logical sense. If we are about to change to *_idx=0
> >>> we are removing the main idea behind that code and there needs to be a
> >>> new one. Otherwise, cpu_load doesn't make sense.
> >>
> >> The commit is written in 2005. The CFS scheduler merged into kernel in
> >> Oct 2007. it is a too old legacy for us ...
> >
> > And that is why I think cpu_load should be reconsidered. It doesn't make
> > sense to cripple the cpu_load code without fully understanding what it
> > is doing and then try to hide the problems it causes.
> 
> I understand your concern, Morten.
> But if no one explain well what these code for, it means it run out of
> control, and should be out of kernel.

Fully agree. That is why I think all the users of cpu_load must be
considered as well. I doesn't make sense to rip out most of the cpu_load
code and change its behaviour without considering the code that uses
cpu_load.

I like the idea of cleaning up the cpu_load stuff, but it must be
replaced with something that makes sense. Otherwise, we just save the
problems for later and carry around unexplainable code until it happens.

> On the contrary, the new code has
> the better performance both on arm and x86 -- I'd like take fengguang's
> testing result as advantages, since the task migration and CS reduced.
> So, consider above reasons, I don't know how to refuse a better
> performance code, but keep the unclean legacy.

I may have missed something, but I don't understand the reason for the
performance improvements that you are reporting. I see better numbers
for a few benchmarks, but I still don't understand why the code makes
sense after the cleanup. If we don't understand why it works, we cannot
be sure that it doesn't harm other benchmarks. There is always a chance
that we miss something but, IMHO, not having any idea to begin with
increases the chances for problems later significantly. So why not get
to the bottom of the problem of cleaning up cpu_load?

Have you done more extensive benchmarking? Have you seen any regressions
in other benchmarks?

Morten
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/