Message-ID: <50FF52A2.9010604@linux.vnet.ibm.com>
Date:	Wed, 23 Jan 2013 11:01:54 +0800
From:	Michael Wang <wangyun@...ux.vnet.ibm.com>
To:	Mike Galbraith <bitbucket@...ine.de>
CC:	linux-kernel@...r.kernel.org, mingo@...hat.com,
	peterz@...radead.org, mingo@...nel.org, a.p.zijlstra@...llo.nl
Subject: Re: [RFC PATCH 0/2] sched: simplify the select_task_rq_fair()

On 01/22/2013 07:34 PM, Mike Galbraith wrote:
> On Tue, 2013-01-22 at 16:56 +0800, Michael Wang wrote: 
>> On 01/22/2013 04:03 PM, Mike Galbraith wrote:
>> [snip]
>>> ... 
>>>>>
>>>>> That was with your change backed out, and the q/d below applied.
>>>>
>>>> So that change will help to solve the issue? Good to know :)
>>>>
>>>> But it will invoke wake_affine() without any delay, so the benefit
>>>> of the patch set will be reduced a lot...
>>>
>>> Yeah, I used size large hammer.
>>>
>>>> I think this change helps to solve the issue because it avoids
>>>> jumping into the balance path on wakeup in any case. I think we can
>>>> make a change like the one below to achieve this and meanwhile keep
>>>> the benefit of delaying wake_affine().
>>>
>>> Yup, I killed it all the way dead.  I'll see what below does.
>>>
>>> I don't really see the point of the wake_affine() change in this set
>>> though.  Its purpose is to decide if a pull is ok or not.  If we don't
>>> need its opinion when we look for a (momentarily?) idle core in
>>> this_domain, we shouldn't need it at all, and could just delete it.
>>
>> I have a question here, so wake_affine() is:
>> A. check whether it is balance to pull.
>> B. check whether it's better to pull than not.
> 
> A, "is it ok to move this guy to where red hot data awaits" is the way
> it has always been used, ie "can we move him here without upsetting
> balance too much", with sync hint meaning the waker is likely going to
> sleep very soon, so we pretend he's already gone when looking at load.
> That sync hint btw doesn't have anywhere near as much real meaning as
> would be nice to have.

Agreed.
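
Just to check that I read A correctly, below is a toy model of the question
I understand wake_affine() to be answering (purely illustrative userspace
code with made-up loads and a made-up imbalance_pct slack, not the real
kernel implementation):

/*
 * Toy model of interpretation A: "can we pull the wakee next to the
 * waker without upsetting balance too much?".  The sync hint means we
 * pretend the waker is about to sleep, so its load is dropped before
 * the comparison.
 */
#include <stdio.h>

static int can_pull(unsigned long this_load, unsigned long prev_load,
		    unsigned long wakee_load, unsigned long waker_load,
		    int sync, unsigned int imbalance_pct)
{
	if (sync && this_load >= waker_load)
		this_load -= waker_load;	/* waker is (pretend) gone */

	/* "go for it" only when the load numbers stay within parameters */
	return (this_load + wakee_load) * 100 <= prev_load * imbalance_pct;
}

int main(void)
{
	/* waker's cpu looks busy, but the sync hint says the waker will sleep */
	printf("%d\n", can_pull(1024, 512, 256, 1024, 1, 125));	/* 1: ok */
	printf("%d\n", can_pull(2048, 512, 256, 1024, 0, 125));	/* 0: not ok */
	return 0;
}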

> 
>> I suppose it's A, so my logic is:
>> 1. find idle cpu in prev domain.
>> 2. if failed and affine, find idle cpu in current domain.
> 
> Hm.  If cpu and prev_cpu are cache affine, you already searched both.
> 

Well, that's true if cache-affine cpus always share the same sd
topology, but do we have a guarantee of that?

>> 3. if an idle cpu is found in the current domain, check with
>> wake_affine() whether it is balanced to pull.
>> 4. if all that failed, two choices: go to the balance path or directly
>> return prev_cpu.
>>
>> So I still need wake_affine() for a final check, but to be honest, I
>> really doubt whether it's worth caring about balance at wakeup; if the
>> task just runs for several ns and then sleeps again, it's totally
>> worthless...
> 
> Not totally worthless, but questionable yes.  It really matters most
> when wakee was way over there in cache foo for whatever reason, has no
> big footprint, and red hot data is waiting here in cache bar.  Light
> tasks migrating helps communicating buddies find each other and perform.
> 
> The new NUMA stuff will help heavy tasks, but it won't help with light
> tasks that could benefit by moving to the data.  We currently try to
> migrate on wakeup, if we do stop doing that, we may get hurt more often
> than not, dunno.  Benchmarks will tell.

Agreed. For this patch set, until we have proof that doing the balance
at wakeup is worse than not doing it, I will keep doing it. Well, I
don't think the answer will come easily; we need some numbers showing
when a task is worth balancing at wakeup, and I even doubt whether such
numbers are possible to get...
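
To put the flow I'm proposing in one place, here is a stand-alone toy
walk-through (hard-coded stub helpers standing in for
select_idle_sibling()/wake_affine(), obviously not the real
select_task_rq_fair(); the actual change is in the patch quoted below):

#include <stdio.h>

#define NO_CPU (-1)

/* stubs: pretend only cpu 5 (next to cpu 4) is idle, and pulling is balanced */
static int idle_cpu_near(int cpu)		{ return cpu == 4 ? 5 : NO_CPU; }
static int affine(int cpu, int prev_cpu)	{ return 1; }
static int pull_is_balanced(void)		{ return 1; }

static int pick_cpu(int cpu, int prev_cpu)
{
	/* 1. look for an idle cpu in the topology of prev_cpu */
	int new_cpu = idle_cpu_near(prev_cpu);
	if (new_cpu != NO_CPU)
		return new_cpu;

	/* 2. if that failed and cpu is affine to prev_cpu, look around cpu */
	if (cpu != prev_cpu && affine(cpu, prev_cpu)) {
		new_cpu = idle_cpu_near(cpu);
		/* 3. found one here, so ask the wake_affine() style check */
		if (new_cpu != NO_CPU && pull_is_balanced())
			return new_cpu;
	}

	/* 4. all failed: take prev_cpu (or fall into the balance path) */
	return prev_cpu;
}

int main(void)
{
	printf("picked cpu %d\n", pick_cpu(4, 0));	/* -> 5, idle near the waker */
	return 0;
}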

> 
>>> If we ever enter balance_path, we can't possibly induce imbalance without
>>> there being something broken in that path, no?
>>
>> So your opinion is that something is broken in the new balance path?
> 
> I don't know that, but it is a logical bug candidate.

And you are right, I think I found something in the debug info you
showed, thanks again for that :)

> 
>>> BTW, it could well be that an unpatched kernel will collapse as well if
>>> WAKE_BALANCE is turned on.  I've never tried that on a largish box, as
>>> doing any of the wakeup time optional stuff used to make tbench scream.
>>>
>>>> Since the issue could not be reproduced on my side, I don't know
>>>> whether the patch benefits or not, so if you are willing to send out
>>>> a formal patch, I would be glad to include it in my patch set ;-)
>>>
>>> Just changing to scan prev_cpu before considering pulling would put a
>>> big dent in the bouncing cow problem, but that's the intriguing thing
>>> about this set.. 
>>
>> So that's my first question: if wake_affine() returning 1 means it's
>> better to pull than not, then the new way may be harmful; but if it just
>> tells us that pulling won't break the balance, then I still think the
>> current domain is just a backup, not the first-choice candidate.
> 
> wake_affine() doesn't know if it'll be a good or bad move, it only says
> go for it, load numbers are within parameters.
> 
>>> can we have the tbench and pgbench big box gain without
>>> a lot of pain to go with it?  Small boxen will surely benefit, pretty
>>> much can't be hurt, but what about all those fast/light tasks that won't
>>> hop across nodes to red hot data?
>>
>> I don't get it... does "a task won't hop" mean the task is always
>> selected to run on prev_cpu?
> 
> Yes.  If wakee is light, has no large footprint to later have to drag to
> its new home, moving it to the hot data is a win.
> 
>> We will assign an idle cpu if we find one, but if not, we can use
>> prev_cpu or go to the balance path and find one, so what's the problem
>> here?
> 
> Wakeup latency may be low, but the task can still perform badly due to
> misses.  In the tbench case, cross node data misses aren't anywhere near
> as bad as the _everything_ is a miss you get from waker/wakee bouncing
> all over a single shared L3. 

Hmm... that seems like another point which can only be decided by
benchmarks.

Regards,
Michael Wang

> 
>>> No formal patch is likely to result from any testing I do atm at least.
>>> I'm testing your patches because I see potential, I really want it to
>>> work out, but have to see it do that with my own two beady eyeballs ;-)
>>
>> Got it.
>>
>>>
>>>> And another patch below is a debug one, which will print out all the
>>>> sbm info, so we can check whether it was initialized correctly; just
>>>> use the command "dmesg | grep WYT" to show the map.
>>
>> What about this patch? Maybe a wrong map is the killer on the balance
>> path, should we check it? ;-)
> 
> Yeah, I haven't actually looked for any booboos, just ran it straight
> out of the box ;-)
> 
>> Regards,
>> Michael Wang
>>
>>>>
>>>> Regards,
>>>> Michael Wang
>>>>
>>>> ---
>>>>  kernel/sched/fair.c |   42 +++++++++++++++++++++++++-----------------
>>>>  1 files changed, 25 insertions(+), 17 deletions(-)
>>>>
>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>> index 2aa26c1..4361333 100644
>>>> --- a/kernel/sched/fair.c
>>>> +++ b/kernel/sched/fair.c
>>>> @@ -3250,7 +3250,7 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
>>>>  }
>>>>
>>>>  /*
>>>> - * Try and locate an idle CPU in the sched_domain.
>>>> + * Try and locate an idle CPU in the sched_domain, return -1 if failed.
>>>>   */
>>>>  static int select_idle_sibling(struct task_struct *p, int target)
>>>>  {
>>>> @@ -3292,13 +3292,13 @@ static int select_idle_sibling(struct task_struct *p, int target)
>>>>
>>>>                         target = cpumask_first_and(sched_group_cpus(sg),
>>>>                                         tsk_cpus_allowed(p));
>>>> -                       goto done;
>>>> +                       return target;
>>>>  next:
>>>>                         sg = sg->next;
>>>>                 } while (sg != sd->groups);
>>>>         }
>>>> -done:
>>>> -       return target;
>>>> +
>>>> +       return -1;
>>>>  }
>>>>
>>>>  /*
>>>> @@ -3342,40 +3342,48 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
>>>>                  * may has already been cached on prev_cpu, and usually
>>>>                  * they require low latency.
>>>>                  *
>>>> -                * So firstly try to locate an idle cpu shared the cache
>>>> +                * Therefore, the balance path in such a case may cause
>>>> +                * damage as well as bring benefit; waking up on prev_cpu
>>>> +                * may be better than waking up on a new, lower-load cpu
>>>> +                * because of the cached memory, and we never know.
>>>> +                *
>>>> +                * So the principle is: try to find an idle cpu as close to
>>>> +                * prev_cpu as possible; if that fails, just take prev_cpu.
>>>> +                *
>>>> +                * Firstly, try to locate an idle cpu sharing the cache
>>>>                  * with prev_cpu, it has the chance to break the load
>>>>                  * balance, fortunately, select_idle_sibling() will search
>>>>                  * from top to bottom, which help to reduce the chance in
>>>>                  * some cases.
>>>>                  */
>>>>                 new_cpu = select_idle_sibling(p, prev_cpu);
>>>> -               if (idle_cpu(new_cpu))
>>>> +               if (new_cpu != -1)
>>>>                         goto unlock;
>>>>
>>>>                 /*
>>>>                  * No idle cpu could be found in the topology of prev_cpu,
>>>> -                * before jump into the slow balance_path, try search again
>>>> -                * in the topology of current cpu if it is the affine of
>>>> -                * prev_cpu.
>>>> +                * before taking prev_cpu, try searching again in the
>>>> +                * topology of the current cpu if it is affine to prev_cpu.
>>>>                  */
>>>> -               if (cpu == prev_cpu ||
>>>> -                               !sbm->affine_map[prev_cpu] ||
>>>> +               if (cpu == prev_cpu || !sbm->affine_map[prev_cpu] ||
>>>>                                 !cpumask_test_cpu(cpu, tsk_cpus_allowed(p)))
>>>> -                       goto balance_path;
>>>> +                       goto take_prev;
>>>>
>>>>                 new_cpu = select_idle_sibling(p, cpu);
>>>> -               if (!idle_cpu(new_cpu))
>>>> -                       goto balance_path;
>>>> -
>>>>                 /*
>>>>                  * Invoke wake_affine() finally since it is no doubt a
>>>>                  * performance killer.
>>>>                  */
>>>> -               if (wake_affine(sbm->affine_map[prev_cpu], p, sync))
>>>> +               if ((new_cpu != -1) &&
>>>> +                       wake_affine(sbm->affine_map[prev_cpu], p, sync))
>>>>                         goto unlock;
>>>> +
>>>> +take_prev:
>>>> +               new_cpu = prev_cpu;
>>>> +               goto unlock;
>>>>         }
>>>>
>>>> -balance_path:
>>>> +       /* Balance path. */
>>>>         new_cpu = (sd_flag & SD_BALANCE_WAKE) ? prev_cpu : cpu;
>>>>         sd = sbm->sd[type][sbm->top_level[type]];
>>>>
>>>> -- 
>>>> 1.7.4.1
>>>>
>>>> DEBUG PATCH:
>>>>
>>>> ---
>>>>  kernel/sched/core.c |   30 ++++++++++++++++++++++++++++++
>>>>  1 files changed, 30 insertions(+), 0 deletions(-)
>>>>
>>>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>>>> index 0c63303..f251f29 100644
>>>> --- a/kernel/sched/core.c
>>>> +++ b/kernel/sched/core.c
>>>> @@ -5578,6 +5578,35 @@ static void update_top_cache_domain(int cpu)
>>>>  static int sbm_max_level;
>>>>  DEFINE_PER_CPU_SHARED_ALIGNED(struct sched_balance_map, sbm_array);
>>>>
>>>> +static void debug_sched_balance_map(int cpu)
>>>> +{
>>>> +       int i, type, level = 0;
>>>> +       struct sched_balance_map *sbm = &per_cpu(sbm_array, cpu);
>>>> +
>>>> +       printk("WYT: sbm of cpu %d\n", cpu);
>>>> +
>>>> +       for (type = 0; type < SBM_MAX_TYPE; type++) {
>>>> +               if (type == SBM_EXEC_TYPE)
>>>> +                       printk("WYT: \t exec map\n");
>>>> +               else if (type == SBM_FORK_TYPE)
>>>> +                       printk("WYT: \t fork map\n");
>>>> +               else if (type == SBM_WAKE_TYPE)
>>>> +                       printk("WYT: \t wake map\n");
>>>> +
>>>> +               for (level = 0; level < sbm_max_level; level++) {
>>>> +                       if (sbm->sd[type][level])
>>>> +                               printk("WYT: \t\t sd %p, idx %d, level %d, weight %d\n", sbm->sd[type][level], level, sbm->sd[type][level]->level, sbm->sd[type][level]->span_weight);
>>>> +               }
>>>> +       }
>>>> +
>>>> +       printk("WYT: \t affine map\n");
>>>> +
>>>> +       for_each_possible_cpu(i) {
>>>> +               if (sbm->affine_map[i])
>>>> +                       printk("WYT: \t\t affine with cpu %d in sd %p, weight %d\n", i, sbm->affine_map[i], sbm->affine_map[i]->span_weight);
>>>> +       }
>>>> +}
>>>> +
>>>>  static void build_sched_balance_map(int cpu)
>>>>  {
>>>>         struct sched_balance_map *sbm = &per_cpu(sbm_array, cpu);
>>>> @@ -5688,6 +5717,7 @@ cpu_attach_domain(struct sched_domain *sd, struct root_domain *rd, int cpu)
>>>>          * destroy_sched_domains() already do the work.
>>>>          */
>>>>         build_sched_balance_map(cpu);
>>>> +       debug_sched_balance_map(cpu);
>>>>         rcu_assign_pointer(rq->sbm, sbm);
>>>>  }
>>>>
>>>
>>>
>>
> 
> 

