linux-kernel - Re: wake_wide mechanism clarification

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <fb0b5802-10ab-3942-1cf1-6111523f0ec4@oracle.com>
Date:   Thu, 10 Aug 2017 12:41:01 -0500
From:   Atish Patra <atish.patra@...cle.com>
To:     Brendan Jackman <brendan.jackman@....com>
Cc:     Josef Bacik <josef@...icpanda.com>,
        Mike Galbraith <umgwanakikbuti@...il.com>,
        Joel Fernandes <joelaf@...gle.com>,
        Peter Zijlstra <peterz@...radead.org>,
        LKML <linux-kernel@...r.kernel.org>,
        Juri Lelli <Juri.Lelli@....com>,
        Dietmar Eggemann <dietmar.eggemann@....com>,
        Patrick Bellasi <patrick.bellasi@....com>,
        Chris Redpath <Chris.Redpath@....com>
Subject: Re: wake_wide mechanism clarification

On 08/10/2017 04:48 AM, Brendan Jackman wrote:
>
> On Wed, Aug 09 2017 at 21:22, Atish Patra wrote:
>> On 08/03/2017 10:05 AM, Brendan Jackman wrote:
>>>
>>> On Thu, Aug 03 2017 at 13:15, Josef Bacik wrote:
>>>> On Thu, Aug 03, 2017 at 11:53:19AM +0100, Brendan Jackman wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> On Fri, Jun 30 2017 at 17:55, Josef Bacik wrote:
>>>>>> On Fri, Jun 30, 2017 at 07:02:20PM +0200, Mike Galbraith wrote:
>>>>>>> On Fri, 2017-06-30 at 10:28 -0400, Josef Bacik wrote:
>>>>>>>> On Thu, Jun 29, 2017 at 08:04:59PM -0700, Joel Fernandes wrote:
>>>>>>>>
>>>>>>>>> That makes sense that we multiply slave's flips by a factor because
>>>>>>>>> its low, but I still didn't get why the factor is chosen to be
>>>>>>>>> llc_size instead of something else for the multiplication with slave
>>>>>>>>> (slave * factor).
>>>>>>>
>>>>>>>> Yeah I don't know why llc_size was chosen...
>>>>>>>
>>>>>>> static void update_top_cache_domain(int cpu)
>>>>>>> {
>>>>>>> struct sched_domain_shared *sds = NULL;
>>>>>>> struct sched_domain *sd;
>>>>>>> int id = cpu;
>>>>>>> int size = 1;
>>>>>>>
>>>>>>> sd = highest_flag_domain(cpu, SD_SHARE_PKG_RESOURCES);
>>>>>>> if (sd) {
>>>>>>> id = cpumask_first(sched_domain_span(sd));
>>>>>>> size = cpumask_weight(sched_domain_span(sd));
>>>>>>> sds = sd->shared;
>>>>>>> }
>>>>>>>
>>>>>>> rcu_assign_pointer(per_cpu(sd_llc, cpu), sd);
>>>>>>> per_cpu(sd_llc_size, cpu) = size;
>>>>>>>
>>>>>>> The goal of wake wide was to approximate when pulling would be a futile
>>>>>>> consolidation effort and counterproductive to scaling. 'course with
>>>>>>> ever increasing socket size, any 1:N waker is ever more likely to run
>>>>>>> out of CPU for its one and only self (slamming into scaling wall)
>>>>>>> before it needing to turn its minions loose to conquer the world.
>>>>>>>
>>>>>>> Something else to consider: network interrupt waking multiple workers
>>>>>>> at high frequency. If the waking CPU is idle, do you really want to
>>>>>>> place a worker directly in front of a tattoo artist, or is it better
>>>>>>> off nearly anywhere but there?
>>>>>>>
>>>>>>> If the box is virtual, with no topology exposed (or real but ancient)
>>>>>>> to let select_idle_sibling() come to the rescue, two workers can even
>>>>>>> get tattooed simultaneously (see sync wakeup).
>>>>>>>
>>>>>>
>>>>>> Heuristics are hard, news at 11.  I think messing with wake_wide() itself is too
>>>>>> big of a hammer, we probably need a middle ground.  I'm messing with it right
>>>>>> now so it's too early to say for sure, but i _suspect_ the bigger latencies we
>>>>>> see are not because we overload the cpu we're trying to pull to, but because
>>>>>> when we fail to do the wake_affine() we only look at siblings of the affine_sd
>>>>>> instead of doing the full "find the idlest cpu in the land!" thing.
>>>>>
>>>>> This is the problem I've been hitting lately. My use case is 1 task per
>>>>> CPU on ARM big.LITTLE (asymmetrical CPU capacity). The workload is 1
>>>>> task per CPU, they all do X amount of work then pthread_barrier_wait
>>>>> (i.e. sleep until the last task finishes its X and hits the barrier). On
>>>>> big.LITTLE, the tasks which get a "big" CPU finish faster, and then
>>>>> those CPUs pull over the tasks that are still running:
>>>>>
>>>>>      v CPU v           ->time->
>>>>>
>>>>>                     -------------
>>>>>    0  (big)         11111  /333
>>>>>                     -------------
>>>>>    1  (big)         22222   /444|
>>>>>                     -------------
>>>>>    2  (LITTLE)      333333/
>>>>>                     -------------
>>>>>    3  (LITTLE)      444444/
>>>>>                     -------------
>>>>>
>>>>> Now when task 4 hits the barrier (at |) and wakes the others up, there
>>>>> are 4 tasks with prev_cpu=<big> and 0 tasks with
>>>>> prev_cpu=<little>. Assuming that those wakeups happen on CPU4,
>>>>> regardless of wake_affine, want_affine means that we'll only look in
>>>>> sd_llc (cpus 0 and 1), so tasks will be unnecessarily coscheduled on the
>>>>> bigs until the next load balance, something like this:
>>>>>
>>>>>      v CPU v           ->time->
>>>>>
>>>>>                     ------------------------
>>>>>    0  (big)         11111  /333  31313\33333
>>>>>                     ------------------------
>>>>>    1  (big)         22222   /444|424\4444444
>>>>>                     ------------------------
>>>>>    2  (LITTLE)      333333/          \222222
>>>>>                     ------------------------
>>>>>    3  (LITTLE)      444444/            \1111
>>>>>                     ------------------------
>>>>>                                  ^^^
>>>>>                            underutilization
>>>>>
>>>>>> I _think_
>>>>>> the answer is to make select_idle_sibling() try less hard to find something
>>>>>> workable and only use obviously idle cpu's in the affine sd, and fall back to
>>>>>> the full load balance esque search.
>>>>>
>>>>> So this idea of allowing select_idle_sibling to fail, and falling back
>>>>> to the slow path, would help me too, I think.
>>>>
>>>> Unfortunately this statement of mine was wrong, I had it in my head that we
>>>> would fall back to a find the idlest cpu thing provided we failed to wake
>>>> affine, but we just do select_idle_sibling() and expect the load balancer to
>>>> move things around as needed.
>>>
>>> Ah yes, when wake_affine() returns false, we still do
>>> select_idle_sibling (except in prev_cpu's sd_llc instead of
>>> smp_processor_id()'s), and that is the problem faced by my workload. I
>>> thought you were suggesting to change the flow so that
>>> select_idle_sibling can say "I didn't find any idle siblings - go to the
>>> find_idlest_group path".
>>>
>>>>> This is also why I was playing with your
>>>>> don't-affine-recently-balanced-tasks patch[1], which also helps my case
>>>>> since it prevents want_affine for tasks 3 and 4 (which were recently
>>>>> moved by an active balance).
>>>>>
>>>>> [1] https://marc.info/?l=linux-kernel&m=150003849602535&w=2
>>>>>     (also linked elsewhere in this thread)
>>>>>
>>>>
>>>> Would you try peter's sched/experimental branch and see how that affects your
>>>> workload?  I'm still messing with my patches and I may drop this one as it now
>>>> appears to be too aggressive with the new set of patches.  Thanks,
>>>
>>> Sure, I'll take a look at those, thanks. I guess the idea of caching
>>> values in LB and then using them in wakeup[2] is a lighter-handed way of
>>> achieving the same thing as last_balance_ts? It won't solve my problem
>>> directly since we'll still only look in sd_llc, but I think it could be
>>> a basis for a way to say "go find_idlest_group path on these tasks" at
>>> the beginning of select_task_rq_fair.
>>>
>>> [2] https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/commit/?h=sched/experimental&id=5b4ed509027a5b6f495e6fe871cae850d5762bef
>>>
>>> Thanks,
>>> Brendan
>>>
>>
>> Would it be better if it searches for idle cpus in next higher domain
>> (if that is not NUMA) instead of doing find_idlest_group ?
>>
>> The above approach [2] will still try to search a idle cpu in the llc
>> domain of new_cpu. If it finds one great. Otherwise, let's search for an
>> idle cpu in next higher domain(if that is not NUMA) excluding the
>> current domain which we have already searched. I think it would help in
>> cases where LLC < NUMA.
>
> Well for my use case, "search for idle cpus in the next higher domain"
> and "go to find_busiest_group & co" mean the same thing.
Yeah. It just may be tad cheaper to do idle cpu search than to 
find_idlest_group and find_idlest_cpu.

But for NUMA
> boxes (where I'm guessing there are >=3 sched_domains?)
Yeah.
then yeah that
> sounds like a good idea at least to my inexpert ears. Currently it's
> either "only look at siblings of either prev_cpu or smp_processor_id()"
> or "look through the whole system (or at least as far as sd_flag takes
> us)", so maybe you want middle ground?
>
> On the other hand, I'm guessing remote wakeups across NUMA nodes are
> expensive, so that's why we only look at siblings?
Correct. But I am suggesting the above modification impacting only for 
archs where LLC < NUMA i.e. you have SMT, LLC, DIE/CPU, NUMA.. domains.
For your case, I guess the domains would stop at CPU. Any architecture 
with above domain hierarchy can search wide to find out an idle cpu.

Other architectures where NUMA is the next domain to LLC, it won't do
anything.

There's already a
> choice between prev_cpu's and smp_processor_id()'s LLC siblings.
>
Yes. So we may benefit by searching idle cpus other LLC groups as well.
Initially, I was worried about it may be too expensive to search wide
but Peter's recent patch (1ad3aaf3fcd2:Implement new approach to scale 
select_idle_cpu) should abandon the search if it becomes too costly.

Did I miss any case it may affect performance negatively ?

Regards,
Atish