linux-kernel - Re: wake_wide mechanism clarification

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAJWu+op3H+sb65m2qm6fCwHcNdGmPGpkML9z0X8nquH5zxyn4Q@mail.gmail.com>
Date:   Thu, 3 Aug 2017 16:48:53 -0700
From:   Joel Fernandes <joelaf@...gle.com>
To:     Michael Wang <yun.wang@...fitbricks.com>
Cc:     Mike Galbraith <umgwanakikbuti@...il.com>,
        Josef Bacik <josef@...icpanda.com>,
        Peter Zijlstra <peterz@...radead.org>,
        LKML <linux-kernel@...r.kernel.org>,
        Juri Lelli <Juri.Lelli@....com>,
        Dietmar Eggemann <dietmar.eggemann@....com>,
        Patrick Bellasi <patrick.bellasi@....com>,
        Brendan Jackman <brendan.jackman@....com>,
        Chris Redpath <Chris.Redpath@....com>,
        Matt Fleming <matt@...eblueprint.co.uk>
Subject: Re: wake_wide mechanism clarification

Hi Michael,

Thanks for your reply.

On Wed, Aug 2, 2017 at 1:26 AM, Michael Wang <yun.wang@...fitbricks.com> wrote:
> Hi, Joel
>
> On 07/29/2017 10:13 AM, Joel Fernandes wrote:
>> +Michael Wang on his current email address (old one bounced). (my
>> reply was to Mike Galbraith but I also meant to CC Michael Wang for
>> the discussion). Thanks
>
> Just back from vacation and saw this long long discussion...
>
> I think guys explained well on the idea, wake_wide() just try to filter
> out the cases that may congest the waker's domain, as much as possible
> without side effect (ideally).
>
> The factor at very beginning is just a static number which picked by
> enormous times of practice testing, Peter Zijlstr suggest we add the
> domain size and make it a flexible factor, which by practice not that bad.

Yeah making it flexible might be better.

>
> So the simple answer is we use the llc_size since no better option at
> that time :-)

:-)

>
> But things changing very fast and new feature can introduce new cases,
> I'm pretty sure if we redo the testing the results will be very different,
> however, the idea itself still make sense to me, at least on theory.
>
> Recently I'm also thinking about the scheduler issue, cfs try to find out
> general solution for all these cases and the best answer is obviously, all
> the cases will suffer some damage and scheduler itself bloated to achieve
> the goal 'taking care of all'.

My team is definitely seeing some weirdness with wake_wide() itself
not working correctly. We're collecting more data and do a more
thorough analysis of the behavior. I think it can be improved. Will
update soon as we have something.

thanks,

-Joel

>
> Regards,
> Michael Wang
>
>
>
>>
>> On Sat, Jul 29, 2017 at 1:01 AM, Joel Fernandes <joelaf@...gle.com> wrote:
>>> Hi Mike,
>>>
>>> I have take spent some time understanding the email thread and
>>> previous discussions. Unfortunately the second condition we are
>>> checking for in the wake_wide still didn't make sense to me (mentioned
>>> below) :-(
>>>
>>> On Fri, Jun 30, 2017 at 10:02 AM, Mike Galbraith
>>> <umgwanakikbuti@...il.com> wrote:
>>>> On Fri, 2017-06-30 at 10:28 -0400, Josef Bacik wrote:
>>>>> On Thu, Jun 29, 2017 at 08:04:59PM -0700, Joel Fernandes wrote:
>>>>>
>>>>>> That makes sense that we multiply slave's flips by a factor because
>>>>>> its low, but I still didn't get why the factor is chosen to be
>>>>>> llc_size instead of something else for the multiplication with slave
>>>>>> (slave * factor).
>>>>
>>>>> Yeah I don't know why llc_size was chosen...
>>>>
>>>> static void update_top_cache_domain(int cpu)
>>>> {
>>>>         struct sched_domain_shared *sds = NULL;
>>>>         struct sched_domain *sd;
>>>>         int id = cpu;
>>>>         int size = 1;
>>>>
>>>>         sd = highest_flag_domain(cpu, SD_SHARE_PKG_RESOURCES);
>>>>         if (sd) {
>>>>                 id = cpumask_first(sched_domain_span(sd));
>>>>                 size = cpumask_weight(sched_domain_span(sd));
>>>>                 sds = sd->shared;
>>>>         }
>>>>
>>>>         rcu_assign_pointer(per_cpu(sd_llc, cpu), sd);
>>>>         per_cpu(sd_llc_size, cpu) = size;
>>>>
>>>> The goal of wake wide was to approximate when pulling would be a futile
>>>> consolidation effort and counterproductive to scaling.  'course with
>>>> ever increasing socket size, any 1:N waker is ever more likely to run
>>>> out of CPU for its one and only self (slamming into scaling wall)
>>>> before it needing to turn its minions loose to conquer the world.
>>>
>>> Actually the original question was why do we have the second condition
>>> as "master < slave * factor", instead of "master < factor". that's
>>> what didn't make sense to me. Why don't we return 0 from wake_wide if
>>> master < factor ?
>>>
>>> Infact, as the factor is set to the llc_size, I think the condition
>>> that makes sense to me is:
>>>
>>> if ((master + slave) < llc_size)
>>>   return 0;
>>>
>>> In other words, if the master flips and the slave flips are totally
>>> higher than the llc_size, then we are most likely waking up too many
>>> tasks as affine and should then switch to wide to prevent overloading.
>>>
>>> Digging further into the original patch from Michael Wang (I also CC'd
>>> him), this was the code (before you had changed it to master/slave):
>>>
>>> wakee->nr_wakee_switch > factor &&
>>> waker->nr_wakee_switch > (factor * wakee->nr_wakee_switch)
>>>
>>> To explain the second condition above, Michael Wang said the following in [1]
>>>
>>> "Furthermore, if waker also has a high 'nr_wakee_switch', imply that multiple
>>> tasks rely on it, then waker's higher latency will damage all of them, pull
>>> wakee seems to be a bad deal."
>>>
>>> Again I didn't follow why the second condition couldn't just be:
>>> waker->nr_wakee_switch > factor, or, (waker->nr_wakee_switch +
>>> wakee->nr_wakee_switch) > factor, based on the above explanation from
>>> Micheal Wang that I quoted.
>>> and why he's instead doing the whole multiplication thing there that I
>>> was talking about earlier: "factor * wakee->nr_wakee_switch".
>>>
>>> Rephrasing my question in another way, why are we talking the ratio of
>>> master/slave instead of the sum when comparing if its > factor? I am
>>> surely missing something here.
>>>
>>> Just taking an example:
>>>
>>> Say we have llc_size = 3, we have 3 masters M1, M2 and M3. M1 has 8
>>> slaves, M2 has 4 slaves and M3 has 4 slaves. Only 1 slave is common
>>> between all 3 masters. Also to make it a bit more interesting, let s8
>>> wake up some random task T0. A diagram to show the master/slave
>>> relation ships could look like:
>>>
>>>                                     +-----+
>>>                                     |     |
>>> +------------------------+   +------+ M2  |
>>> |                        |   |      |     |
>>> |        M1              |   |      +--+--+----
>>> |                        |   |      |  |  |   |
>>> |                        |   |      |  |  |   s15
>>> +--+--+--+--+--+--+---+--+---+      v  v  v
>>>    |  |  |  |  |  |   |      |      s9 s10 s11
>>>    v  v  v  v  v  v   v      v
>>>    s1 s2 s3 s4 s5 s6  s7     s8 ---> T0
>>>                              ^
>>>                              |
>>>                            +-+---+
>>>                            |     |
>>>                            | M3  |
>>>                            |     |
>>>                            +--+--+-----
>>>                            |  |  |    |
>>>                            v  v  v    v
>>>                           s12 s13 s14 s16
>>>
>>>
>>> Lets consider the case of M1 waking up s8. As per the above diagram,
>>> M1 has 8 flips and s8 has 4 flips.
>>>
>>> With llc_size = 3, the condition
>>>
>>> (slave < factor) would return FALSE, so then we would turn to the
>>> (master < slave * factor) condition. This would be TRUE (8 < 4 * 3),
>>> so wake_wide would return 0 and would cause s8 to be woken up as
>>> affine with relation to M1's core.
>>>
>>> So basically, it seems the heuristic is saying (with help of the
>>> second condition - master < slave * factor). that Its a good idea for
>>> s8 to be affine-woken-up with respect to M1's core. Why is it a good
>>> idea to do that? It seems to me M1 has already several tasks its
>>> waking as affine so causing s8 to be woken up affine could be harmful
>>> and it may be a better choice to wake it up elsewhere.
>>>
>>> Thanks for your help!
>>>
>>> -Joel
>>>
>>> [1] https://lkml.org/lkml/2013/7/4/20
>>>
>>>
>>>>
>>>> Something else to consider: network interrupt waking multiple workers
>>>> at high frequency.  If the waking CPU is idle, do you really want to
>>>> place a worker directly in front of a tattoo artist, or is it better
>>>> off nearly anywhere but there?
>>>>
>>>> If the box is virtual, with no topology exposed (or real but ancient)
>>>> to let select_idle_sibling() come to the rescue, two workers can even
>>>> get tattooed simultaneously (see sync wakeup).
>>>>
>>>>         -Mike