Message-ID: <CAJWu+oo31Sv1j=5SK4aBO9jASKs-h10VwFMQ3y+Pwy6B5hdm8w@mail.gmail.com>
Date: Sat, 29 Jul 2017 01:13:43 -0700
From: Joel Fernandes <joelaf@...gle.com>
To: Mike Galbraith <umgwanakikbuti@...il.com>
Cc: Josef Bacik <josef@...icpanda.com>,
Peter Zijlstra <peterz@...radead.org>,
LKML <linux-kernel@...r.kernel.org>,
Juri Lelli <Juri.Lelli@....com>,
Dietmar Eggemann <dietmar.eggemann@....com>,
Patrick Bellasi <patrick.bellasi@....com>,
Brendan Jackman <brendan.jackman@....com>,
Chris Redpath <Chris.Redpath@....com>,
Matt Fleming <matt@...eblueprint.co.uk>,
Michael Wang <yun.wang@...fitbricks.com>
Subject: Re: wake_wide mechanism clarification
+Michael Wang on his current email address (the old one bounced). (My
reply was to Mike Galbraith but I also meant to CC Michael Wang for
the discussion.) Thanks

On Sat, Jul 29, 2017 at 1:01 AM, Joel Fernandes <joelaf@...gle.com> wrote:
> Hi Mike,
>
> I have spent some time understanding the email thread and previous
> discussions. Unfortunately the second condition we are checking for
> in wake_wide() still didn't make sense to me (mentioned below) :-(
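>
> For reference, the wake_wide() under discussion looks roughly like the
> below (reproduced from memory from kernel/sched/fair.c of this era, so
> treat it as a sketch rather than the exact source):
>
> static int wake_wide(struct task_struct *p)
> {
>         unsigned int master = current->wakee_flips;
>         unsigned int slave = p->wakee_flips;
>         int factor = this_cpu_read(sd_llc_size);
>
>         /* Treat whichever side flips more as the "master". */
>         if (master < slave)
>                 swap(master, slave);
>         /* The two conditions discussed below. */
>         if (slave < factor || master < slave * factor)
>                 return 0;
>         return 1;
> }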
>
> On Fri, Jun 30, 2017 at 10:02 AM, Mike Galbraith
> <umgwanakikbuti@...il.com> wrote:
>> On Fri, 2017-06-30 at 10:28 -0400, Josef Bacik wrote:
>>> On Thu, Jun 29, 2017 at 08:04:59PM -0700, Joel Fernandes wrote:
>>>
>>> > That makes sense that we multiply slave's flips by a factor because
>>> > its low, but I still didn't get why the factor is chosen to be
>>> > llc_size instead of something else for the multiplication with slave
>>> > (slave * factor).
>>
>>> Yeah I don't know why llc_size was chosen...
>>
>> static void update_top_cache_domain(int cpu)
>> {
>>         struct sched_domain_shared *sds = NULL;
>>         struct sched_domain *sd;
>>         int id = cpu;
>>         int size = 1;
>>
>>         sd = highest_flag_domain(cpu, SD_SHARE_PKG_RESOURCES);
>>         if (sd) {
>>                 id = cpumask_first(sched_domain_span(sd));
>>                 size = cpumask_weight(sched_domain_span(sd));
>>                 sds = sd->shared;
>>         }
>>
>>         rcu_assign_pointer(per_cpu(sd_llc, cpu), sd);
>>         per_cpu(sd_llc_size, cpu) = size;
>>
>> The goal of wake wide was to approximate when pulling would be a futile
>> consolidation effort and counterproductive to scaling. 'course with
>> ever increasing socket size, any 1:N waker is ever more likely to run
>> out of CPU for its one and only self (slamming into scaling wall)
>> before it needs to turn its minions loose to conquer the world.
>
> Actually the original question was why we have the second condition
> as "master < slave * factor" instead of "master < factor". That's
> what didn't make sense to me. Why don't we return 0 from wake_wide()
> if master < factor?
>
> In fact, as the factor is set to the llc_size, I think the condition
> that makes sense to me is:
>
> if ((master + slave) < llc_size)
>         return 0;
>
> In other words, if the master flips and the slave flips together add
> up to more than the llc_size, then we are most likely waking up too
> many tasks as affine and should switch to wide wakeups to prevent
> overloading.
>
> Digging further into the original patch from Michael Wang (I also CC'd
> him), this was the code (before you had changed it to master/slave):
>
> wakee->nr_wakee_switch > factor &&
> waker->nr_wakee_switch > (factor * wakee->nr_wakee_switch)
>
> To explain the second condition above, Michael Wang said the following in [1]
>
> "Furthermore, if waker also has a high 'nr_wakee_switch', imply that multiple
> tasks rely on it, then waker's higher latency will damage all of them, pull
> wakee seems to be a bad deal."
>
> Again I didn't follow why the second condition couldn't just be
> waker->nr_wakee_switch > factor, or (waker->nr_wakee_switch +
> wakee->nr_wakee_switch) > factor, based on the above explanation from
> Michael Wang that I quoted, and why he's instead doing the whole
> multiplication there that I was talking about earlier:
> "factor * wakee->nr_wakee_switch".
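>
> (For context, the per-task flip counter that both versions compare is
> maintained by record_wakee(); again reproduced roughly from memory, so
> treat this as a sketch. The count decays by half about once a second
> and is bumped whenever the waker switches to waking a different task
> than it woke last time.)
>
> static void record_wakee(struct task_struct *p)
> {
>         /* Decay the flip count roughly once per second. */
>         if (time_after(jiffies, current->wakee_flip_decay_ts + HZ)) {
>                 current->wakee_flips >>= 1;
>                 current->wakee_flip_decay_ts = jiffies;
>         }
>
>         /* Switching to a different wakee counts as one "flip". */
>         if (current->last_wakee != p) {
>                 current->last_wakee = p;
>                 current->wakee_flips++;
>         }
> }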
>
> Rephrasing my question another way, why are we taking the ratio of
> master/slave instead of the sum when comparing against the factor? I
> am surely missing something here.
>
> Just taking an example:
>
> Say we have llc_size = 3 and 3 masters M1, M2 and M3. M1 has 8
> slaves, M2 has 4 slaves and M3 has 4 slaves. Only 1 slave is common
> between all 3 masters. Also, to make it a bit more interesting, let
> s8 wake up some random task T0. A diagram to show the master/slave
> relationships could look like:
>
> +-----+
> | |
> +------------------------+ +------+ M2 |
> | | | | |
> | M1 | | +--+--+----
> | | | | | | |
> | | | | | | s15
> +--+--+--+--+--+--+---+--+---+ v v v
> | | | | | | | | s9 s10 s11
> v v v v v v v v
> s1 s2 s3 s4 s5 s6 s7 s8 ---> T0
> ^
> |
> +-+---+
> | |
> | M3 |
> | |
> +--+--+-----
> | | | |
> v v v v
> s12 s13 s14 s16
>
>
> Let's consider the case of M1 waking up s8. As per the above diagram,
> M1 has 8 flips and s8 has 4 flips.
>
> With llc_size = 3, the condition (slave < factor) would return FALSE
> (4 < 3 is false), so we would then evaluate the
> (master < slave * factor) condition. This would be TRUE (8 < 4 * 3),
> so wake_wide() would return 0, causing s8 to be woken up affine with
> respect to M1's core.
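>
> Spelling the same arithmetic out as a tiny standalone check (purely
> the hypothetical M1/s8 numbers from the diagram above):
>
> #include <stdio.h>
>
> int main(void)
> {
>         unsigned int master = 8;  /* M1's flips */
>         unsigned int slave = 4;   /* s8's flips */
>         unsigned int factor = 3;  /* llc_size */
>
>         if (slave < factor || master < slave * factor)
>                 printf("wake_wide() -> 0 (affine wakeup)\n");
>         else
>                 printf("wake_wide() -> 1 (wake wide)\n");
>         return 0;
> }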
>
> So basically, it seems the heuristic is saying (with the help of the
> second condition, master < slave * factor) that it's a good idea for
> s8 to be woken up affine with respect to M1's core. Why is it a good
> idea to do that? It seems to me M1 is already waking several tasks as
> affine, so causing s8 to be woken up affine as well could be harmful,
> and it may be a better choice to wake it up elsewhere.
>
> Thanks for your help!
>
> -Joel
>
> [1] https://lkml.org/lkml/2013/7/4/20
>
>
>>
>> Something else to consider: network interrupt waking multiple workers
>> at high frequency. If the waking CPU is idle, do you really want to
>> place a worker directly in front of a tattoo artist, or is it better
>> off nearly anywhere but there?
>>
>> If the box is virtual, with no topology exposed (or real but ancient)
>> to let select_idle_sibling() come to the rescue, two workers can even
>> get tattooed simultaneously (see sync wakeup).
>>
>> -Mike