linux-kernel - Re: wake_wide mechanism clarification

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <307308e7-fef2-2b3c-baa9-a96ec9ed2b8d@profitbricks.com>
Date:   Wed, 2 Aug 2017 10:26:28 +0200
From:   Michael Wang <yun.wang@...fitbricks.com>
To:     Joel Fernandes <joelaf@...gle.com>,
        Mike Galbraith <umgwanakikbuti@...il.com>
Cc:     Josef Bacik <josef@...icpanda.com>,
        Peter Zijlstra <peterz@...radead.org>,
        LKML <linux-kernel@...r.kernel.org>,
        Juri Lelli <Juri.Lelli@....com>,
        Dietmar Eggemann <dietmar.eggemann@....com>,
        Patrick Bellasi <patrick.bellasi@....com>,
        Brendan Jackman <brendan.jackman@....com>,
        Chris Redpath <Chris.Redpath@....com>,
        Matt Fleming <matt@...eblueprint.co.uk>
Subject: Re: wake_wide mechanism clarification

Hi, Joel

On 07/29/2017 10:13 AM, Joel Fernandes wrote:
> +Michael Wang on his current email address (old one bounced). (my
> reply was to Mike Galbraith but I also meant to CC Michael Wang for
> the discussion). Thanks

Just back from vacation and saw this long long discussion...

I think guys explained well on the idea, wake_wide() just try to filter
out the cases that may congest the waker's domain, as much as possible
without side effect (ideally).

The factor at very beginning is just a static number which picked by
enormous times of practice testing, Peter Zijlstr suggest we add the
domain size and make it a flexible factor, which by practice not that bad.

So the simple answer is we use the llc_size since no better option at
that time :-)

But things changing very fast and new feature can introduce new cases,
I'm pretty sure if we redo the testing the results will be very different,
however, the idea itself still make sense to me, at least on theory.

Recently I'm also thinking about the scheduler issue, cfs try to find out
general solution for all these cases and the best answer is obviously, all
the cases will suffer some damage and scheduler itself bloated to achieve
the goal 'taking care of all'.

So in order to achieve the maximum performance of particular workload, some
user defined scheduler would be an interesting idea :-P

Regards,
Michael Wang



> 
> On Sat, Jul 29, 2017 at 1:01 AM, Joel Fernandes <joelaf@...gle.com> wrote:
>> Hi Mike,
>>
>> I have take spent some time understanding the email thread and
>> previous discussions. Unfortunately the second condition we are
>> checking for in the wake_wide still didn't make sense to me (mentioned
>> below) :-(
>>
>> On Fri, Jun 30, 2017 at 10:02 AM, Mike Galbraith
>> <umgwanakikbuti@...il.com> wrote:
>>> On Fri, 2017-06-30 at 10:28 -0400, Josef Bacik wrote:
>>>> On Thu, Jun 29, 2017 at 08:04:59PM -0700, Joel Fernandes wrote:
>>>>
>>>>> That makes sense that we multiply slave's flips by a factor because
>>>>> its low, but I still didn't get why the factor is chosen to be
>>>>> llc_size instead of something else for the multiplication with slave
>>>>> (slave * factor).
>>>
>>>> Yeah I don't know why llc_size was chosen...
>>>
>>> static void update_top_cache_domain(int cpu)
>>> {
>>>         struct sched_domain_shared *sds = NULL;
>>>         struct sched_domain *sd;
>>>         int id = cpu;
>>>         int size = 1;
>>>
>>>         sd = highest_flag_domain(cpu, SD_SHARE_PKG_RESOURCES);
>>>         if (sd) {
>>>                 id = cpumask_first(sched_domain_span(sd));
>>>                 size = cpumask_weight(sched_domain_span(sd));
>>>                 sds = sd->shared;
>>>         }
>>>
>>>         rcu_assign_pointer(per_cpu(sd_llc, cpu), sd);
>>>         per_cpu(sd_llc_size, cpu) = size;
>>>
>>> The goal of wake wide was to approximate when pulling would be a futile
>>> consolidation effort and counterproductive to scaling.  'course with
>>> ever increasing socket size, any 1:N waker is ever more likely to run
>>> out of CPU for its one and only self (slamming into scaling wall)
>>> before it needing to turn its minions loose to conquer the world.
>>
>> Actually the original question was why do we have the second condition
>> as "master < slave * factor", instead of "master < factor". that's
>> what didn't make sense to me. Why don't we return 0 from wake_wide if
>> master < factor ?
>>
>> Infact, as the factor is set to the llc_size, I think the condition
>> that makes sense to me is:
>>
>> if ((master + slave) < llc_size)
>>   return 0;
>>
>> In other words, if the master flips and the slave flips are totally
>> higher than the llc_size, then we are most likely waking up too many
>> tasks as affine and should then switch to wide to prevent overloading.
>>
>> Digging further into the original patch from Michael Wang (I also CC'd
>> him), this was the code (before you had changed it to master/slave):
>>
>> wakee->nr_wakee_switch > factor &&
>> waker->nr_wakee_switch > (factor * wakee->nr_wakee_switch)
>>
>> To explain the second condition above, Michael Wang said the following in [1]
>>
>> "Furthermore, if waker also has a high 'nr_wakee_switch', imply that multiple
>> tasks rely on it, then waker's higher latency will damage all of them, pull
>> wakee seems to be a bad deal."
>>
>> Again I didn't follow why the second condition couldn't just be:
>> waker->nr_wakee_switch > factor, or, (waker->nr_wakee_switch +
>> wakee->nr_wakee_switch) > factor, based on the above explanation from
>> Micheal Wang that I quoted.
>> and why he's instead doing the whole multiplication thing there that I
>> was talking about earlier: "factor * wakee->nr_wakee_switch".
>>
>> Rephrasing my question in another way, why are we talking the ratio of
>> master/slave instead of the sum when comparing if its > factor? I am
>> surely missing something here.
>>
>> Just taking an example:
>>
>> Say we have llc_size = 3, we have 3 masters M1, M2 and M3. M1 has 8
>> slaves, M2 has 4 slaves and M3 has 4 slaves. Only 1 slave is common
>> between all 3 masters. Also to make it a bit more interesting, let s8
>> wake up some random task T0. A diagram to show the master/slave
>> relation ships could look like:
>>
>>                                     +-----+
>>                                     |     |
>> +------------------------+   +------+ M2  |
>> |                        |   |      |     |
>> |        M1              |   |      +--+--+----
>> |                        |   |      |  |  |   |
>> |                        |   |      |  |  |   s15
>> +--+--+--+--+--+--+---+--+---+      v  v  v
>>    |  |  |  |  |  |   |      |      s9 s10 s11
>>    v  v  v  v  v  v   v      v
>>    s1 s2 s3 s4 s5 s6  s7     s8 ---> T0
>>                              ^
>>                              |
>>                            +-+---+
>>                            |     |
>>                            | M3  |
>>                            |     |
>>                            +--+--+-----
>>                            |  |  |    |
>>                            v  v  v    v
>>                           s12 s13 s14 s16
>>
>>
>> Lets consider the case of M1 waking up s8. As per the above diagram,
>> M1 has 8 flips and s8 has 4 flips.
>>
>> With llc_size = 3, the condition
>>
>> (slave < factor) would return FALSE, so then we would turn to the
>> (master < slave * factor) condition. This would be TRUE (8 < 4 * 3),
>> so wake_wide would return 0 and would cause s8 to be woken up as
>> affine with relation to M1's core.
>>
>> So basically, it seems the heuristic is saying (with help of the
>> second condition - master < slave * factor). that Its a good idea for
>> s8 to be affine-woken-up with respect to M1's core. Why is it a good
>> idea to do that? It seems to me M1 has already several tasks its
>> waking as affine so causing s8 to be woken up affine could be harmful
>> and it may be a better choice to wake it up elsewhere.
>>
>> Thanks for your help!
>>
>> -Joel
>>
>> [1] https://lkml.org/lkml/2013/7/4/20
>>
>>
>>>
>>> Something else to consider: network interrupt waking multiple workers
>>> at high frequency.  If the waking CPU is idle, do you really want to
>>> place a worker directly in front of a tattoo artist, or is it better
>>> off nearly anywhere but there?
>>>
>>> If the box is virtual, with no topology exposed (or real but ancient)
>>> to let select_idle_sibling() come to the rescue, two workers can even
>>> get tattooed simultaneously (see sync wakeup).
>>>
>>>         -Mike