Date:   Wed, 7 Feb 2018 15:10:39 -0800
From:   Subhra Mazumdar <subhra.mazumdar@...cle.com>
To:     Peter Zijlstra <peterz@...radead.org>
Cc:     Steven Sistare <steven.sistare@...cle.com>,
        linux-kernel@...r.kernel.org, mingo@...hat.com,
        dhaval.giani@...cle.com
Subject: Re: [RESEND RFC PATCH V3] sched: Improve scalability of
 select_idle_sibling using SMT balance



On 02/07/2018 12:42 AM, Peter Zijlstra wrote:
> On Tue, Feb 06, 2018 at 04:30:03PM -0800, Subhra Mazumdar wrote:
>
>> I meant the SMT balance patch. That does comparison with only one other
>> random core and takes the decision in O(1). Any potential scan of all cores
>> or cpus is O(n) and doesn't scale and will only get worse in future. That
>> applies to both select_idle_core() and select_idle_cpu().
> We only do the full scan if we think to know there is indeed an idle
> core to be had, and if there are idle cores the machine isn't terribly
> busy.
>
> If there are no idle cores, we do not in fact scan everything. We limit
> the amount of scanning based on the average idle time with a minimum of
> 4.
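[For illustration, a minimal standalone sketch of the proportional scan-limit idea described above. This is not the actual kernel code; the names avg_idle, avg_scan_cost and span_weight are made up for the example and only mirror the concept of bounding the scan by average idle time, with a floor of 4.]

/*
 * Sketch: bound how many CPUs select_idle_cpu()-style scanning may visit,
 * in proportion to the expected idle time, never fewer than 4.
 */
#include <stdint.h>
#include <stdio.h>

static unsigned int scan_limit(uint64_t avg_idle, uint64_t avg_scan_cost,
                               unsigned int span_weight)
{
        uint64_t span_avg = (uint64_t)span_weight * avg_idle;
        unsigned int nr;

        /* Only scan many CPUs if the expected idle time pays for it. */
        if (avg_scan_cost && span_avg > 4 * avg_scan_cost)
                nr = (unsigned int)(span_avg / avg_scan_cost);
        else
                nr = 4;                 /* minimum of 4 CPUs */

        if (nr > span_weight)
                nr = span_weight;       /* never more than the LLC span */
        return nr;
}

int main(void)
{
        /* e.g. 5us average idle, 1.2us average scan cost, 16-CPU LLC */
        printf("scan up to %u cpus\n", scan_limit(5000, 1200, 16));
        return 0;
}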
This logic may not be working as well as you think. I had sent the cost of
select_idle_sibling() with and without my patch and there was a huge
difference:

Following is the cost (in us) of select_idle_sibling() with hackbench 16
groups:

function                 baseline-rc6  %stdev patch               %stdev
select_idle_sibling()    0.556         1.72    0.263 (-52.70%)     0.78
>
> (Except when you switch off things like SIS_PROP, then you scan
> unconditionally and reduce tail latency at the cost of throughput --
> like said, some people want this).
>
> O(1) sounds nice, but it has horrible worst case latencies.
>
> And like I said, I don't see how a rotor is particularly more random
> than the previous cpu the task you happen to be waking ran on.
The rotor randomness is not the main point here. I think the benchmark
improvements come from the fact that the cost of select_idle_sibling() is
reduced a lot. Reducing that cost while still maintaining a good spread of
threads is what this SMT balance scheme achieves; it needs a fast, reasonably
good random number generator, and the rotor is just an easy way to get one.
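[To make that concrete, a minimal standalone sketch of the O(1) comparison, not the patch itself; NR_CORES, smt_busy and llc_rotor are illustrative names.]

/*
 * Sketch: keep a per-core count of busy SMT siblings and, at wakeup,
 * compare the target core against one other core chosen by advancing a
 * per-LLC rotor, instead of scanning all cores.
 */
#include <stdatomic.h>
#include <stdio.h>

#define NR_CORES 8

static atomic_uint smt_busy[NR_CORES];  /* busy hardware threads per core */
static atomic_uint llc_rotor;           /* cheap pseudo-random core picker */

static unsigned int pick_core(unsigned int target)
{
        /* Advance the rotor; acts as a fast, decent-enough random index. */
        unsigned int other = atomic_fetch_add(&llc_rotor, 1) % NR_CORES;

        /* O(1): compare only the target core with the rotor-chosen core. */
        if (atomic_load(&smt_busy[other]) < atomic_load(&smt_busy[target]))
                return other;
        return target;
}

int main(void)
{
        atomic_store(&smt_busy[0], 2);
        atomic_store(&smt_busy[3], 1);
        printf("picked core %u\n", pick_core(0));
        return 0;
}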
>> Is there any reason this randomized approach is not acceptable even if
>> benchmarks show improvement? Are there other benchmarks I should try?
> Looking at just one other core has a fairly high chance of not finding
> idle. I really cannot remember all the benchmarks, but Mike did
> something a little less random but still O(1) a few years back and that
> crashed and burned.
>
>> Also your suggestion to keep the SMT utilization but still do a traversal
>> of cores in select_idle_core() while remembering the least loaded core
>> will still have the problem of potentially traversing all cores. I can
>> compare this with a core level only SMT balancing, is that useful to
>> decide? I will also test on SPARC machines with higher degree of SMT.
> Please learn to use your email client, that's near unreadable for
> whitespace damage, time is really too precious to try and untangle crap
> like that.
Sorry about that.
>
>> You had also mentioned to do it for only SMT >2, not sure I understand why
>> as even for SMT=2 (intel) benchmarks show improvement. This clearly shows
>> the scalability problem.
> For SMT2 you don't need the occupation counter with atomic crud, a
> simple !atomic core-idle works just fine. And your 'patch' had soo many
> moving parts it was too hard to say which aspect changed what.
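[A small standalone sketch of the SMT2 point above, for illustration only; this is not kernel code, and struct core2 / sibling_idle are made-up names. With only two hardware threads per core, a plain per-core idle check over the two siblings is enough and no atomic occupancy counter is needed.]

#include <stdbool.h>
#include <stdio.h>

struct core2 {
        bool sibling_idle[2];   /* each sibling updates only its own flag */
};

static bool core_is_idle(const struct core2 *c)
{
        /*
         * No atomics: a slightly stale answer only costs a little
         * placement quality, never correctness.
         */
        return c->sibling_idle[0] && c->sibling_idle[1];
}

int main(void)
{
        struct core2 c = { .sibling_idle = { true, false } };
        printf("core idle: %d\n", core_is_idle(&c));
        return 0;
}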
>
> hackbench really isn't _that_ interesting a benchmark and its fairly
> easy to make it go faster, but doing so typically wrecks something else.
I also had uperf and Oracle DB TPC-C numbers in the patch, which showed
improvements.
>
> There's a whole array of netperf benchmarks which you have to run at
> different utilization levels, there's the sysbench oltp stuff, which you
> can run on MariaDB and PostgreSQL, again at different utilization levels
> (both databases behave quite differently). There's ebizzy, tbench,
> dbench and others.
>
> There's also NUMA stuff, like NAS benchmarks and specJBB nonsense.
>
> There's the facebook schbench, which is a bit finnicky to set up.
>
> And I'm sure I'm forgetting half, look at the patches and regression
> reports for more clues, that's what Google is for. You could have
> figured all that out yourself, much of these are listed in the
> changelogs and if you'd bothered to find lkml regression reports you
> could find more.
OK, I will look those up and run them.

Thanks,
Subhra
