Message-ID: <559AD9CE.4090309@fb.com>
Date:	Mon, 6 Jul 2015 15:41:02 -0400
From:	Josef Bacik <jbacik@...com>
To:	Mike Galbraith <umgwanakikbuti@...il.com>
CC:	Peter Zijlstra <peterz@...radead.org>, <riel@...hat.com>,
	<mingo@...hat.com>, <linux-kernel@...r.kernel.org>,
	<morten.rasmussen@....com>, kernel-team <Kernel-team@...com>
Subject: Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for
 BALANCE_WAKE

On 07/06/2015 02:36 PM, Mike Galbraith wrote:
> On Mon, 2015-07-06 at 10:34 -0400, Josef Bacik wrote:
>> On 07/06/2015 01:13 AM, Mike Galbraith wrote:
>>> Hm.  Piddling with pgbench, which doesn't seem to collapse into a
>>> quivering heap when load exceeds cores these days, I found the deltas
>>> weren't all that impressive, but it does appreciate the extra effort a
>>> bit, and a bit more when the clients receive it as well.
>>>
>>> If you test, and have time to piddle, you could try letting wake_wide()
>>> return 1 + sched_feat(WAKE_WIDE_IDLE) instead of adding it only when the
>>> wakee is the dispatcher.
>>>
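(To make that suggestion concrete: below is a minimal sketch of the tweak,
assuming the wakee_flips-based wake_wide() shape from this thread; the body
is illustrative rather than Mike's actual patch.)

	static int wake_wide(struct task_struct *p)
	{
		unsigned int master = current->wakee_flips;
		unsigned int slave = p->wakee_flips;
		int factor = this_cpu_read(sd_llc_size);

		if (master < slave)
			swap(master, slave);
		if (slave < factor || master < slave * factor)
			return 0;	/* stay affine */
		/* sched_feat() evaluates to 0 or 1, so this returns 1 or 2,
		 * letting the caller also hunt for an idle cpu when set */
		return 1 + sched_feat(WAKE_WIDE_IDLE);
	}
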
>>> Numbers from my little desktop box.
>>>
>>> NO_WAKE_WIDE_IDLE
>>> postgres@...er:~> pgbench.sh
>>> clients 8       tps = 116697.697662
>>> clients 12      tps = 115160.230523
>>> clients 16      tps = 115569.804548
>>> clients 20      tps = 117879.230514
>>> clients 24      tps = 118281.753040
>>> clients 28      tps = 116974.796627
>>> clients 32      tps = 119082.163998   avg   117092.239   1.000
>>>
>>> WAKE_WIDE_IDLE
>>> postgres@...er:~> pgbench.sh
>>> clients 8       tps = 124351.735754
>>> clients 12      tps = 124419.673135
>>> clients 16      tps = 125050.716498
>>> clients 20      tps = 124813.042352
>>> clients 24      tps = 126047.442307
>>> clients 28      tps = 125373.719401
>>> clients 32      tps = 126711.243383   avg   125252.510   1.069   1.000
>>>
>>> WAKE_WIDE_IDLE (clients as well as server)
>>> postgres@...er:~> pgbench.sh
>>> clients 8       tps = 130539.795246
>>> clients 12      tps = 128984.648554
>>> clients 16      tps = 130564.386447
>>> clients 20      tps = 129149.693118
>>> clients 24      tps = 130211.119780
>>> clients 28      tps = 130325.355433
>>> clients 32      tps = 129585.656963   avg   129908.665   1.109   1.037
>
> I had a typo in my script, so those desktop box numbers were all doing
> the same number of clients.  It doesn't invalidate anything, but the
> individual deltas are just run-to-run variance... not to mention that a
> single-cache box isn't all that interesting for this anyway.  It only
> gets interesting when the interconnect becomes a player.
>
>> I have time for twiddling; we're carrying ye olde WAKE_IDLE until we
>> get this solved upstream, and then I'll rip out the old and put in the
>> new.  I'm happy to screw around until we're all happy.  I'll throw this
>> into a kernel this morning and run stuff today.  Barring any issues
>> with the testing infrastructure, I should have results today.  Thanks,
>
> I'll be interested in your results.  Taking pgbench to a little NUMA
> box, I'm seeing _nada_ outside of variance with master (crap).  I have a
> way to win significantly for _older_ kernels, and that win over master
> _may_ provide some useful insight, but I don't trust postgres/pgbench as
> far as I can toss the planet, so I don't have a warm fuzzy about trying
> to use it to approximate your real-world load.
>
> BTW, what does your topology look like (numactl --hardware)?
>

So the NO_WAKE_WIDE_IDLE results are very good: almost the same as the 
baseline, with a slight regression at lower RPS and a slight improvement 
at high RPS.  I'm running with WAKE_WIDE_IDLE set now; that should be 
done soonish, and then I'll do the 1 + sched_feat(WAKE_WIDE_IDLE) thing 
next, with those results coming in the morning.  Here is the NUMA 
information from one of the boxes in the test cluster:

available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 20 21 22 23 24 25 26 27 28 29
node 0 size: 15890 MB
node 0 free: 2651 MB
node 1 cpus: 10 11 12 13 14 15 16 17 18 19 30 31 32 33 34 35 36 37 38 39
node 1 size: 16125 MB
node 1 free: 2063 MB
node distances:
node   0   1
   0:  10  20
   1:  20  10
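
(A side note on mechanics, for anyone reproducing the A/B runs:
WAKE_WIDE_IDLE is a feature bit from Mike's out-of-tree patch, not
mainline.  Assuming the standard sched_feat plumbing, it would be
declared roughly as below, and flipped at runtime by writing
WAKE_WIDE_IDLE or NO_WAKE_WIDE_IDLE to /sys/kernel/debug/sched_features.)

	/* kernel/sched/features.h -- hypothetical declaration, default off */
	SCHED_FEAT(WAKE_WIDE_IDLE, false)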

Thanks,

Josef
