lists.openwall.net - Open Source and information security mailing list archives
Date: Mon, 23 Apr 2018 17:41:13 -0700
From: subhra mazumdar <subhra.mazumdar@...cle.com>
To: linux-kernel@...r.kernel.org
Cc: peterz@...radead.org, mingo@...hat.com, daniel.lezcano@...aro.org,
    steven.sistare@...cle.com, dhaval.giani@...cle.com,
    rohit.k.jain@...cle.com, subhra.mazumdar@...cle.com
Subject: [RFC/RFT PATCH 0/3] Improve scheduler scalability for fast path

select_idle_sibling() currently first tries to find a fully idle core using
select_idle_core(), which can potentially search all cores; if that fails, it
finds any idle CPU using select_idle_cpu(). select_idle_cpu() can potentially
search all CPUs in the LLC domain. This doesn't scale for large LLC domains
and will only get worse with more cores in the future.

This patch set solves the scalability problem by:

- Removing select_idle_core(), as it can potentially scan the full LLC domain
  even if there is only one idle core, which doesn't scale.
- Lowering the lower limit of the nr variable in select_idle_cpu() and also
  setting an upper limit to restrict search time.

Additionally, it introduces a new per-CPU variable, next_cpu, to track where
the last search ended, so that each search starts from where the previous one
stopped. This rotating search window over the CPUs in the LLC domain ensures
that idle CPUs are eventually found under high load.

Following are the performance numbers with various benchmarks.
Hackbench process on 2 socket, 44 core and 88 threads Intel x86 machine
(lower is better):
groups  baseline  %stdev  patch            %stdev
1       0.5742    21.13   0.5334 (7.10%)   5.2
2       0.5776    7.87    0.5393 (6.63%)   6.39
4       0.9578    1.12    0.9537 (0.43%)   1.08
8       1.7018    1.35    1.682 (1.16%)    1.33
16      2.9955    1.36    2.9849 (0.35%)   0.96
32      5.4354    0.59    5.3308 (1.92%)   0.60

Sysbench MySQL on 1 socket, 6 core and 12 threads Intel x86 machine
(higher is better):
threads  baseline  patch
2        49.53     49.83 (0.61%)
4        89.07     90 (1.05%)
8        149       154 (3.31%)
16       240       246 (2.56%)
32       357       351 (-1.69%)
64       428       428 (-0.03%)
128      473       469 (-0.92%)

Sysbench PostgreSQL on 1 socket, 6 core and 12 threads Intel x86 machine
(higher is better):
threads  baseline  patch
2        68.35     70.07 (2.51%)
4        93.53     92.54 (-1.05%)
8        125       127 (1.16%)
16       145       146 (0.92%)
32       158       156 (-1.24%)
64       160       160 (0.47%)

Oracle DB on 2 socket, 44 core and 88 threads Intel x86 machine
(normalized, higher is better):
users  baseline  %stdev  patch            %stdev
20     1         1.35    1.0075 (0.75%)   0.71
40     1         0.42    0.9971 (-0.29%)  0.26
60     1         1.54    0.9955 (-0.45%)  0.83
80     1         0.58    1.0059 (0.59%)   0.59
100    1         0.77    1.0201 (2.01%)   0.39
120    1         0.35    1.0145 (1.45%)   1.41
140    1         0.19    1.0325 (3.25%)   0.77
160    1         0.09    1.0277 (2.77%)   0.57
180    1         0.99    1.0249 (2.49%)   0.79
200    1         1.03    1.0133 (1.33%)   0.77
220    1         1.69    1.0317 (3.17%)   1.41

Uperf pingpong on 2 socket, 44 core and 88 threads Intel x86 machine with
message size = 8k (higher is better):
threads  baseline  %stdev  patch             %stdev
8        49.47     0.35    50.96 (3.02%)     0.12
16       95.28     0.77    99.01 (3.92%)     0.14
32       156.77    1.17    180.64 (15.23%)   1.05
48       193.24    0.22    214.7 (11.1%)     1
64       216.21    9.33    252.81 (16.93%)   1.68
128      379.62    10.29   397.47 (4.75%)    0.41

Dbench on 2 socket, 44 core and 88 threads Intel x86 machine
(higher is better):
clients  baseline  patch
1        627.62    629.14 (0.24%)
2        1153.45   1179.9 (2.29%)
4        2060.29   2051.62 (-0.42%)
8        2724.41   2609.4 (-4.22%)
16       2987.56   2891.54 (-3.21%)
32       2375.82   2345.29 (-1.29%)
64       1963.31   1903.61 (-3.04%)
128      1546.01   1513.17 (-2.12%)

Tbench on 2 socket, 44 core and 88 threads Intel x86 machine
(higher is better):
clients  baseline  patch
1        279.33    285.154 (2.08%)
2        545.961   572.538 (4.87%)
4        1081.06   1126.51 (4.2%)
8        2158.47   2234.78 (3.53%)
16       4223.78   4358.11 (3.18%)
32       7117.08   8022.19 (12.72%)
64       8947.28   10719.7 (19.81%)
128      15976.7   17531.2 (9.73%)

Iperf on 2 socket, 24 core and 48 threads Intel x86 machine with
message size = 256 (higher is better):
clients  baseline  %stdev  patch          %stdev
1        2699      4.86    2697 (-0.1%)   3.74
10       18832     0       18830 (0%)     0.01
100      18830     0.05    18827 (0%)     0.08

Iperf on 2 socket, 24 core and 48 threads Intel x86 machine with
message size = 1K (higher is better):
clients  baseline  %stdev  patch        %stdev
1        9414      0.02    9414 (0%)    0.01
10       18832     0       18832 (0%)   0
100      18830     0.05    18829 (0%)   0.04

Iperf on 2 socket, 24 core and 48 threads Intel x86 machine with
message size = 4K (higher is better):
clients  baseline  %stdev  patch        %stdev
1        9414      0.01    9414 (0%)    0
10       18832     0       18832 (0%)   0
100      18829     0.04    18833 (0%)   0

Iperf on 2 socket, 24 core and 48 threads Intel x86 machine with
message size = 64K (higher is better):
clients  baseline  %stdev  patch        %stdev
1        9415      0.01    9415 (0%)    0
10       18832     0       18832 (0%)   0
100      18830     0.04    18833 (0%)   0

Iperf on 2 socket, 24 core and 48 threads Intel x86 machine with
message size = 1M (higher is better):
clients  baseline  %stdev  patch          %stdev
1        9415      0.01    9415 (0%)      0.01
10       18832     0       18832 (0%)     0
100      18830     0.04    18819 (-0.1%)  0.13

JBB on 2 socket, 28 core and 56 threads Intel x86 machine
(higher is better):
               baseline  %stdev  patch           %stdev
jops           60049     0.65    60191 (0.2%)    0.99
critical jops  29689     0.76    29044 (-2.2%)   1.46

Schbench on 2 socket, 24 core and 48 threads Intel x86 machine with
24 tasks (lower is better):
percentile  baseline  %stdev  patch           %stdev
50          5007      0.16    5003 (0.1%)     0.12
75          10000     0       10000 (0%)      0
90          16992     0       16998 (0%)      0.12
95          21984     0       22043 (-0.3%)   0.83
99          34229     1.2     34069 (0.5%)    0.87
99.5        39147     1.1     38741 (1%)      1.1
99.9        49568     1.59    49579 (0%)      1.78

Ebizzy on 2 socket, 44 core and 88 threads Intel x86 machine
(higher is better):
threads  baseline  %stdev  patch            %stdev
1        26477     2.66    26646 (0.6%)     2.81
2        52303     1.72    52987 (1.3%)     1.59
4        100854    2.48    101824 (1%)      2.42
8        188059    6.91    189149 (0.6%)    1.75
16       328055    3.42    333963 (1.8%)    2.03
32       504419    2.23    492650 (-2.3%)   1.76
88       534999    5.35    569326 (6.4%)    3.07
156      541703    2.42    544463 (0.5%)    2.17

NAS: the whole suite of NAS benchmarks was run on a 2 socket, 36 core and
72 threads Intel x86 machine, with no statistically significant regressions
and improvements in some cases. I am not listing the results due to the
large number of data points.

subhra mazumdar (3):
  sched: remove select_idle_core() for scalability
  sched: introduce per-cpu var next_cpu to track search limit
  sched: limit cpu search and rotate search window for scalability

 include/linux/sched/topology.h |   1 -
 kernel/sched/core.c            |   2 +
 kernel/sched/fair.c            | 116 +++++------------------------------------
 kernel/sched/idle.c            |   1 -
 kernel/sched/sched.h           |  11 +---
 5 files changed, 17 insertions(+), 114 deletions(-)

--
2.9.3