Message-ID: <20220624020723.GA11803@chenyu5-mobl1>
Date: Fri, 24 Jun 2022 10:07:23 +0800
From: Chen Yu <yu.c.chen@...el.com>
To: K Prateek Nayak <kprateek.nayak@....com>
Cc: Peter Zijlstra <peterz@...radead.org>,
Vincent Guittot <vincent.guittot@...aro.org>,
Mel Gorman <mgorman@...e.de>, Ingo Molnar <mingo@...hat.com>,
Juri Lelli <juri.lelli@...hat.com>,
Dietmar Eggemann <dietmar.eggemann@....com>,
Steven Rostedt <rostedt@...dmis.org>,
Barry Song <21cnbao@...il.com>,
Srikar Dronamraju <srikar@...ux.vnet.ibm.com>,
Len Brown <len.brown@...el.com>,
Ben Segall <bsegall@...gle.com>,
Aubrey Li <aubrey.li@...el.com>,
Abel Wu <wuyun.abel@...edance.com>,
Daniel Bristot de Oliveira <bristot@...hat.com>,
Tim Chen <tim.c.chen@...el.com>, linux-kernel@...r.kernel.org,
Yicong Yang <yangyicong@...ilicon.com>,
Mohini Narkhede <mohini.narkhede@...el.com>
Subject: Re: [PATCH v4] sched/fair: Introduce SIS_UTIL to search idle CPU
based on sum of util_avg
Hi Prateek,
On Wed, Jun 22, 2022 at 12:06:55PM +0530, K Prateek Nayak wrote:
> Hello Chenyu,
>
> I'm sorry for the delay. The testing took a while but below are
> the results from testing on our system.
>
> tl;dr
>
> o We ran all the tests with SIS_PROP disabled.
> o tbench reaches close to saturation already with 256 clients.
> o schbench shows improvements for low worker counts.
> o All other benchmark results seem comparable to tip.
> We don't see any serious regressions with v4.
>
> I've added detailed benchmark results and some analysis below.
>
Thanks very much for the test.
> On 6/12/2022 10:04 PM, Chen Yu wrote:
> > [Problem Statement]
> > select_idle_cpu() might spend too much time searching for an idle CPU,
> > when the system is overloaded.
> >
> > The following histogram is the time spent in select_idle_cpu(),
> > when running 224 instances of netperf on a system with 112 CPUs
> > per LLC domain:
> >
> > @usecs:
> > [0] 533 | |
> > [1] 5495 | |
> > [2, 4) 12008 | |
> > [4, 8) 239252 | |
> > [8, 16) 4041924 |@@@@@@@@@@@@@@ |
> > [16, 32) 12357398 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
> > [32, 64) 14820255 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> > [64, 128) 13047682 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
> > [128, 256) 8235013 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
> > [256, 512) 4507667 |@@@@@@@@@@@@@@@ |
> > [512, 1K) 2600472 |@@@@@@@@@ |
> > [1K, 2K) 927912 |@@@ |
> > [2K, 4K) 218720 | |
> > [4K, 8K) 98161 | |
> > [8K, 16K) 37722 | |
> > [16K, 32K) 6715 | |
> > [32K, 64K) 477 | |
> > [64K, 128K) 7 | |
> >
> > netperf latency usecs:
> > =======
> > case load Lat_99th std%
> > TCP_RR thread-224 257.39 ( 0.21)
> >
> > The time spent in select_idle_cpu() is visible to netperf and might have a negative
> > impact.
> >
> > [Symptom analysis]
> > The patch [1] from Mel Gorman has been applied to track the efficiency
> > of select_idle_sibling. The indicators are copied here:
> >
> > SIS Search Efficiency(se_eff%):
> > A ratio expressed as a percentage of runqueues scanned versus
> > idle CPUs found. A 100% efficiency indicates that the target,
> > prev or recent CPU of a task was idle at wakeup. The lower the
> > efficiency, the more runqueues were scanned before an idle CPU
> > was found.
> >
> > SIS Domain Search Efficiency(dom_eff%):
> > Similar, except only for the slower SIS
> > path.
> >
> > SIS Fast Success Rate(fast_rate%):
> > Percentage of SIS that used target, prev or
> > recent CPUs.
> >
> > SIS Success rate(success_rate%):
> > Percentage of scans that found an idle CPU.
> >
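> > To make these ratios concrete, the percentages below can be read as
> > simple fractions of the counters described above. The following
> > user-space sketch is for illustration only; the names and values are
> > made up for this example, and the authoritative formulas are the ones
> > in the schedstat parsing script used below:
> >
> > #include <stdio.h>
> >
> > /* Counters named after the descriptions above; the values are dummies. */
> > struct sis_counters {
> >         unsigned long long searches;        /* calls to select_idle_sibling() */
> >         unsigned long long domain_searches; /* fast path failed, LLC was scanned */
> >         unsigned long long scanned;         /* runqueues scanned in total */
> >         unsigned long long domain_scanned;  /* runqueues scanned in the slow path */
> >         unsigned long long found;           /* searches that found an idle CPU */
> >         unsigned long long domain_found;    /* slow-path searches that found one */
> > };
> >
> > int main(void)
> > {
> >         struct sis_counters c = { 1000, 400, 3000, 2800, 950, 350 };
> >
> >         printf("se_eff%%       = %.3f\n", 100.0 * c.found / c.scanned);
> >         printf("dom_eff%%      = %.3f\n", 100.0 * c.domain_found / c.domain_scanned);
> >         printf("fast_rate%%    = %.3f\n",
> >                100.0 * (c.searches - c.domain_searches) / c.searches);
> >         printf("success_rate%% = %.3f\n", 100.0 * c.found / c.searches);
> >         return 0;
> > }
> >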
> > The test is based on Aubrey's schedtests tool, including netperf, hackbench,
> > schbench and tbench.
> >
> > Test on vanilla kernel:
> > schedstat_parse.py -f netperf_vanilla.log
> > case load se_eff% dom_eff% fast_rate% success_rate%
> > TCP_RR 28 threads 99.978 18.535 99.995 100.000
> > TCP_RR 56 threads 99.397 5.671 99.964 100.000
> > TCP_RR 84 threads 21.721 6.818 73.632 100.000
> > TCP_RR 112 threads 12.500 5.533 59.000 100.000
> > TCP_RR 140 threads 8.524 4.535 49.020 100.000
> > TCP_RR 168 threads 6.438 3.945 40.309 99.999
> > TCP_RR 196 threads 5.397 3.718 32.320 99.982
> > TCP_RR 224 threads 4.874 3.661 25.775 99.767
> > UDP_RR 28 threads 99.988 17.704 99.997 100.000
> > UDP_RR 56 threads 99.528 5.977 99.970 100.000
> > UDP_RR 84 threads 24.219 6.992 76.479 100.000
> > UDP_RR 112 threads 13.907 5.706 62.538 100.000
> > UDP_RR 140 threads 9.408 4.699 52.519 100.000
> > UDP_RR 168 threads 7.095 4.077 44.352 100.000
> > UDP_RR 196 threads 5.757 3.775 35.764 99.991
> > UDP_RR 224 threads 5.124 3.704 28.748 99.860
> >
> > schedstat_parse.py -f schbench_vanilla.log
> > (each group has 28 tasks)
> > case load se_eff% dom_eff% fast_rate% success_rate%
> > normal 1 mthread 99.152 6.400 99.941 100.000
> > normal 2 mthreads 97.844 4.003 99.908 100.000
> > normal 3 mthreads 96.395 2.118 99.917 99.998
> > normal 4 mthreads 55.288 1.451 98.615 99.804
> > normal 5 mthreads 7.004 1.870 45.597 61.036
> > normal 6 mthreads 3.354 1.346 20.777 34.230
> > normal 7 mthreads 2.183 1.028 11.257 21.055
> > normal 8 mthreads 1.653 0.825 7.849 15.549
> >
> > schedstat_parse.py -f hackbench_vanilla.log
> > (each group has 28 tasks)
> > case load se_eff% dom_eff% fast_rate% success_rate%
> > process-pipe 1 group 99.991 7.692 99.999 100.000
> > process-pipe 2 groups 99.934 4.615 99.997 100.000
> > process-pipe 3 groups 99.597 3.198 99.987 100.000
> > process-pipe 4 groups 98.378 2.464 99.958 100.000
> > process-pipe 5 groups 27.474 3.653 89.811 99.800
> > process-pipe 6 groups 20.201 4.098 82.763 99.570
> > process-pipe 7 groups 16.423 4.156 77.398 99.316
> > process-pipe 8 groups 13.165 3.920 72.232 98.828
> > process-sockets 1 group 99.977 5.882 99.999 100.000
> > process-sockets 2 groups 99.927 5.505 99.996 100.000
> > process-sockets 3 groups 99.397 3.250 99.980 100.000
> > process-sockets 4 groups 79.680 4.258 98.864 99.998
> > process-sockets 5 groups 7.673 2.503 63.659 92.115
> > process-sockets 6 groups 4.642 1.584 58.946 88.048
> > process-sockets 7 groups 3.493 1.379 49.816 81.164
> > process-sockets 8 groups 3.015 1.407 40.845 75.500
> > threads-pipe 1 group 99.997 0.000 100.000 100.000
> > threads-pipe 2 groups 99.894 2.932 99.997 100.000
> > threads-pipe 3 groups 99.611 4.117 99.983 100.000
> > threads-pipe 4 groups 97.703 2.624 99.937 100.000
> > threads-pipe 5 groups 22.919 3.623 87.150 99.764
> > threads-pipe 6 groups 18.016 4.038 80.491 99.557
> > threads-pipe 7 groups 14.663 3.991 75.239 99.247
> > threads-pipe 8 groups 12.242 3.808 70.651 98.644
> > threads-sockets 1 group 99.990 6.667 99.999 100.000
> > threads-sockets 2 groups 99.940 5.114 99.997 100.000
> > threads-sockets 3 groups 99.469 4.115 99.977 100.000
> > threads-sockets 4 groups 87.528 4.038 99.400 100.000
> > threads-sockets 5 groups 6.942 2.398 59.244 88.337
> > threads-sockets 6 groups 4.359 1.954 49.448 87.860
> > threads-sockets 7 groups 2.845 1.345 41.198 77.102
> > threads-sockets 8 groups 2.871 1.404 38.512 74.312
> >
> > schedstat_parse.py -f tbench_vanilla.log
> > case load se_eff% dom_eff% fast_rate% success_rate%
> > loopback 28 threads 99.976 18.369 99.995 100.000
> > loopback 56 threads 99.222 7.799 99.934 100.000
> > loopback 84 threads 19.723 6.819 70.215 100.000
> > loopback 112 threads 11.283 5.371 55.371 99.999
> > loopback 140 threads 0.000 0.000 0.000 0.000
> > loopback 168 threads 0.000 0.000 0.000 0.000
> > loopback 196 threads 0.000 0.000 0.000 0.000
> > loopback 224 threads 0.000 0.000 0.000 0.000
> >
> > According to the test above, if the system becomes busy, the
> > SIS Search Efficiency (se_eff%) drops significantly. Although some
> > benchmarks would finally find an idle CPU (success_rate% = 100%), it is
> > doubtful whether it is worth searching the whole LLC domain.
> >
> > [Proposal]
> > It would be ideal to have a crystal ball to answer this question:
> > How many CPUs must a wakeup path walk down, before it can find an idle
> > CPU? Many potential metrics could be used to predict the number.
> > One candidate is the sum of util_avg in this LLC domain. The benefit
> > of choosing util_avg is that it is a metric of accumulated historic
> > activity, which seems to be smoother than instantaneous metrics
> > (such as rq->nr_running). Besides, choosing the sum of util_avg
> > would help predict the load of the LLC domain more precisely, because
> > SIS_PROP only uses one CPU's idle time to estimate the whole LLC
> > domain's idle time.
> >
> > In summary, the lower the util_avg is, the more select_idle_cpu()
> > should scan for an idle CPU, and vice versa. When the sum of util_avg
> > in this LLC domain hits 85% or above, the scan stops. The reason to
> > choose 85% as the threshold is that this is the imbalance_pct (117)
> > when an LLC sched group is overloaded.
> >
> > Introduce the quadratic function:
> >
> > y = SCHED_CAPACITY_SCALE - p * x^2
> > and y'= y / SCHED_CAPACITY_SCALE
> >
> > x is the ratio of sum_util compared to the CPU capacity:
> > x = sum_util / (llc_weight * SCHED_CAPACITY_SCALE)
> > y' is the ratio of CPUs to be scanned in the LLC domain,
> > and the number of CPUs to scan is calculated by:
> >
> > nr_scan = llc_weight * y'
> >
> > The quadratic function was chosen because:
> > [1] Compared to the linear function, it scans more aggressively when the
> > sum_util is low.
> > [2] Compared to the exponential function, it is easier to calculate.
> > [3] It seems that there is no accurate mapping between the sum of util_avg
> > and the number of CPUs to be scanned. Use a heuristic scan for now.
> >
> > For a platform with 112 CPUs per LLC, the number of CPUs to scan is:
> > sum_util% 0 5 15 25 35 45 55 65 75 85 86 ...
> > scan_nr 112 111 108 102 93 81 65 47 25 1 0 ...
> >
> > For a platform with 16 CPUs per LLC, the number of CPUs to scan is:
> > sum_util% 0 5 15 25 35 45 55 65 75 85 86 ...
> > scan_nr 16 15 15 14 13 11 9 6 3 0 0 ...
> >
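> > The following stand-alone C sketch models this calculation with the
> > same kind of fixed-point arithmetic (SCHED_CAPACITY_SCALE = 1024) and
> > reproduces the scan_nr rows above for a 112-CPU LLC; changing
> > llc_weight to 16 reproduces the second table. It is an illustration of
> > the heuristic, not a copy of the kernel code:
> >
> > #include <stdio.h>
> >
> > #define SCHED_CAPACITY_SCALE    1024ULL
> >
> > /*
> >  * x'      = sum_util / llc_weight
> >  * y       = SCHED_CAPACITY_SCALE - pct^2 * x'^2 / (10000 * SCHED_CAPACITY_SCALE)
> >  * nr_scan = llc_weight * y / SCHED_CAPACITY_SCALE
> >  * With pct = imbalance_pct = 117, y reaches 0 around sum_util = 85%.
> >  */
> > static unsigned long long calc_nr_scan(unsigned long long sum_util,
> >                                        unsigned int llc_weight, unsigned int pct)
> > {
> >         unsigned long long x = sum_util / llc_weight;
> >         unsigned long long tmp = x * x * pct * pct / (10000 * SCHED_CAPACITY_SCALE);
> >
> >         if (tmp > SCHED_CAPACITY_SCALE)
> >                 tmp = SCHED_CAPACITY_SCALE;
> >
> >         return llc_weight * (SCHED_CAPACITY_SCALE - tmp) / SCHED_CAPACITY_SCALE;
> > }
> >
> > int main(void)
> > {
> >         unsigned int utils[] = { 0, 5, 15, 25, 35, 45, 55, 65, 75, 85, 86 };
> >         unsigned int llc_weight = 112, i;
> >
> >         for (i = 0; i < sizeof(utils) / sizeof(utils[0]); i++) {
> >                 unsigned long long sum_util =
> >                         utils[i] * llc_weight * SCHED_CAPACITY_SCALE / 100;
> >
> >                 printf("sum_util%%=%3u nr_scan=%3llu\n", utils[i],
> >                        calc_nr_scan(sum_util, llc_weight, 117));
> >         }
> >         return 0;
> > }
> >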
> > Furthermore, to minimize the overhead of calculating the metrics in
> > select_idle_cpu(), borrow the statistics from periodic load balance.
> > As mentioned by Abel, on a platform with 112 CPUs per LLC, the
> > sum_util calculated by periodic load balance after 112 ms would
> > decay to about 0.5 * 0.5 * 0.5 * 0.7 = 8.75%, thus bringing a delay
> > in reflecting the latest utilization. But it is a trade-off.
> > Checking the util_avg in newidle load balance would be more frequent,
> > but it brings overhead - multiple CPUs write/read the per-LLC shared
> > variable and introduces cache contention. Tim also mentioned that,
> > it is allowed to be non-optimal in terms of scheduling for the
> > short-term variations, but if there is a long-term trend in the load
> > behavior, the scheduler can adjust for that.
> >
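> > As a side note, the 8.75% figure follows from the PELT half-life of
> > 32 ms: 112 ms is three full half-lives plus roughly another 16 ms,
> > i.e. about 0.5^3 * 0.5^(16/32) ~= 0.125 * 0.707, which is where the
> > 0.5 * 0.5 * 0.5 * 0.7 = 8.75% approximation comes from. A trivial
> > check (compile with -lm):
> >
> > #include <math.h>
> > #include <stdio.h>
> >
> > int main(void)
> > {
> >         /* PELT halves a utilization contribution every 32 ms */
> >         double remaining = pow(0.5, 112.0 / 32.0);
> >
> >         printf("remaining after 112 ms: %.2f%%\n", remaining * 100.0); /* ~8.84 */
> >         return 0;
> > }
> >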
> > When SIS_UTIL is enabled, select_idle_cpu() uses the nr_scan
> > calculated by SIS_UTIL instead of the one from SIS_PROP. As Peter and
> > Mel suggested, SIS_UTIL should be enabled by default.
> >
> > This patch is based on the util_avg, which is very sensitive to the
> > CPU frequency invariance. There is an issue that, when the max frequency
> > has been clamped, the util_avg would decay insanely fast when
> > the CPU is idle. Commit addca285120b ("cpufreq: intel_pstate: Handle no_turbo
> > in frequency invariance") could be used to mitigate this symptom, by adjusting
> > the arch_max_freq_ratio when turbo is disabled. But this issue is still
> > not thoroughly fixed, because the current code is unaware of the user-specified
> > max CPU frequency.
> >
> > [Test result]
> >
> > netperf and tbench were launched with 25%, 50%, 75%, 100%, 125%, 150%,
> > 175% and 200% of the CPU number, respectively. Hackbench and schbench were
> > launched with 1, 2, 4 and 8 groups. Each test lasts for 100 seconds and is
> > repeated 3 times.
> >
> > The following is the benchmark result comparison between
> > baseline:vanilla v5.19-rc1 and compare:patched kernel. Positive compare%
> > indicates better performance.
> >
> > Each netperf test is a:
> > netperf -4 -H 127.0.1 -t TCP/UDP_RR -c -C -l 100
> > netperf.throughput
> > =======
> > case load baseline(std%) compare%( std%)
> > TCP_RR 28 threads 1.00 ( 0.34) -0.16 ( 0.40)
> > TCP_RR 56 threads 1.00 ( 0.19) -0.02 ( 0.20)
> > TCP_RR 84 threads 1.00 ( 0.39) -0.47 ( 0.40)
> > TCP_RR 112 threads 1.00 ( 0.21) -0.66 ( 0.22)
> > TCP_RR 140 threads 1.00 ( 0.19) -0.69 ( 0.19)
> > TCP_RR 168 threads 1.00 ( 0.18) -0.48 ( 0.18)
> > TCP_RR 196 threads 1.00 ( 0.16) +194.70 ( 16.43)
> > TCP_RR 224 threads 1.00 ( 0.16) +197.30 ( 7.85)
> > UDP_RR 28 threads 1.00 ( 0.37) +0.35 ( 0.33)
> > UDP_RR 56 threads 1.00 ( 11.18) -0.32 ( 0.21)
> > UDP_RR 84 threads 1.00 ( 1.46) -0.98 ( 0.32)
> > UDP_RR 112 threads 1.00 ( 28.85) -2.48 ( 19.61)
> > UDP_RR 140 threads 1.00 ( 0.70) -0.71 ( 14.04)
> > UDP_RR 168 threads 1.00 ( 14.33) -0.26 ( 11.16)
> > UDP_RR 196 threads 1.00 ( 12.92) +186.92 ( 20.93)
> > UDP_RR 224 threads 1.00 ( 11.74) +196.79 ( 18.62)
> >
> > Taking 224 threads as an example, the changes in the SIS search metrics are
> > illustrated below:
> >
> > vanilla patched
> > 4544492 +237.5% 15338634 sched_debug.cpu.sis_domain_search.avg
> > 38539 +39686.8% 15333634 sched_debug.cpu.sis_failed.avg
> > 128300000 -87.9% 15551326 sched_debug.cpu.sis_scanned.avg
> > 5842896 +162.7% 15347978 sched_debug.cpu.sis_search.avg
> >
> > There are 87.9% fewer CPU scans after the patch, which indicates lower overhead.
> > Besides, with this patch applied, there is 13% less rq lock contention
> > in perf-profile.calltrace.cycles-pp._raw_spin_lock.raw_spin_rq_lock_nested
> > .try_to_wake_up.default_wake_function.woken_wake_function.
> > This might help explain the performance improvement, because the patch allows
> > the waking task to remain on its previous CPU rather than grabbing other CPUs'
> > locks.
> >
> > Each hackbench test is a:
> > hackbench -g $job --process/threads --pipe/sockets -l 1000000 -s 100
> > hackbench.throughput
> > =========
> > case load baseline(std%) compare%( std%)
> > process-pipe 1 group 1.00 ( 1.29) +0.57 ( 0.47)
> > process-pipe 2 groups 1.00 ( 0.27) +0.77 ( 0.81)
> > process-pipe 4 groups 1.00 ( 0.26) +1.17 ( 0.02)
> > process-pipe 8 groups 1.00 ( 0.15) -4.79 ( 0.02)
> > process-sockets 1 group 1.00 ( 0.63) -0.92 ( 0.13)
> > process-sockets 2 groups 1.00 ( 0.03) -0.83 ( 0.14)
> > process-sockets 4 groups 1.00 ( 0.40) +5.20 ( 0.26)
> > process-sockets 8 groups 1.00 ( 0.04) +3.52 ( 0.03)
> > threads-pipe 1 group 1.00 ( 1.28) +0.07 ( 0.14)
> > threads-pipe 2 groups 1.00 ( 0.22) -0.49 ( 0.74)
> > threads-pipe 4 groups 1.00 ( 0.05) +1.88 ( 0.13)
> > threads-pipe 8 groups 1.00 ( 0.09) -4.90 ( 0.06)
> > threads-sockets 1 group 1.00 ( 0.25) -0.70 ( 0.53)
> > threads-sockets 2 groups 1.00 ( 0.10) -0.63 ( 0.26)
> > threads-sockets 4 groups 1.00 ( 0.19) +11.92 ( 0.24)
> > threads-sockets 8 groups 1.00 ( 0.08) +4.31 ( 0.11)
> >
> > Each tbench test is a:
> > tbench -t 100 $job 127.0.0.1
> > tbench.throughput
> > ======
> > case load baseline(std%) compare%( std%)
> > loopback 28 threads 1.00 ( 0.06) -0.14 ( 0.09)
> > loopback 56 threads 1.00 ( 0.03) -0.04 ( 0.17)
> > loopback 84 threads 1.00 ( 0.05) +0.36 ( 0.13)
> > loopback 112 threads 1.00 ( 0.03) +0.51 ( 0.03)
> > loopback 140 threads 1.00 ( 0.02) -1.67 ( 0.19)
> > loopback 168 threads 1.00 ( 0.38) +1.27 ( 0.27)
> > loopback 196 threads 1.00 ( 0.11) +1.34 ( 0.17)
> > loopback 224 threads 1.00 ( 0.11) +1.67 ( 0.22)
> >
> > Each schbench test is a:
> > schbench -m $job -t 28 -r 100 -s 30000 -c 30000
> > schbench.latency_90%_us
> > ========
> > case load baseline(std%) compare%( std%)
> > normal 1 mthread 1.00 ( 31.22) -7.36 ( 20.25)*
> > normal 2 mthreads 1.00 ( 2.45) -0.48 ( 1.79)
> > normal 4 mthreads 1.00 ( 1.69) +0.45 ( 0.64)
> > normal 8 mthreads 1.00 ( 5.47) +9.81 ( 14.28)
>
>
> Following are the results from a dual socket Zen3 platform (2 x 64C/128T) running with
> various NPS configurations:
>
> Following is the NUMA configuration for each NPS mode on the system:
>
> NPS1: Each socket is a NUMA node.
> Total 2 NUMA nodes in the dual socket machine.
>
> Node 0: 0-63, 128-191
> Node 1: 64-127, 192-255
>
> NPS2: Each socket is further logically divided into 2 NUMA regions.
> Total 4 NUMA nodes exist over 2 sockets.
>
> Node 0: 0-31, 128-159
> Node 1: 32-63, 160-191
> Node 2: 64-95, 192-223
> Node 3: 96-127, 224-255
>
> NPS4: Each socket is logically divided into 4 NUMA regions.
> Total 8 NUMA nodes exist over 2 sockets.
>
> Node 0: 0-15, 128-143
> Node 1: 16-31, 144-159
> Node 2: 32-47, 160-175
> Node 3: 48-63, 176-191
> Node 4: 64-79, 192-207
> Node 5: 80-95, 208-223
> Node 6: 96-111, 224-239
> Node 7: 112-127, 240-255
>
> Kernel versions:
> - tip: 5.19-rc2 tip sched/core
> - SIS_UTIL: 5.19-rc2 tip sched/core + this patch
>
> When we started testing, the tip was at:
> commit: f3dd3f674555 "sched: Remove the limitation of WF_ON_CPU on wakelist if wakee cpu is idle"
>
> ~~~~~~~~~
> hackbench
> ~~~~~~~~~
>
> NPS1
>
> Test: tip SIS_UTIL
> 1-groups: 4.64 (0.00 pct) 4.77 (-2.80 pct)
> 2-groups: 5.22 (0.00 pct) 5.17 (0.95 pct)
> 4-groups: 5.43 (0.00 pct) 5.29 (2.57 pct)
> 8-groups: 5.85 (0.00 pct) 5.75 (1.70 pct)
> 16-groups: 7.54 (0.00 pct) 7.62 (-1.06 pct)
>
> NPS2
>
> Test: tip SIS_UTIL
> 1-groups: 4.61 (0.00 pct) 4.79 (-3.90 pct)
> 2-groups: 5.00 (0.00 pct) 4.94 (1.20 pct)
> 4-groups: 5.14 (0.00 pct) 5.00 (2.72 pct)
> 8-groups: 5.66 (0.00 pct) 5.49 (3.00 pct)
> 16-groups: 7.54 (0.00 pct) 7.33 (2.78 pct)
>
> NPS4
>
> Test: tip SIS_UTIL
> 1-groups: 4.64 (0.00 pct) 4.69 (-1.07 pct)
> 2-groups: 5.03 (0.00 pct) 4.98 (0.99 pct)
> 4-groups: 5.66 (0.00 pct) 5.88 (-3.88 pct)
> 8-groups: 6.16 (0.00 pct) 6.14 (0.32 pct)
> 16-groups: 7.37 (0.00 pct) 9.60 (-30.25 pct) * (System overloaded)
> 16-groups: 7.38 (0.00 pct) 7.99 (-8.26 pct) [Verification Run]
>
> ~~~~~~~~
> schbench
> ~~~~~~~~
>
> NPS1
>
> #workers: tip SIS_UTIL
> 1: 23.50 (0.00 pct) 20.00 (14.89 pct)
> 2: 33.00 (0.00 pct) 29.50 (10.60 pct)
> 4: 43.50 (0.00 pct) 40.00 (8.04 pct)
> 8: 52.50 (0.00 pct) 50.00 (4.76 pct)
> 16: 70.00 (0.00 pct) 72.50 (-3.57 pct)
> 32: 103.50 (0.00 pct) 100.50 (2.89 pct)
> 64: 175.50 (0.00 pct) 183.00 (-4.27 pct)
> 128: 362.00 (0.00 pct) 368.50 (-1.79 pct)
> 256: 867.00 (0.00 pct) 867.00 (0.00 pct)
> 512: 60224.00 (0.00 pct) 58368.00 (3.08 pct)
>
> NPS2
>
> #workers: tip SIS_UTIL
> 1: 19.50 (0.00 pct) 17.00 (12.82 pct)
> 2: 31.50 (0.00 pct) 21.50 (31.74 pct)
> 4: 39.00 (0.00 pct) 31.50 (19.23 pct)
> 8: 54.50 (0.00 pct) 46.00 (15.59 pct)
> 16: 73.50 (0.00 pct) 78.00 (-6.12 pct) *
> 16: 74.00 (0.00 pct) 76.00 (-2.70 pct) [Verification Run]
> 32: 105.00 (0.00 pct) 100.00 (4.76 pct)
> 64: 181.50 (0.00 pct) 176.00 (3.03 pct)
> 128: 368.50 (0.00 pct) 368.00 (0.13 pct)
> 256: 885.00 (0.00 pct) 875.00 (1.12 pct)
> 512: 58752.00 (0.00 pct) 59520.00 (-1.30 pct)
>
> NPS4
>
> #workers: tip SIS_UTIL
> 1: 19.00 (0.00 pct) 15.50 (18.42 pct)
> 2: 32.00 (0.00 pct) 21.50 (32.81 pct)
> 4: 36.50 (0.00 pct) 29.00 (20.54 pct)
> 8: 47.50 (0.00 pct) 51.00 (-7.36 pct) *
> 8: 49.50 (0.00 pct) 44.50 (10.10 pct) [Verification Run]
> 16: 74.50 (0.00 pct) 78.00 (-4.69 pct) *
> 16: 81.50 (0.00 pct) 73.00 (10.42 pct) [Verification Run]
> 32: 98.50 (0.00 pct) 101.50 (-3.04 pct)
> 64: 182.00 (0.00 pct) 185.50 (-1.92 pct)
> 128: 369.50 (0.00 pct) 384.00 (-3.92 pct)
> 256: 920.00 (0.00 pct) 901.00 (2.06 pct)
> 512: 60224.00 (0.00 pct) 59136.00 (1.80 pct)
>
> ~~~~~~
> tbench
> ~~~~~~
>
> NPS1
>
> Clients: tip SIS_UTIL
> 1 444.41 (0.00 pct) 445.90 (0.33 pct)
> 2 879.23 (0.00 pct) 871.32 (-0.89 pct)
> 4 1648.83 (0.00 pct) 1648.23 (-0.03 pct)
> 8 3263.81 (0.00 pct) 3251.66 (-0.37 pct)
> 16 6011.19 (0.00 pct) 5997.98 (-0.21 pct)
> 32 12058.31 (0.00 pct) 11625.00 (-3.59 pct)
> 64 21258.21 (0.00 pct) 20847.13 (-1.93 pct)
> 128 30795.27 (0.00 pct) 29286.06 (-4.90 pct) *
> 128 29848.21 (0.00 pct) 31613.76 (5.91 pct) [Verification run]
> 256 25138.43 (0.00 pct) 51160.59 (103.51 pct)
> 512 51287.93 (0.00 pct) 51829.94 (1.05 pct)
> 1024 53176.97 (0.00 pct) 53211.32 (0.06 pct)
>
> NPS2
>
> Clients: tip SIS_UTIL
> 1 445.45 (0.00 pct) 447.64 (0.49 pct)
> 2 869.24 (0.00 pct) 868.63 (-0.07 pct)
> 4 1644.28 (0.00 pct) 1632.35 (-0.72 pct)
> 8 3120.83 (0.00 pct) 3157.00 (1.15 pct)
> 16 5972.29 (0.00 pct) 5679.18 (-4.90 pct) *
> 16 5668.91 (0.00 pct) 5701.57 (0.57 pct) [Verification run]
> 32 11776.38 (0.00 pct) 11253.96 (-4.43 pct) *
> 32 11668.66 (0.00 pct) 11272.02 (-3.39 pct) [Verification run]
> 64 20933.15 (0.00 pct) 20717.28 (-1.03 pct)
> 128 32195.00 (0.00 pct) 30400.11 (-5.57 pct) *
> 128 30248.19 (0.00 pct) 30781.22 (1.76 pct) [Verification run]
> 256 24641.52 (0.00 pct) 44940.70 (82.37 pct)
> 512 50806.96 (0.00 pct) 51937.08 (2.22 pct)
> 1024 51993.96 (0.00 pct) 52154.38 (0.30 pct)
>
> NPS4
>
> Clients: tip SIS_UTIL
> 1 442.10 (0.00 pct) 449.20 (1.60 pct)
> 2 870.94 (0.00 pct) 875.15 (0.48 pct)
> 4 1615.30 (0.00 pct) 1636.92 (1.33 pct)
> 8 3195.95 (0.00 pct) 3222.69 (0.83 pct)
> 16 5937.41 (0.00 pct) 5705.23 (-3.91 pct)
> 32 11800.41 (0.00 pct) 11337.91 (-3.91 pct)
> 64 20844.71 (0.00 pct) 20123.99 (-3.45 pct)
> 128 31003.62 (0.00 pct) 30219.39 (-2.52 pct)
> 256 27476.37 (0.00 pct) 49333.89 (79.55 pct)
> 512 52276.72 (0.00 pct) 50807.17 (-2.81 pct)
> 1024 51372.10 (0.00 pct) 51566.42 (0.37 pct)
>
> Note: tbench results for 256 clients are known to have
> run-to-run variation on the test machine. Any regression
> seen for this data point can be safely ignored.
>
> ~~~~~~
> Stream
> ~~~~~~
>
> - 10 runs
>
> NPS1
>
> Test: tip SIS_UTIL
> Copy: 152431.37 (0.00 pct) 165782.13 (8.75 pct)
> Scale: 187983.72 (0.00 pct) 180133.46 (-4.17 pct)
> Add: 211713.09 (0.00 pct) 205588.71 (-2.89 pct)
> Triad: 207302.09 (0.00 pct) 201103.81 (-2.98 pct)
>
> NPS2
>
> Test: tip SIS_UTIL
> Copy: 134099.98 (0.00 pct) 146487.66 (9.23 pct)
> Scale: 168404.01 (0.00 pct) 180551.46 (7.21 pct)
> Add: 184326.77 (0.00 pct) 197117.20 (6.93 pct)
> Triad: 182707.29 (0.00 pct) 195282.60 (6.88 pct)
>
> NPS4
>
> Test: tip SIS_UTIL
> Copy: 123058.63 (0.00 pct) 129624.17 (5.33 pct)
> Scale: 178696.74 (0.00 pct) 182611.49 (2.19 pct)
> Add: 169836.95 (0.00 pct) 179869.80 (5.90 pct)
> Triad: 170036.21 (0.00 pct) 177249.46 (4.24 pct)
>
> - 100 runs
>
> NPS1
>
> Test: tip SIS_UTIL
> Copy: 215860.05 (0.00 pct) 205953.63 (-4.58 pct)
> Scale: 207886.55 (0.00 pct) 203384.29 (-2.16 pct)
> Add: 253513.05 (0.00 pct) 243351.95 (-4.00 pct)
> Triad: 239471.82 (0.00 pct) 232221.90 (-3.02 pct)
>
> NPS2
>
> Test: tip SIS_UTIL
> Copy: 223991.94 (0.00 pct) 217920.18 (-2.71 pct)
> Scale: 205631.20 (0.00 pct) 213060.40 (3.61 pct)
> Add: 252292.90 (0.00 pct) 266848.26 (5.76 pct)
> Triad: 239838.71 (0.00 pct) 252369.51 (5.22 pct)
>
> NPS4
>
> Test: tip SIS_UTIL
> Copy: 225480.09 (0.00 pct) 218902.02 (-2.91 pct)
> Scale: 218218.59 (0.00 pct) 210839.93 (-3.38 pct)
> Add: 273879.95 (0.00 pct) 261761.62 (-4.42 pct)
> Triad: 255765.98 (0.00 pct) 246971.11 (-3.43 pct)
>
> ~~~~~~~~~~~~
> ycsb-mongodb
> ~~~~~~~~~~~~
>
> NPS1
>
> sched-tip: 301330.33 (var: 3.28)
> SIS_UTIL: 295360.33 (var: 0.76) (-1.98%)
>
> NPS2
>
> sched-tip: 287786.00 (var: 4.24)
> SIS_UTIL: 288888.33 (var: 1.58) (+0.38%)
>
> NPS4
>
> sched-tip: 293671.00 (var: 0.89)
> SIS_UTIL: 295682.33 (var: 0.92) (+0.68%)
>
>
> ~~~~~
> Notes
> ~~~~~
>
> o tbench reaches close to saturation at 256 clients, which was
> previously an unreliable data point and usually showed a regression
> compared to the result with 128 clients.
> o schbench improves for low worker counts. This is not strictly because
> of SIS_UTIL.
> o The most serious regressions seen seem to reverse with a rerun, suggesting
> some run-to-run variance on a few data points on tip as well as with
> this patch.
> o Any small regressions or improvements seen are within the margin of
> run-to-run variance seen on the tip as well. The results seem to be
> more stable with SIS_UTIL compared to SIS_PROP.
>
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> SIS Efficiency Stats for Hackbench
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
> Following are the system-wide SIS Efficiency stats for SIS_PROP and SIS_UTIL
> when running hackbench, with Mel's patch applied as-is on both kernels:
> (https://lore.kernel.org/lkml/20210726102247.21437-2-mgorman@techsingularity.net/)
>
> Metrics and the labels assigned for better readability
>
> SIS Search : Number of calls to select_idle_sibling
> SIS Domain Search : Number of times the domain was searched (fast path failed)
> SIS Scanned : Number of runqueues scanned
> SIS Failures : Number of SIS calls that failed to find an idle CPU
>
> SIS Logic: SIS_PROP SIS_UTIL Diff (SIS_UTIL wrt SIS_PROP)
>
> o 1-group
>
> Benchmark Results (sec) : 4.823 4.841 (-0.37 pct)
> Number of calls to select_idle_sibling : 3154397 3166395 (0.38 pct)
> Number of times the domain was searched (fast path failed) : 931530 1349865 (44.91 pct)
> Number of runqueues scanned : 7846894 11026784 (40.52 pct)
> Number of SIS calls that failed to find an idle CPU : 76463 118968 (55.59 pct)
> Avg. No. of runqueues scanned per domain search : 8.42 8.16 (-3.09 pct)
>
> o 2-groups
>
> Benchmark Results (sec) : 4.705 4.912 (-4.40 pct)
> Number of calls to select_idle_sibling : 3521182 4879821 (38.58 pct)
> Number of times the domain was searched (fast path failed) : 2049034 2979202 (45.40 pct)
> Number of runqueues scanned : 16717385 24743444 (48.01 pct)
> Number of SIS calls that failed to find an idle CPU : 366643 241789 (-34.05 pct)
> Avg. No. of runqueues scanned per domain search : 8.15 8.30 (1.84 pct)
>
> o 4-groups
>
> Benchmark Results (sec) : 5.503 5.268 (4.27 pct)
> Number of calls to select_idle_sibling : 13293368 11006088 (-17.21 pct)
> Number of times the domain was searched (fast path failed) : 5487436 4604635 (-16.09 pct)
> Number of runqueues scanned : 53028113 43238439 (-18.46 pct)
> Number of SIS calls that failed to find an idle CPU : 1171727 1040776 (-11.18 pct)
> Avg. No. of runqueues scanned per domain search : 9.66 9.39 (-2.80 pct)
>
> o 8-groups
>
> Benchmark Results (sec) : 5.794 5.752 (0.72 pct)
> Number of calls to select_idle_sibling : 26367244 24734896 (-6.19 pct)
> Number of times the domain was searched (fast path failed) : 11137288 9528659 (-14.44 pct)
> Number of runqueues scanned : 106216549 91895107 (-13.48 pct)
> Number of SIS calls that failed to find an idle CPU : 3154674 3012751 (-4.50 pct)
> Avg. No. of runqueues scanned per domain search : 9.53 9.64 (1.15 pct)
>
> o 16-groups
>
> Benchmark Results (sec) : 7.405 7.363 (0.57 pct)
> Number of calls to select_idle_sibling : 57323447 49331195 (-13.94 pct)
> Number of times the domain was searched (fast path failed) : 27853188 23892530 (-14.22 pct)
> Number of runqueues scanned : 248062785 180150761 (-27.38 pct)
> Number of SIS calls that failed to find an idle CPU : 12182277 14125960 (15.96 pct)
> Avg. No. of runqueues scanned per domain search : 8.90 7.54 (-15.28 pct)
>
> For 16 groups, when comparing SIS_UTIL to SIS_PROP, the
> "Avg. No. of runqueues scanned per domain search" (SIS Scanned / SIS
> Domain Search) goes down when we know there is a high chance we won't
> find an idle CPU, but it remains relatively high for lower numbers of
> groups, where the opportunity to find idle CPUs is greater.
>
> >
> > [..snip..]
> >
> > #define NUMA_IMBALANCE_MIN 2
> > diff --git a/kernel/sched/features.h b/kernel/sched/features.h
> > index 1cf435bbcd9c..3334a1b93fc6 100644
> > --- a/kernel/sched/features.h
> > +++ b/kernel/sched/features.h
> > @@ -61,6 +61,7 @@ SCHED_FEAT(TTWU_QUEUE, true)
> > * When doing wakeups, attempt to limit superfluous scans of the LLC domain.
> > */
> > SCHED_FEAT(SIS_PROP, true)
>
> SIS_PROP was disabled in our testing as follows:
>
> --
> -SCHED_FEAT(SIS_PROP, true)
> +SCHED_FEAT(SIS_PROP, false)
> --
>
> > +SCHED_FEAT(SIS_UTIL, true)
> >
> > /*
> > * Issue a WARN when we do multiple update_rq_clock() calls
> >
> > [..snip..]
> >
>
> With v4 on the current tip, I don't see any need for
> a special case for systems with smaller LLCs with
> SIS_PROP disabled and SIS_UTIL enabled. Even SIS Efficiency
> seems to be better with SIS_UTIL for hackbench.
>
> Tested-by: K Prateek Nayak <kprateek.nayak@....com>
Thanks again. Would you mind if I add a link to this test report in the next
patch version?
thanks,
Chenyu
> --
> Thanks and Regards,
> Prateek