Message-ID: <b8eb593a-cde9-bb23-2092-6b563ce814c8@amd.com>
Date: Tue, 22 Nov 2022 16:58:17 +0530
From: K Prateek Nayak <kprateek.nayak@....com>
To: Abel Wu <wuyun.abel@...edance.com>,
Peter Zijlstra <peterz@...radead.org>,
Ingo Molnar <mingo@...nel.org>, Mel Gorman <mgorman@...e.de>,
Vincent Guittot <vincent.guittot@...aro.org>,
Dietmar Eggemann <dietmar.eggemann@....com>,
Valentin Schneider <valentin.schneider@....com>
Cc: Josh Don <joshdon@...gle.com>, Chen Yu <yu.c.chen@...el.com>,
Tim Chen <tim.c.chen@...ux.intel.com>,
"Gautham R . Shenoy" <gautham.shenoy@....com>,
Aubrey Li <aubrey.li@...el.com>,
Qais Yousef <qais.yousef@....com>,
Juri Lelli <juri.lelli@...hat.com>,
Rik van Riel <riel@...riel.com>,
Yicong Yang <yangyicong@...wei.com>,
Barry Song <21cnbao@...il.com>, linux-kernel@...r.kernel.org
Subject: Re: [PATCH v6 0/4] sched/fair: Improve scan efficiency of SIS
Hello Abel,
Following are the results for hackbench with a larger number of
groups, ycsb-mongodb, Spec-JBB, and unixbench. Apart from a
regression in unixbench spawn in NPS2 and NPS4 modes and in
unixbench syscall in NPS4 mode, everything looks good.
Detailed results are below:
~~~~~~~~~~~~~~~~
~ ycsb-mongodb ~
~~~~~~~~~~~~~~~~
o NPS1:
tip: 131696.33 (var: 2.03%)
sis_core: 129519.00 (var: 1.46%) (-1.65%)
o NPS2:
tip: 129895.33 (var: 2.34%)
sis_core: 130774.33 (var: 2.57%) (+0.67%)
o NPS4:
tip: 131165.00 (var: 1.06%)
sis_core: 133547.33 (var: 3.90%) (+1.81%)
~~~~~~~~~~~~~~~~~
~ Spec-JBB NPS1 ~
~~~~~~~~~~~~~~~~~
Max-jOPS and Critical-jOPS are the same as on the tip kernel.
~~~~~~~~~~~~~
~ unixbench ~
~~~~~~~~~~~~~
-> unixbench-dhry2reg
o NPS1
kernel: tip sis_core
Min unixbench-dhry2reg-1 48876615.50 ( 0.00%) 48891544.00 ( 0.03%)
Min unixbench-dhry2reg-512 6260344658.90 ( 0.00%) 6282967594.10 ( 0.36%)
Hmean unixbench-dhry2reg-1 49299721.81 ( 0.00%) 49233828.70 ( -0.13%)
Hmean unixbench-dhry2reg-512 6267459427.19 ( 0.00%) 6288772961.79 * 0.34%*
CoeffVar unixbench-dhry2reg-1 0.90 ( 0.00%) 0.68 ( 24.66%)
CoeffVar unixbench-dhry2reg-512 0.10 ( 0.00%) 0.10 ( 7.54%)
o NPS2
kernel: tip sis_core
Min unixbench-dhry2reg-1 48828251.70 ( 0.00%) 48856709.20 ( 0.06%)
Min unixbench-dhry2reg-512 6244987739.10 ( 0.00%) 6271229549.10 ( 0.42%)
Hmean unixbench-dhry2reg-1 48869882.65 ( 0.00%) 49302481.81 ( 0.89%)
Hmean unixbench-dhry2reg-512 6261073948.84 ( 0.00%) 6272564898.35 ( 0.18%)
CoeffVar unixbench-dhry2reg-1 0.08 ( 0.00%) 0.87 (-945.28%)
CoeffVar unixbench-dhry2reg-512 0.23 ( 0.00%) 0.03 ( 85.94%)
o NPS4
kernel: tip sis_core
Min unixbench-dhry2reg-1 48523981.30 ( 0.00%) 49083957.50 ( 1.15%)
Min unixbench-dhry2reg-512 6253738837.10 ( 0.00%) 6271747119.10 ( 0.29%)
Hmean unixbench-dhry2reg-1 48781044.09 ( 0.00%) 49232218.87 * 0.92%*
Hmean unixbench-dhry2reg-512 6264428474.90 ( 0.00%) 6280484789.64 ( 0.26%)
CoeffVar unixbench-dhry2reg-1 0.46 ( 0.00%) 0.26 ( 42.63%)
CoeffVar unixbench-dhry2reg-512 0.17 ( 0.00%) 0.21 ( -26.72%)
-> unixbench-syscall
o NPS1
kernel: tip sis_core
Min unixbench-syscall-1 2975654.80 ( 0.00%) 2978489.40 ( -0.10%)
Min unixbench-syscall-512 7840226.50 ( 0.00%) 7822133.40 ( 0.23%)
Amean unixbench-syscall-1 2976326.47 ( 0.00%) 2980985.27 * -0.16%*
Amean unixbench-syscall-512 7850493.90 ( 0.00%) 7844527.50 ( 0.08%)
CoeffVar unixbench-syscall-1 0.03 ( 0.00%) 0.07 (-154.43%)
CoeffVar unixbench-syscall-512 0.13 ( 0.00%) 0.34 (-158.96%)
o NPS2
kernel: tip sis_core
Min unixbench-syscall-1 2969863.60 ( 0.00%) 2977936.50 ( -0.27%)
Min unixbench-syscall-512 8053157.60 ( 0.00%) 8072239.00 ( -0.24%)
Amean unixbench-syscall-1 2970462.30 ( 0.00%) 2981732.50 * -0.38%*
Amean unixbench-syscall-512 8061454.50 ( 0.00%) 8079287.73 * -0.22%*
CoeffVar unixbench-syscall-1 0.02 ( 0.00%) 0.11 (-527.26%)
CoeffVar unixbench-syscall-512 0.12 ( 0.00%) 0.08 ( 37.30%)
o NPS4
kernel: tip sis_core
Min unixbench-syscall-1 2971799.80 ( 0.00%) 2979335.60 ( -0.25%)
Min unixbench-syscall-512 7824196.90 ( 0.00%) 8155610.20 ( -4.24%)
Amean unixbench-syscall-1 2973045.43 ( 0.00%) 2982036.13 * -0.30%*
Amean unixbench-syscall-512 7826302.17 ( 0.00%) 8173026.57 * -4.43%* <-- Regression in syscall for larger worker count
CoeffVar unixbench-syscall-1 0.04 ( 0.00%) 0.09 (-139.63%)
CoeffVar unixbench-syscall-512 0.03 ( 0.00%) 0.20 (-701.13%)
-> unixbench-pipe
o NPS1
kernel: tip sis_core
Min unixbench-pipe-1 2894765.30 ( 0.00%) 2891505.30 ( -0.11%)
Min unixbench-pipe-512 329818573.50 ( 0.00%) 325610257.80 ( -1.28%)
Hmean unixbench-pipe-1 2898803.38 ( 0.00%) 2896940.25 ( -0.06%)
Hmean unixbench-pipe-512 330226401.69 ( 0.00%) 326311984.29 * -1.19%*
CoeffVar unixbench-pipe-1 0.14 ( 0.00%) 0.17 ( -21.99%)
CoeffVar unixbench-pipe-512 0.11 ( 0.00%) 0.20 ( -88.38%)
o NPS2
kernel: tip sis_core
Min unixbench-pipe-1 2895327.90 ( 0.00%) 2894798.20 ( -0.02%)
Min unixbench-pipe-512 328350065.60 ( 0.00%) 325681163.10 ( -0.81%)
Hmean unixbench-pipe-1 2899129.86 ( 0.00%) 2897067.80 ( -0.07%)
Hmean unixbench-pipe-512 329436096.80 ( 0.00%) 326023030.94 * -1.04%*
CoeffVar unixbench-pipe-1 0.12 ( 0.00%) 0.09 ( 21.96%)
CoeffVar unixbench-pipe-512 0.30 ( 0.00%) 0.12 ( 60.80%)
o NPS4
kernel: tip sis_core
Min unixbench-pipe-1 2901525.60 ( 0.00%) 2885730.80 ( -0.54%)
Min unixbench-pipe-512 330265873.90 ( 0.00%) 326730770.60 ( -1.07%)
Hmean unixbench-pipe-1 2906184.70 ( 0.00%) 2891616.18 * -0.50%*
Hmean unixbench-pipe-512 330854683.27 ( 0.00%) 327113296.63 * -1.13%*
CoeffVar unixbench-pipe-1 0.14 ( 0.00%) 0.19 ( -33.74%)
CoeffVar unixbench-pipe-512 0.16 ( 0.00%) 0.11 ( 31.75%)
-> unixbench-spawn
o NPS1
kernel: tip sis_core
Min unixbench-spawn-1 6536.50 ( 0.00%) 6000.30 ( -8.20%)
Min unixbench-spawn-512 72571.40 ( 0.00%) 70829.60 ( -2.40%)
Hmean unixbench-spawn-1 6811.16 ( 0.00%) 7016.11 ( 3.01%)
Hmean unixbench-spawn-512 72801.77 ( 0.00%) 71012.03 * -2.46%*
CoeffVar unixbench-spawn-1 3.69 ( 0.00%) 13.52 (-266.69%)
CoeffVar unixbench-spawn-512 0.27 ( 0.00%) 0.22 ( 18.25%)
o NPS2
kernel: tip sis_core
Min unixbench-spawn-1 7042.20 ( 0.00%) 7078.70 ( 0.52%)
Min unixbench-spawn-512 85571.60 ( 0.00%) 77362.60 ( -9.59%)
Hmean unixbench-spawn-1 7199.01 ( 0.00%) 7276.55 ( 1.08%)
Hmean unixbench-spawn-512 85717.77 ( 0.00%) 77923.73 * -9.09%* <-- Regression in spawn test for larger worker count
CoeffVar unixbench-spawn-1 3.50 ( 0.00%) 3.30 ( 5.70%)
CoeffVar unixbench-spawn-512 0.20 ( 0.00%) 0.82 (-304.88%)
o NPS4
kernel: tip sis_core
Min unixbench-spawn-1 7521.90 ( 0.00%) 8102.80 ( 7.72%)
Min unixbench-spawn-512 84245.70 ( 0.00%) 73074.50 ( -13.26%)
Hmean unixbench-spawn-1 7659.12 ( 0.00%) 8645.19 * 12.87%*
Hmean unixbench-spawn-512 84908.77 ( 0.00%) 73409.49 * -13.54%* <-- Regression in spawn test for larger worker count
CoeffVar unixbench-spawn-1 1.92 ( 0.00%) 5.78 (-200.56%)
CoeffVar unixbench-spawn-512 0.76 ( 0.00%) 0.41 ( 46.58%)
-> unixbench-execl
o NPS1
kernel: tip sis_core
Min unixbench-execl-1 5421.50 ( 0.00%) 5471.50 ( 0.92%)
Min unixbench-execl-512 11213.50 ( 0.00%) 11677.20 ( 4.14%)
Hmean unixbench-execl-1 5443.75 ( 0.00%) 5475.36 * 0.58%*
Hmean unixbench-execl-512 11311.94 ( 0.00%) 11804.52 * 4.35%*
CoeffVar unixbench-execl-1 0.38 ( 0.00%) 0.12 ( 69.22%)
CoeffVar unixbench-execl-512 1.03 ( 0.00%) 1.73 ( -68.91%)
o NPS2
kernel: tip sis_core
Min unixbench-execl-1 5089.10 ( 0.00%) 5405.40 ( 6.22%)
Min unixbench-execl-512 11772.70 ( 0.00%) 11917.20 ( 1.23%)
Hmean unixbench-execl-1 5321.65 ( 0.00%) 5421.41 ( 1.87%)
Hmean unixbench-execl-512 12201.73 ( 0.00%) 12327.95 ( 1.03%)
CoeffVar unixbench-execl-1 3.87 ( 0.00%) 0.28 ( 92.88%)
CoeffVar unixbench-execl-512 6.23 ( 0.00%) 5.78 ( 7.21%)
o NPS4
kernel: tip sis_core
Min unixbench-execl-1 5099.40 ( 0.00%) 5479.60 ( 7.46%)
Min unixbench-execl-512 11692.80 ( 0.00%) 12205.50 ( 4.38%)
Hmean unixbench-execl-1 5136.86 ( 0.00%) 5487.93 * 6.83%*
Hmean unixbench-execl-512 12053.71 ( 0.00%) 12712.96 ( 5.47%)
CoeffVar unixbench-execl-1 1.05 ( 0.00%) 0.14 ( 86.57%)
CoeffVar unixbench-execl-512 3.85 ( 0.00%) 5.86 ( -52.14%)
For the unixbench regressions, I do not see anything obvious jump out
in the perf traces captured with IBS. top shows over 99% utilization,
which would ideally mean there are not many updates to the mask.
I'll take a closer look at the spawn test case and get back to you.
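For reference, a sketch of how such IBS profiles can be captured on
this machine (the unixbench invocation and options below are
illustrative assumptions, not the exact commands used for the
numbers above):

  # Sample system-wide with the AMD IBS op PMU while the test runs
  $ perf record -e ibs_op// -a -- ./Run -c 512 spawn
  # Inspect the profile for hot paths
  $ perf report --stdio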
On 11/15/2022 4:58 PM, K Prateek Nayak wrote:
> Hello Abel,
>
> Thank you for taking a look at the report.
>
> On 11/15/2022 2:01 PM, Abel Wu wrote:
>> Hi Prateek, thanks very much for your detailed testing!
>>
>> On 11/14/22 1:45 PM, K Prateek Nayak wrote:
>>> Hello Abel,
>>>
>>> Sorry for the delay. I've tested the patch on a dual socket Zen3 system
>>> (2 x 64C/128T)
>>>
>>> tl;dr
>>>
>>> o I do not notice any regressions with the standard benchmarks.
>>> o schbench sees a nice improvement to the tail latency when the number
>>> of worker are equal to the number of cores in the system in NPS1 and
>>> NPS2 mode. (Marked with "^")
>>> o Few data points show improvements in tbench in NPS1 and NPS2 mode.
>>> (Marked with "^")
>>>
>>> I'm still in the process of running larger workloads. If there is any
>>> specific workload you would like me to run on the test system, please
>>> do let me know. Below is the detailed report:
>>
>> Not particularly in my mind, and I think testing larger workloads is
>> great. Thanks!
>>
>>>
>>> Following are the results from running standard benchmarks on a
>>> dual socket Zen3 (2 x 64C/128T) machine configured in different
>>> NPS modes.
>>>
>>> NPS modes are used to logically divide a single socket into
>>> multiple NUMA regions.
>>> Following is the NUMA configuration for each NPS mode on the system:
>>>
>>> NPS1: Each socket is a NUMA node.
>>> Total 2 NUMA nodes in the dual socket machine.
>>>
>>> Node 0: 0-63, 128-191
>>> Node 1: 64-127, 192-255
>>>
>>> NPS2: Each socket is further logically divided into 2 NUMA regions.
>>> Total 4 NUMA nodes exist over the 2 sockets.
>>> Node 0: 0-31, 128-159
>>> Node 1: 32-63, 160-191
>>> Node 2: 64-95, 192-223
>>> Node 3: 96-127, 224-255
>>>
>>> NPS4: Each socket is logically divided into 4 NUMA regions.
>>> Total 8 NUMA nodes exist over the 2 sockets.
>>> Node 0: 0-15, 128-143
>>> Node 1: 16-31, 144-159
>>> Node 2: 32-47, 160-175
>>> Node 3: 48-63, 176-191
>>> Node 4: 64-79, 192-207
>>> Node 5: 80-95, 208-223
>>> Node 6: 96-111, 224-239
>>> Node 7: 112-127, 240-255
>>>
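(Aside, for anyone reproducing this: the node-to-CPU mappings above
can be cross-checked with the standard interfaces, e.g.:

  $ numactl --hardware
  $ cat /sys/devices/system/node/node*/cpulist

listed here only for reproducibility.)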
>>> Benchmark Results:
>>>
>>> Kernel versions:
>>> - tip: 5.19.0 tip sched/core
>>> - sis_core: 5.19.0 tip sched/core + this series
>>>
>>> When we started testing, the tip was at:
>>> commit fdf756f71271 ("sched: Fix more TASK_state comparisons")
>>>
>>> ~~~~~~~~~~~~~
>>> ~ hackbench ~
>>> ~~~~~~~~~~~~~
>>>
>>> o NPS1
>>>
>>> Test: tip sis_core
>>> 1-groups: 4.06 (0.00 pct) 4.26 (-4.92 pct) *
>>> 1-groups: 4.14 (0.00 pct) 4.09 (1.20 pct) [Verification Run]
>>> 2-groups: 4.76 (0.00 pct) 4.71 (1.05 pct)
>>> 4-groups: 5.22 (0.00 pct) 5.11 (2.10 pct)
>>> 8-groups: 5.35 (0.00 pct) 5.31 (0.74 pct)
>>> 16-groups: 7.21 (0.00 pct) 6.80 (5.68 pct)
>>>
>>> o NPS2
>>>
>>> Test: tip sis_core
>>> 1-groups: 4.09 (0.00 pct) 4.08 (0.24 pct)
>>> 2-groups: 4.70 (0.00 pct) 4.69 (0.21 pct)
>>> 4-groups: 5.05 (0.00 pct) 4.92 (2.57 pct)
>>> 8-groups: 5.35 (0.00 pct) 5.26 (1.68 pct)
>>> 16-groups: 6.37 (0.00 pct) 6.34 (0.47 pct)
>>>
>>> o NPS4
>>>
>>> Test: tip sis_core
>>> 1-groups: 4.07 (0.00 pct) 3.99 (1.96 pct)
>>> 2-groups: 4.65 (0.00 pct) 4.59 (1.29 pct)
>>> 4-groups: 5.13 (0.00 pct) 5.00 (2.53 pct)
>>> 8-groups: 5.47 (0.00 pct) 5.43 (0.73 pct)
>>> 16-groups: 6.82 (0.00 pct) 6.56 (3.81 pct)
>>
>> Although each cpu will get 2.5 tasks with 16 groups, which can
>> be considered overloaded, I tested on an AMD EPYC 7Y83 machine and
>> the total cpu usage was ~82% (with some older kernel version),
>> so there is still lots of idle time.
>>
>> I guess cutting off at 16 groups is because it is loaded enough
>> compared to real workloads, so testing more groups might just
>> be a waste of time?
>
> The machine has 16 LLCs, so I capped the results at 16 groups.
> Previously I had seen some run-to-run variance with larger group counts,
> so I limited the reports to 16 groups. I'll run hackbench with a larger
> number of groups (32, 64, 128, 256) and get back to you with the
> results, along with results for a couple of long-running workloads.
~~~~~~~~~~~~~
~ Hackbench ~
~~~~~~~~~~~~~
$ perf bench sched messaging -p -l 50000 -g <groups>
o NPS1
kernel: tip sis_core
32-groups: 6.20 (0.00 pct) 5.86 (5.48 pct)
64-groups: 16.55 (0.00 pct) 15.21 (8.09 pct)
128-groups: 42.57 (0.00 pct) 34.63 (18.65 pct)
256-groups: 71.69 (0.00 pct) 67.11 (6.38 pct)
512-groups: 108.48 (0.00 pct) 110.23 (-1.61 pct)
o NPS2
kernel: tip sis_core
32-groups: 6.56 (0.00 pct) 5.60 (14.63 pct)
64-groups: 15.74 (0.00 pct) 14.45 (8.19 pct)
128-groups: 39.93 (0.00 pct) 35.33 (11.52 pct)
256-groups: 74.49 (0.00 pct) 69.65 (6.49 pct)
512-groups: 112.22 (0.00 pct) 113.75 (-1.36 pct)
o NPS4
kernel: tip sis_core
32-groups: 9.48 (0.00 pct) 5.64 (40.50 pct)
64-groups: 15.38 (0.00 pct) 14.13 (8.12 pct)
128-groups: 39.93 (0.00 pct) 34.47 (13.67 pct)
256-groups: 75.31 (0.00 pct) 67.98 (9.73 pct)
512-groups: 115.37 (0.00 pct) 111.15 (3.65 pct)
Note: Hackbench with 32 groups shows run-to-run variation
on tip but is more stable with sis_core. Hackbench with
64 groups and beyond is stable on both kernels.
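For completeness, the sweep above can be reproduced with a simple
loop around the command listed at the top of this section (a sketch;
the NPS mode itself is selected in BIOS, so only the group-count
loop is shown):

  $ for g in 32 64 128 256 512; do
        perf bench sched messaging -p -l 50000 -g $g
    done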
>
>>
>> Thanks & Best Regards,
>> Abel
>>
>> [..snip..]
>>
>
>
> --
> Thanks and Regards,
> Prateek
Apart from the couple of regressions in unixbench, everything looks good.
If you would like me to get any more data for any workload on the test
system, please do let me know.
--
Thanks and Regards,
Prateek