Message-ID: <ZKW4374Xc6YrRrEW@slm.duckdns.org>
Date: Wed, 5 Jul 2023 08:39:27 -1000
From: Tejun Heo <tj@...nel.org>
To: K Prateek Nayak <kprateek.nayak@....com>
Cc: Sandeep Dhavale <dhavale@...gle.com>, jiangshanlai@...il.com,
torvalds@...ux-foundation.org, peterz@...radead.org,
linux-kernel@...r.kernel.org, kernel-team@...a.com,
joshdon@...gle.com, brho@...gle.com, briannorris@...omium.org,
nhuck@...gle.com, agk@...hat.com, snitzer@...nel.org,
void@...ifault.com, kernel-team@...roid.com
Subject: Re: [PATCH 14/24] workqueue: Generalize unbound CPU pods
Hello,
On Wed, Jul 05, 2023 at 12:34:48PM +0530, K Prateek Nayak wrote:
> - Apart from tbench and netperf, the rest of the benchmarks show no
> difference out of the box.
Just looking at the data, it's a bit difficult for me to judge. I suppose
most of the differences are due to run-to-run variance? It'd be really
useful if the data contained the standard deviation (whether historical or
computed directly from multiple runs).
> - SPECjbb2015 Multi-JVM sees a small uplift to max-jOPS with certain
> affinity scopes.
>
> - tbench and netperf seem to be unhappy throughout. None of the affinity
> scopes seem to bring back the performance. I'll dig more into this.
Yeah, that seems pretty consistent.
> ~~~~~~~~~~
> ~ stream ~
> ~~~~~~~~~~
>
> o NPS1
>
> - 10 Runs:
>
> Test: base affinity_scopes
> Copy: 245676.59 (0.00 pct) 333646.71 (35.80 pct)
> Scale: 206545.41 (0.00 pct) 205706.04 (-0.40 pct)
> Add: 213506.82 (0.00 pct) 236739.07 (10.88 pct)
> Triad: 217679.43 (0.00 pct) 249263.46 (14.50 pct)
>
> - 100 Runs:
>
> Test: base affinity_scopes
> Copy: 318060.91 (0.00 pct) 326025.89 (2.50 pct)
> Scale: 213943.40 (0.00 pct) 207647.37 (-2.94 pct)
> Add: 237892.53 (0.00 pct) 232164.59 (-2.40 pct)
> Triad: 245672.84 (0.00 pct) 246333.21 (0.26 pct)
>
> o NPS2
>
> - 10 Runs:
>
> Test: base affinity_scopes
> Copy: 296632.20 (0.00 pct) 291153.63 (-1.84 pct)
> Scale: 206193.90 (0.00 pct) 216368.42 (4.93 pct)
> Add: 240363.50 (0.00 pct) 245954.23 (2.32 pct)
> Triad: 242748.60 (0.00 pct) 238606.20 (-1.70 pct)
>
> - 100 Runs:
>
> Test: base affinity_scopes
> Copy: 322535.79 (0.00 pct) 315020.03 (-2.33 pct)
> Scale: 217723.56 (0.00 pct) 220172.32 (1.12 pct)
> Add: 248117.72 (0.00 pct) 250557.17 (0.98 pct)
> Triad: 257768.66 (0.00 pct) 248264.00 (-3.68 pct)
>
> o NPS4
>
> - 10 Runs:
>
> Test: base affinity_scopes
> Copy: 274067.54 (0.00 pct) 302804.77 (10.48 pct)
> Scale: 224944.53 (0.00 pct) 230112.39 (2.29 pct)
> Add: 229318.09 (0.00 pct) 241939.54 (5.50 pct)
> Triad: 230175.89 (0.00 pct) 253613.85 (10.18 pct)
>
> - 100 Runs:
>
> Test: base affinity_scopes
> Copy: 338922.96 (0.00 pct) 348183.65 (2.73 pct)
> Scale: 240262.45 (0.00 pct) 245939.67 (2.36 pct)
> Add: 256968.24 (0.00 pct) 260657.01 (1.43 pct)
> Triad: 262644.16 (0.00 pct) 262286.46 (-0.13 pct)
The differences seem more consistent and pronounced here too. Is this just
expected run-to-run variance for this benchmark?
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> ~ Benchmarks run with multiple affinity scopes ~
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
> o NPS1
>
> - tbench
>
> Clients: base cpu cache numa system
> 1 450.40 (0.00 pct) 459.44 (2.00 pct) 457.12 (1.49 pct) 456.36 (1.32 pct) 456.75 (1.40 pct)
> 2 872.50 (0.00 pct) 869.68 (-0.32 pct) 890.59 (2.07 pct) 878.87 (0.73 pct) 890.14 (2.02 pct)
> 4 1630.13 (0.00 pct) 1621.24 (-0.54 pct) 1634.74 (0.28 pct) 1628.62 (-0.09 pct) 1646.57 (1.00 pct)
> 8 3139.90 (0.00 pct) 3044.58 (-3.03 pct) 3099.49 (-1.28 pct) 3081.43 (-1.86 pct) 3151.16 (0.35 pct)
> 16 6113.51 (0.00 pct) 5555.17 (-9.13 pct) 5465.09 (-10.60 pct) 5661.31 (-7.39 pct) 5742.58 (-6.06 pct)
> 32 11024.64 (0.00 pct) 9574.62 (-13.15 pct) 9282.62 (-15.80 pct) 9542.00 (-13.44 pct) 9916.66 (-10.05 pct)
> 64 19081.96 (0.00 pct) 15656.53 (-17.95 pct) 15176.12 (-20.46 pct) 16527.77 (-13.38 pct) 15097.97 (-20.87 pct)
> 128 30956.07 (0.00 pct) 28277.80 (-8.65 pct) 27662.76 (-10.63 pct) 27817.94 (-10.13 pct) 28925.78 (-6.55 pct)
> 256 42829.46 (0.00 pct) 38646.48 (-9.76 pct) 38355.27 (-10.44 pct) 37073.24 (-13.43 pct) 34391.01 (-19.70 pct)
> 512 42395.69 (0.00 pct) 36931.34 (-12.88 pct) 39259.49 (-7.39 pct) 36571.62 (-13.73 pct) 36245.55 (-14.50 pct)
> 1024 41973.51 (0.00 pct) 38817.07 (-7.52 pct) 38733.15 (-7.72 pct) 38864.45 (-7.40 pct) 35728.70 (-14.87 pct)
>
> - netperf
>
> base cpu cache numa system
> 1-clients: 100910.82 (0.00 pct) 103440.72 (2.50 pct) 102592.36 (1.66 pct) 103199.49 (2.26 pct) 103561.90 (2.62 pct)
> 2-clients: 99777.76 (0.00 pct) 100414.00 (0.63 pct) 100305.89 (0.52 pct) 99890.90 (0.11 pct) 101512.46 (1.73 pct)
> 4-clients: 97676.17 (0.00 pct) 96624.28 (-1.07 pct) 95966.77 (-1.75 pct) 97105.22 (-0.58 pct) 97972.11 (0.30 pct)
> 8-clients: 95413.11 (0.00 pct) 89926.72 (-5.75 pct) 89977.14 (-5.69 pct) 91020.10 (-4.60 pct) 92458.94 (-3.09 pct)
> 16-clients: 88961.66 (0.00 pct) 81295.02 (-8.61 pct) 79144.83 (-11.03 pct) 80216.42 (-9.83 pct) 85439.68 (-3.95 pct)
> 32-clients: 82199.83 (0.00 pct) 77914.00 (-5.21 pct) 75055.66 (-8.69 pct) 76813.94 (-6.55 pct) 80768.87 (-1.74 pct)
> 64-clients: 66094.87 (0.00 pct) 64419.91 (-2.53 pct) 63718.37 (-3.59 pct) 60370.40 (-8.66 pct) 66179.58 (0.12 pct)
> 128-clients: 43833.63 (0.00 pct) 42936.08 (-2.04 pct) 44554.69 (1.64 pct) 42666.82 (-2.66 pct) 45543.69 (3.90 pct)
> 256-clients: 38917.58 (0.00 pct) 24807.28 (-36.25 pct) 20517.01 (-47.28 pct) 21651.40 (-44.36 pct) 23778.87 (-38.89 pct)
>
> - SPECjbb2015 Multi-JVM
>
> max-jOPS critical-jOPS
> base: 0.00% 0.00%
> smt: -1.11% -1.84%
> cpu: 2.86% -1.35%
> cache: 2.86% -1.66%
> numa: 1.43% -1.49%
> system: 0.08% -0.41%
>
>
> I'll go dig deeper into the tbench and netperf regressions. I'm not sure
> why the regression is observed for all the affinity scopes. I'll look
> into IBS profile and see if something obvious pops up. Meanwhile if there
> is any specific data you would like me to collect or benchmark you would
> like me to test, let me know.
Yeah, that's a bit surprising given that in terms of affinity behavior
"numa" should be identical to base. The only meaningful differences that I
can think of are when the work item is assigned to its worker and maybe how
the pwq max_active limit is applied. Hmm... can you monitor the number of
kworker kthreads while running the benchmark? No need to do the whole
matrix, just comparing base against numa should be enough.
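Something as simple as the sampler below should be enough (just a sketch;
it counts tasks whose comm starts with "kworker/" once a second):

# Sample the number of kworker kthreads once per second while the
# benchmark runs; compare the counts between base and numa.
import os
import time

while True:
    count = 0
    for pid in os.listdir("/proc"):
        if not pid.isdigit():
            continue
        try:
            with open(f"/proc/{pid}/comm") as f:
                if f.read().startswith("kworker/"):
                    count += 1
        except OSError:
            pass  # task exited between listdir() and open()
    print(f"{time.strftime('%H:%M:%S')} kworkers={count}")
    time.sleep(1)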
Thanks.
--
tejun