linux-kernel - Re: [PATCHSET v1 wq/for-6.5] workqueue: Improve unbound workqueue execution locality

Message-ID: <bf4407c1-890b-6a77-1e2a-d3d988f660ed@amd.com>
Date:   Fri, 18 Aug 2023 09:35:53 +0530
From:   K Prateek Nayak <kprateek.nayak@....com>
To:     Tejun Heo <tj@...nel.org>
Cc:     torvalds@...ux-foundation.org, jiangshanlai@...il.com,
        peterz@...radead.org, linux-kernel@...r.kernel.org,
        kernel-team@...a.com, joshdon@...gle.com, brho@...gle.com,
        briannorris@...omium.org, nhuck@...gle.com, agk@...hat.com,
        snitzer@...nel.org, void@...ifault.com
Subject: Re: [PATCHSET v1 wq/for-6.5] workqueue: Improve unbound workqueue
 execution locality

Hello Tejun,

On 8/8/2023 8:28 AM, K Prateek Nayak wrote:
> Hello Tejun,
> 
> On 8/8/2023 6:52 AM, Tejun Heo wrote:
>> Hello,
>>
>> On Thu, May 18, 2023 at 02:16:45PM -1000, Tejun Heo wrote:
>>> Unbound workqueues used to spray work items inside each NUMA node, which
>>> isn't great on CPUs w/ multiple L3 caches. This patchset implements
>>> mechanisms to improve and configure execution locality.
>>
>> The patchset shows minor perf improvements for some but more importantly
>> gives users more control over worker placement which helps working around
>> some of the recently reported performance regressions. Prateek reported
>> concerning regressions with tbench but I couldn't reproduce it and can't see
>> how tbench would be affected at all given the benchmark doesn't involve
>> workqueue operations in any noticeable way.
>>
>> Assuming that the tbench difference was a testing artifact, I'm applying the
>> patchset to wq/for-6.6 so that it can receive wider testing. Prateek, I'd
>> really appreciate if you could repeat the test and see whether the
>> difference persists.
> 
> Sure. I'll retest with for-6.6 branch. Will post the results here once the
> tests are done. I'll repeat the same - test with the defaults and the ones
> that show any difference in results, I'll rerun them with various affinity
> scopes.

Sorry, I'm lagging on the test queue, but here are the results of the
standard benchmarks run on a dual socket 3rd Generation EPYC system
(2 x 64C/128T).

tl;dr

- No noticeable difference in performance.
- The netperf and tbench regressions are gone now, and the base numbers
  are also much higher than before (sorry for the false alarm!)

Following are the results:

base:	affinity-scopes-v2 branch at commit 18c8ae813156 ("workqueue:
	Disable per-cpu CPU hog detection when wq_cpu_intensive_thresh_us
	is 0")

affinity-scope:	affinity-scopes-v2 branch at commit a4da9f618d3e
	("workqueue: Add "Affinity Scopes and Performance" section to
	documentation")

==================================================================
Test          : hackbench
Units         : Normalized time in seconds
Interpretation: Lower is better
Statistic     : AMean
==================================================================
Case:          base[pct imp](CV)    affinity-scope[pct imp](CV)
 1-groups     1.00 [ -0.00]( 1.76)     0.99 [  0.56]( 3.02)
 2-groups     1.00 [ -0.00]( 1.52)     1.01 [ -0.94]( 2.36)
 4-groups     1.00 [ -0.00]( 1.49)     1.02 [ -2.20]( 1.91)
 8-groups     1.00 [ -0.00]( 1.12)     1.00 [ -0.00]( 0.93)
16-groups     1.00 [ -0.00]( 3.64)     1.01 [ -0.87]( 2.66)
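
For readers of the tables: each value is normalized against the base
mean, [pct imp] is the signed change relative to base (positive means
better, whichever direction the benchmark counts as better), and (CV)
is the coefficient of variation in percent. A minimal sketch of that
arithmetic is below. The function name and the choice of the sample
standard deviation are assumptions made for illustration; the scripts
that produced these numbers are not shown in this thread.

```python
import statistics


def summarize(samples, base_mean, lower_is_better):
    """Return (normalized mean, pct improvement, CV in percent).

    Hypothetical helper: assumes the arithmetic mean as the statistic
    and the sample standard deviation for the CV.
    """
    mean = statistics.mean(samples)
    # Coefficient of variation: run-to-run spread relative to the mean.
    cv = statistics.stdev(samples) / mean * 100.0
    normalized = mean / base_mean
    if lower_is_better:
        # e.g. hackbench time, schbench latency: a smaller mean is a gain.
        pct_imp = (base_mean - mean) / base_mean * 100.0
    else:
        # e.g. tbench / netperf throughput: a larger mean is a gain.
        pct_imp = (mean - base_mean) / base_mean * 100.0
    return normalized, pct_imp, cv


if __name__ == "__main__":
    # Made-up samples, not measured data.
    print(summarize([106.0, 107.0, 108.0], base_mean=100.0, lower_is_better=True))
```

A [pct imp] that is smaller than the CV of either column is best read
as run-to-run noise rather than a real change.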


==================================================================
Test          : tbench
Units         : Normalized throughput
Interpretation: Higher is better
Statistic     : AMean
==================================================================
Clients:  base[pct imp](CV)    affinity-scope[pct imp](CV)
    1     1.00 [  0.00]( 0.47)     1.00 [ -0.21]( 1.03)
    2     1.00 [  0.00]( 0.10)     1.00 [  0.00]( 0.45)
    4     1.00 [  0.00]( 1.60)     1.00 [ -0.18]( 0.83)
    8     1.00 [  0.00]( 0.13)     1.00 [ -0.26]( 0.59)
   16     1.00 [  0.00]( 1.69)     1.02 [  2.05]( 1.08)
   32     1.00 [  0.00]( 0.35)     1.00 [ -0.36]( 2.47)
   64     1.00 [  0.00]( 0.43)     1.00 [  0.45]( 2.54)
  128     1.00 [  0.00]( 0.31)     0.99 [ -0.82]( 0.58)
  256     1.00 [  0.00]( 1.81)     0.98 [ -1.84]( 1.80)
  512     1.00 [  0.00]( 0.54)     1.00 [  0.04]( 0.06)
 1024     1.00 [  0.00]( 0.13)     1.01 [  1.01]( 0.42)


==================================================================
Test          : stream-10
Units         : Normalized Bandwidth, MB/s
Interpretation: Higher is better
Statistic     : HMean
==================================================================
Test:     base[pct imp](CV)    affinity-scope[pct imp](CV)
 Copy     1.00 [  0.00]( 6.45)     1.03 [  2.50]( 5.75)
Scale     1.00 [  0.00]( 6.21)     1.03 [  3.36]( 0.75)
  Add     1.00 [  0.00]( 6.10)     1.04 [  4.23]( 1.81)
Triad     1.00 [  0.00]( 7.24)     1.03 [  3.49]( 3.41)
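
The stream results use the harmonic mean (HMean) rather than the
arithmetic mean. For rate-like metrics such as bandwidth, the harmonic
mean is pulled towards the slower runs, so a single fast outlier
inflates it less. A small illustration with made-up sample values (not
measured data; the helper name is hypothetical):

```python
import statistics


def amean_and_hmean(bandwidths_mb_s):
    """Return (arithmetic mean, harmonic mean) of per-run bandwidths."""
    return (
        statistics.mean(bandwidths_mb_s),
        statistics.harmonic_mean(bandwidths_mb_s),
    )


if __name__ == "__main__":
    # Two ordinary runs and one unusually fast run.
    amean, hmean = amean_and_hmean([100.0, 100.0, 400.0])
    print(f"AMean={amean:.2f} MB/s  HMean={hmean:.2f} MB/s")
```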


==================================================================
Test          : stream-100
Units         : Normalized Bandwidth, MB/s
Interpretation: Higher is better
Statistic     : HMean
==================================================================
Test:     base[pct imp](CV)    affinity-scope[pct imp](CV)
 Copy     1.00 [  0.00]( 1.98)     1.00 [  0.40]( 2.57)
Scale     1.00 [  0.00]( 4.88)     1.00 [ -0.07]( 5.11)
  Add     1.00 [  0.00]( 4.60)     1.00 [  0.23]( 5.21)
Triad     1.00 [  0.00]( 6.21)     1.03 [  2.85]( 2.55)


==================================================================
Test          : netperf
Units         : Normalized Throughput
Interpretation: Higher is better
Statistic     : AMean
==================================================================
Clients:        base[pct imp](CV)    affinity-scope[pct imp](CV)
  1-clients     1.00 [  0.00]( 1.84)     1.01 [  0.99]( 0.72)
  2-clients     1.00 [  0.00]( 0.64)     1.01 [  0.53]( 0.77)
  4-clients     1.00 [  0.00]( 0.75)     1.01 [  0.54]( 0.96)
  8-clients     1.00 [  0.00]( 0.83)     1.00 [ -0.21]( 1.03)
 16-clients     1.00 [  0.00]( 0.75)     1.00 [  0.31]( 0.81)
 32-clients     1.00 [  0.00]( 0.82)     1.00 [  0.12]( 1.57)
 64-clients     1.00 [  0.00]( 2.30)     1.00 [ -0.28]( 2.39)
128-clients     1.00 [  0.00]( 2.54)     0.99 [ -1.01]( 2.61)
256-clients     1.00 [  0.00]( 4.37)     1.01 [  1.23]( 2.69)
512-clients     1.00 [  0.00](48.73)     1.01 [  0.99](46.07)


==================================================================
Test          : schbench
Units         : Normalized 99th percentile latency in us
Interpretation: Lower is better
Statistic     : Median
==================================================================
#workers: base[pct imp](CV)    affinity-scope[pct imp](CV)
  1     1.00 [ -0.00]( 2.28)     1.00 [ -0.00]( 2.28)
  2     1.00 [ -0.00]( 8.55)     0.96 [  4.00]( 4.17)
  4     1.00 [ -0.00]( 3.81)     0.94 [  6.45]( 8.78)
  8     1.00 [ -0.00]( 2.78)     0.97 [  2.78]( 4.81)
 16     1.00 [ -0.00]( 1.22)     0.96 [  4.26]( 1.27)
 32     1.00 [ -0.00]( 2.02)     0.97 [  2.63]( 3.99)
 64     1.00 [ -0.00]( 5.65)     0.99 [  0.62]( 1.65)
128     1.00 [ -0.00]( 5.17)     0.98 [  1.91]( 8.12)
256     1.00 [ -0.00](10.79)     1.07 [ -6.82]( 7.18)
512     1.00 [ -0.00]( 1.24)     0.99 [  0.54]( 1.37)



==================================================================
Test          : Unixbench
Units         : Various, Throughput
Interpretation: Higher is better
Statistic     : AMean, Hmean (Specified)
==================================================================
                                           base                affinity-scope
Hmean     unixbench-dhry2reg-1      40947261.77 (   0.00%)    41078213.81 (   0.32%)
Hmean     unixbench-dhry2reg-512  6243140251.68 (   0.00%)  6240938691.75 (  -0.04%)
Amean     unixbench-syscall-1        2932806.37 (   0.00%)     2871035.50 *   2.11%*
Amean     unixbench-syscall-512      7689448.00 (   0.00%)     8406697.27 *   9.33%*
Hmean     unixbench-pipe-1           2577667.42 (   0.00%)     2497979.59 *  -3.09%*
Hmean     unixbench-pipe-512       363366036.45 (   0.00%)   356991588.20 *  -1.75%*
Hmean     unixbench-spawn-1             4446.97 (   0.00%)        4760.91 *   7.06%*
Hmean     unixbench-spawn-512          68983.49 (   0.00%)       68464.78 *  -0.75%*
Hmean     unixbench-execl-1             3894.20 (   0.00%)        3857.78 (  -0.94%)
Hmean     unixbench-execl-512          12716.76 (   0.00%)       13067.63 (   2.76%)


==================================================================
Test          : tbench (Various Affinity Scopes)
Units         : Normalized throughput
Interpretation: Higher is better
Statistic     : AMean
==================================================================
Clients:   base[pct imp](CV)         cpu[pct imp](CV)         smt[pct imp](CV)       cache[pct imp](CV)        numa[pct imp](CV)      system[pct imp](CV)
    1     1.00 [  0.00]( 0.47)     1.00 [  0.11]( 0.95)     1.00 [  0.23]( 1.97)     1.01 [  1.01]( 0.29)     1.00 [  0.07]( 0.57)     1.01 [  1.36]( 0.36)
    2     1.00 [  0.00]( 0.10)     1.01 [  1.14]( 0.27)     0.99 [ -0.84]( 0.51)     1.01 [  1.05]( 0.50)     1.00 [  0.24]( 0.75)     1.00 [ -0.29]( 1.22)
    4     1.00 [  0.00]( 1.60)     1.02 [  2.07]( 1.42)     1.02 [  1.65]( 0.46)     1.02 [  2.45]( 0.83)     1.00 [  0.36]( 1.33)     1.02 [  2.37]( 0.57)
    8     1.00 [  0.00]( 0.13)     1.00 [ -0.02]( 0.61)     1.00 [  0.14]( 0.57)     1.01 [  0.88]( 0.33)     1.00 [ -0.26]( 0.30)     1.01 [  0.90]( 1.48)
   16     1.00 [  0.00]( 1.69)     1.03 [  3.10]( 0.69)     1.04 [  3.66]( 1.36)     1.02 [  2.36]( 0.62)     1.02 [  1.61]( 1.63)     1.04 [  3.77]( 1.00)
   32     1.00 [  0.00]( 0.35)     0.97 [ -3.49]( 0.62)     0.97 [ -3.21]( 0.77)     1.00 [ -0.24]( 3.77)     0.96 [ -4.08]( 4.43)     0.97 [ -2.81]( 3.50)
   64     1.00 [  0.00]( 0.43)     1.00 [  0.20]( 1.66)     0.99 [ -0.61]( 0.81)     1.03 [  2.87]( 0.55)     1.02 [  2.16]( 2.31)     0.98 [ -2.32]( 3.63)
  128     1.00 [  0.00]( 0.31)     1.01 [  1.44]( 1.33)     1.01 [  0.72]( 0.46)     1.01 [  1.33]( 0.67)     1.00 [  0.38]( 0.58)     1.01 [  1.44]( 1.35)
  256     1.00 [  0.00]( 1.81)     0.98 [ -2.10]( 1.05)     0.97 [ -2.50]( 0.42)     0.97 [ -3.46]( 0.91)     0.99 [ -0.79]( 0.85)     0.96 [ -3.83]( 0.29)
  512     1.00 [  0.00]( 0.54)     1.00 [  0.37]( 1.12)     0.99 [ -1.33]( 0.44)     1.00 [ -0.19]( 0.94)     1.01 [  0.87]( 1.05)     0.99 [ -1.08]( 0.12)
 1024     1.00 [  0.00]( 0.13)     1.01 [  1.10]( 0.49)     1.00 [  0.47]( 0.28)     1.00 [  0.33]( 0.73)     1.00 [  0.48]( 0.69)     1.00 [  0.01]( 0.47)
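
This message does not say how the scopes in the table above were
selected. As a sketch only: per the workqueue documentation added by
this patchset, unbound workqueues that are exported to sysfs should
expose an affinity_scope file, and the default can be set with the
workqueue.default_affinity_scope kernel parameter. The sysfs root path
and the helper names below are assumptions for illustration; check them
against the kernel being tested.

```python
from pathlib import Path

# Scope names as used in the table columns above, plus "default".
VALID_SCOPES = ("default", "cpu", "smt", "cache", "numa", "system")

# Assumed sysfs location for workqueues created with WQ_SYSFS.
WQ_SYSFS_ROOT = "/sys/devices/virtual/workqueue"


def list_affinity_scopes(root=WQ_SYSFS_ROOT):
    """Map workqueue name -> contents of its affinity_scope file."""
    scopes = {}
    root_path = Path(root)
    if not root_path.is_dir():
        return scopes
    for wq_dir in sorted(root_path.iterdir()):
        scope_file = wq_dir / "affinity_scope"
        # Entries without this file (plain files, per-cpu workqueues)
        # are skipped.
        if scope_file.is_file():
            scopes[wq_dir.name] = scope_file.read_text().strip()
    return scopes


def set_affinity_scope(wq_name, scope, root=WQ_SYSFS_ROOT):
    """Write a new scope for one workqueue (needs root on a real system)."""
    if scope not in VALID_SCOPES:
        raise ValueError(f"unknown affinity scope: {scope!r}")
    (Path(root) / wq_name / "affinity_scope").write_text(scope + "\n")


if __name__ == "__main__":
    for name, scope in list_affinity_scopes().items():
        print(f"{name}: {scope}")
```

Validating the scope name before writing keeps a typo from reaching the
kernel interface; the root parameter exists only so the helpers can be
exercised against a scratch directory.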

==================================================================

ycsb-mongodb and DeathStarBench do not see any difference in
performance. I'll go and test more NPS modes / more machines.
Meanwhile, please feel free to add:

Tested-by: K Prateek Nayak <kprateek.nayak@....com>

--
Thanks and Regards,
Prateek