Message-ID: <7592f555-21f8-284a-dbc7-0a6ab4d42c0d@amd.com>
Date: Wed, 17 Apr 2024 14:18:46 +0530
From: K Prateek Nayak <kprateek.nayak@....com>
To: David Vernet <void@...ifault.com>
Cc: mingo@...hat.com, peterz@...radead.org, juri.lelli@...hat.com,
bsegall@...gle.com, mgorman@...e.de, bristot@...hat.com,
vschneid@...hat.com, youssefesmat@...gle.com, joelaf@...gle.com,
roman.gushchin@...ux.dev, yu.c.chen@...el.com, gautham.shenoy@....com,
aboorvad@...ux.vnet.ibm.com, wuyun.abel@...edance.com, tj@...nel.org,
kernel-team@...a.com, linux-kernel@...r.kernel.org
Subject: Re: [PATCH v4 0/8] sched: Implement shared runqueue in fair.c
Hello David,
On 12/12/2023 6:01 AM, David Vernet wrote:
> This is v4 of the shared runqueue patchset. This patch set is based off
> of commit 418146e39891 ("freezer,sched: Clean saved_state when restoring
> it during thaw") on the sched/core branch of tip.git.
>
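For anyone skimming the thread, the high-level idea is a per-LLC queue
that runnable tasks are also linked into, which newly idle CPUs can
pull from before attempting a full newidle balance. Below is a minimal
conceptual sketch; the struct layout, the "shared_runq_node" member in
task_struct, and the function names are all illustrative, not the
actual code from the series:

    #include <linux/list.h>
    #include <linux/sched.h>
    #include <linux/spinlock.h>

    /* Illustrative per-LLC shared runqueue; assumes a hypothetical
     * shared_runq_node list head embedded in task_struct. */
    struct shared_runq {
            raw_spinlock_t   lock;
            struct list_head list;  /* FIFO of queued tasks */
    };

    /* On enqueue, additionally link the task into the LLC-wide queue. */
    static void shared_runq_enqueue(struct shared_runq *sq,
                                    struct task_struct *p)
    {
            raw_spin_lock(&sq->lock);
            list_add_tail(&p->shared_runq_node, &sq->list);
            raw_spin_unlock(&sq->lock);
    }

    /* A newly idle CPU first tries a cheap pull from the shared queue
     * before falling back to the regular newidle load balance. */
    static struct task_struct *shared_runq_pick(struct shared_runq *sq)
    {
            struct task_struct *p = NULL;

            raw_spin_lock(&sq->lock);
            if (!list_empty(&sq->list)) {
                    p = list_first_entry(&sq->list, struct task_struct,
                                         shared_runq_node);
                    list_del_init(&p->shared_runq_node);
            }
            raw_spin_unlock(&sq->lock);
            return p;
    }

The appeal is cheap work conservation within an LLC; the cost is
contention on the shared lock, which is one reason behavior can swing
with task count.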
> In prior versions of this patch set, I was observing consistent and
> statistically significant wins for several benchmarks when this feature
> was enabled, such as kernel compile and hackbench. After rebasing onto
> the latest sched/core on tip.git, I'm no longer observing these wins,
> and in fact observe some performance loss with SHARED_RUNQ on hackbench.
> I ended up bisecting this to when EEVDF was merged.
>
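For reference, the bisect flow for a regression like this is the
standard one; the endpoints below are placeholders, not the actual
commits used:

    $ git bisect start
    $ git bisect bad HEAD              # tip showing the regression
    $ git bisect good <last-good-sha>  # a known-good sched/core commit
    # then build, boot, run the benchmark, and mark each step with
    # "git bisect good" / "git bisect bad" until the culprit is found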
> As I mentioned in [0], our plan for now is to take a step back and
> re-evaluate how we want to proceed with this patch set. That said, I did
> want to send this out in the interim in case it could be of interest to
> anyone else who would like to continue to experiment with it.
I was doing a bunch of testing prior to OSPM in case folks wanted to
discuss the results there. Leaving the results of the SHARED_RUNQ runs
from a recent-ish tip below.
tl;dr

- I haven't dug deeper into the regressions yet, but the most prominent
  one is hackbench with a lower number of groups; the picture flips
  with a higher number of groups (an example invocation is given below).

Other benchmarks behave more or less similarly to the tip. Full results
below:
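A note on the hackbench rows before the tables: the runs sweep the
number of groups. A typical invocation looks like the following; the
exact loop count used for these runs isn't shown, so the one below is
a placeholder:

    $ hackbench -g <groups> -l 100000

Each group adds 20 sender and 20 receiver tasks, so the low-group cases
leave most of this machine idle, which may be part of why SHARED_RUNQ
behaves so differently there.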
o System Details
- 3rd Generation EPYC System
- 2 x 64C/128T
- NPS1 mode
o Kernels
tip:         tip:sched/core at commit 8cec3dd9e593
             ("sched/core: Simplify code by removing duplicate #ifdefs")
shared_runq: tip + this series
o Results
==================================================================
Test : hackbench
Units : Normalized time in seconds
Interpretation: Lower is better
Statistic : AMean
==================================================================
Case: tip[pct imp](CV) shared_runq[pct imp](CV)
1-groups 1.00 [ -0.00]( 1.80) 4.49 [-349.19](92.14)
2-groups 1.00 [ -0.00]( 1.76) 1.02 [ -2.17](19.20)
4-groups 1.00 [ -0.00]( 1.82) 0.86 [ 13.53]( 1.37)
8-groups 1.00 [ -0.00]( 1.40) 0.91 [ 8.73]( 2.39)
16-groups 1.00 [ -0.00]( 3.38) 0.91 [ 9.47]( 2.39)
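A note on reading these tables: each value is normalized to the tip
mean; "pct imp" is the percent improvement over tip (for lower-is-better
metrics, a reduction counts as an improvement); and CV is the
coefficient of variation (stddev / mean) across runs. The statistic
varies per test (AMean/HMean/Median); the sketch below uses the
arithmetic mean with made-up sample inputs, and is not the actual
reporting script:

    #include <math.h>
    #include <stdio.h>

    /* Normalized score, pct imp, and CV for one lower-is-better row
     * (e.g. hackbench time); for higher-is-better metrics the sign
     * of pct imp flips. */
    static void report(const double *tip, const double *test, int n)
    {
            double tm = 0.0, sm = 0.0, tv = 0.0, sv = 0.0;
            int i;

            for (i = 0; i < n; i++) {
                    tm += tip[i];
                    sm += test[i];
            }
            tm /= n;
            sm /= n;

            for (i = 0; i < n; i++) {
                    tv += (tip[i] - tm) * (tip[i] - tm);
                    sv += (test[i] - sm) * (test[i] - sm);
            }

            printf("normalized: %.2f\n", sm / tm);
            printf("pct imp:    %.2f\n", (tm - sm) / tm * 100.0);
            printf("CV%%:        tip %.2f  test %.2f\n",
                   sqrt(tv / n) / tm * 100.0,
                   sqrt(sv / n) / sm * 100.0);
    }

    int main(void)
    {
            /* Made-up sample timings in seconds; build with -lm. */
            double tip[]  = { 1.02, 0.98, 1.00, 1.01 };
            double test[] = { 0.88, 0.85, 0.86, 0.85 };

            report(tip, test, 4);
            return 0;
    }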
==================================================================
Test : tbench
Units : Normalized throughput
Interpretation: Higher is better
Statistic : AMean
==================================================================
Clients: tip[pct imp](CV) shared_runq[pct imp](CV)
1 1.00 [ 0.00]( 0.44) 1.00 [ -0.39]( 0.53)
2 1.00 [ 0.00]( 0.39) 1.00 [ -0.16]( 0.57)
4 1.00 [ 0.00]( 0.40) 1.00 [ -0.07]( 0.69)
8 1.00 [ 0.00]( 0.16) 0.99 [ -0.67]( 0.45)
16 1.00 [ 0.00]( 3.00) 1.03 [ 2.86]( 1.23)
32 1.00 [ 0.00]( 0.84) 1.00 [ -0.32]( 1.46)
64 1.00 [ 0.00]( 1.66) 0.98 [ -1.60]( 0.79)
128 1.00 [ 0.00]( 1.04) 1.01 [ 0.57]( 0.59)
256 1.00 [ 0.00]( 0.26) 0.98 [ -1.91]( 2.48)
512 1.00 [ 0.00]( 0.15) 1.00 [ 0.22]( 0.16)
1024 1.00 [ 0.00]( 0.20) 1.00 [ -0.37]( 0.02)
==================================================================
Test : stream-10
Units : Normalized Bandwidth, MB/s
Interpretation: Higher is better
Statistic : HMean
==================================================================
Test: tip[pct imp](CV) shared_runq[pct imp](CV)
Copy 1.00 [ 0.00]( 6.19) 1.10 [ 9.51]( 4.30)
Scale 1.00 [ 0.00]( 6.47) 1.03 [ 2.90]( 2.82)
Add 1.00 [ 0.00]( 6.50) 1.04 [ 3.82]( 3.10)
Triad 1.00 [ 0.00]( 5.70) 1.01 [ 1.49]( 4.30)
==================================================================
Test : stream-100
Units : Normalized Bandwidth, MB/s
Interpretation: Higher is better
Statistic : HMean
==================================================================
Test: tip[pct imp](CV) shared_runq[pct imp](CV)
Copy 1.00 [ 0.00]( 3.22) 1.04 [ 3.67]( 2.41)
Scale 1.00 [ 0.00]( 6.17) 1.03 [ 2.75]( 1.63)
Add 1.00 [ 0.00]( 5.12) 1.02 [ 2.42]( 2.10)
Triad 1.00 [ 0.00]( 2.29) 1.01 [ 1.11]( 1.59)
==================================================================
Test : netperf
Units : Normalized throughput
Interpretation: Higher is better
Statistic : AMean
==================================================================
Clients: tip[pct imp](CV) shared_runq[pct imp](CV)
1-clients 1.00 [ 0.00]( 0.17) 0.99 [ -0.65]( 0.40)
2-clients 1.00 [ 0.00]( 0.49) 1.00 [ -0.17]( 0.27)
4-clients 1.00 [ 0.00]( 0.65) 1.00 [ 0.09]( 0.69)
8-clients 1.00 [ 0.00]( 0.56) 1.00 [ -0.05]( 0.61)
16-clients 1.00 [ 0.00]( 0.78) 1.00 [ -0.23]( 0.58)
32-clients 1.00 [ 0.00]( 0.62) 0.98 [ -2.22]( 0.76)
64-clients 1.00 [ 0.00]( 1.41) 0.96 [ -3.75]( 1.19)
128-clients 1.00 [ 0.00]( 0.83) 0.98 [ -2.29]( 0.97)
256-clients 1.00 [ 0.00]( 4.60) 0.96 [ -4.18]( 3.02)
512-clients 1.00 [ 0.00](54.18) 0.99 [ -1.36](52.79)
==================================================================
Test : schbench
Units : Normalized 99th percentile latency in us
Interpretation: Lower is better
Statistic : Median
==================================================================
#workers: tip[pct imp](CV) shared_runq[pct imp](CV)
1 1.00 [ -0.00](34.63) 1.40 [-40.00]( 2.38)
2 1.00 [ -0.00]( 2.70) 1.08 [ -8.11]( 7.53)
4 1.00 [ -0.00]( 4.70) 0.93 [ 6.67]( 7.16)
8 1.00 [ -0.00]( 5.09) 0.92 [ 7.55](10.20)
16 1.00 [ -0.00]( 5.08) 0.97 [ 3.39]( 2.00)
32 1.00 [ -0.00]( 2.91) 1.03 [ -3.33]( 2.22)
64 1.00 [ -0.00]( 2.73) 0.99 [ 1.04]( 3.43)
128 1.00 [ -0.00]( 7.89) 0.99 [ 0.69]( 9.65)
256 1.00 [ -0.00](28.55) 0.92 [ 7.94](19.85)
512 1.00 [ -0.00]( 2.11) 1.13 [-12.69]( 6.41)
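Note the large CVs on the 1-worker (tip) and 256-worker rows; p99
latency at the extremes of this sweep tends to be quite run-to-run
sensitive. For reference, a typical invocation for such a sweep looks
like the line below, though the exact parameters used here are an
assumption on my part:

    $ schbench -m 2 -t <workers> -r 30
    # -m message threads, -t worker threads per message thread,
    # -r runtime in seconds; p99 is read from the reported histogram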
==================================================================
Test : DeathStarBench
Units : Normalized throughput
Interpretation: Higher is better
Statistic : Mean
==================================================================
Pinning scaling tip shared_runq (%diff)
1CCD 1 1.00 1.01 (%diff: 1.45%)
2CCD 2 1.00 1.01 (%diff: 1.71%)
4CCD 4 1.00 1.01 (%diff: 1.66%)
8CCD 8 1.00 1.00 (%diff: 0.63%)
--
>
> [0]: https://lore.kernel.org/all/20231204193001.GA53255@maniforge/
>
> v1 (RFC): https://lore.kernel.org/lkml/20230613052004.2836135-1-void@manifault.com/
> v2: https://lore.kernel.org/lkml/20230710200342.358255-1-void@manifault.com/
> v3: https://lore.kernel.org/all/20230809221218.163894-1-void@manifault.com/
>
> [..snip..]
>
I'll take a deeper look at the regressions soon and will update the
thread if I find anything interesting.
--
Thanks and Regards,
Prateek