Message-ID: <28563e2d-6746-e2c4-7d21-4ca39a82edc1@amd.com>
Date: Thu, 12 Oct 2023 07:52:10 +0530
From: K Prateek Nayak <kprateek.nayak@....com>
To: Anna-Maria Behnsen <anna-maria@...utronix.de>,
linux-kernel@...r.kernel.org
Cc: Peter Zijlstra <peterz@...radead.org>,
John Stultz <jstultz@...gle.com>,
Thomas Gleixner <tglx@...utronix.de>,
Eric Dumazet <edumazet@...gle.com>,
"Rafael J . Wysocki" <rafael.j.wysocki@...el.com>,
Arjan van de Ven <arjan@...radead.org>,
"Paul E . McKenney" <paulmck@...nel.org>,
Frederic Weisbecker <frederic@...nel.org>,
Rik van Riel <riel@...riel.com>,
Steven Rostedt <rostedt@...dmis.org>,
Sebastian Siewior <bigeasy@...utronix.de>,
Giovanni Gherdovich <ggherdovich@...e.cz>,
Lukasz Luba <lukasz.luba@....com>,
"Gautham R . Shenoy" <gautham.shenoy@....com>,
Srinivas Pandruvada <srinivas.pandruvada@...el.com>
Subject: Re: [PATCH v8 00/25] timer: Move from a push remote at enqueue to a
pull at expiry model
Hello Anna-Maria,
Happy to report that I don't see any regression with this version of the series.
I'll leave the detailed report below.
On 10/4/2023 6:04 PM, Anna-Maria Behnsen wrote:
> [..snip..]
>
> dbench test
> ^^^^^^^^^^^
>
> A dbench test starting X pairs of client/server processes is used to create
> load on the system. The measurable value is the throughput. The tests were
> executed on a zen3 machine. The base is the tip tree branch timers/core,
> which is based on v6.6-rc1.
>
> governor menu
>
> X pairs timers/core pull-model impact
> ----------------------------------------------
> 1 353.19 (0.19) 353.45 (0.30) 0.07%
> 2 700.10 (0.96) 687.00 (0.20) -1.87%
> 4 1329.37 (0.63) 1282.91 (0.64) -3.49%
> 8 2561.16 (1.28) 2493.56 (1.76) -2.64%
> 16 4959.96 (0.80) 4914.59 (0.64) -0.91%
> 32 9741.92 (3.44) 8979.83 (1.13) -7.82%
> 64 16535.40 (2.84) 16388.47 (4.02) -0.89%
> 128 22136.83 (2.42) 23174.50 (1.43) 4.69%
> 256 39256.77 (4.48) 38994.00 (0.39) -0.67%
> 512 36799.03 (1.83) 38091.10 (0.63) 3.51%
> 1024 32903.03 (0.86) 35370.70 (0.89) 7.50%
>
>
> governor teo
>
> X pairs timers/core pull-model impact
> ----------------------------------------------
> 1 350.83 (1.27) 352.45 (0.96) 0.46%
> 2 699.52 (0.85) 690.10 (0.54) -1.35%
> 4 1339.53 (1.99) 1294.71 (2.71) -3.35%
> 8 2574.10 (0.76) 2495.46 (1.97) -3.06%
> 16 4898.50 (1.74) 4783.06 (1.64) -2.36%
> 32 9115.50 (4.63) 9037.83 (1.58) -0.85%
> 64 16663.90 (3.80) 16042.00 (1.72) -3.73%
> 128 25044.93 (1.11) 23250.03 (1.08) -7.17%
> 256 38059.53 (1.70) 39658.57 (2.98) 4.20%
> 512 36369.30 (0.39) 38890.13 (0.36) 6.93%
> 1024 33956.83 (1.14) 35514.83 (0.29) 4.59%
o Machine details
- 3rd Generation EPYC System
- 2 sockets each with 64C/128T
- NPS1 (Each socket is a NUMA node)
- C2 Disabled (POLL and C1(MWAIT) remained enabled)
o Kernel Details
- tip: tip:sched/core at commit 238437d88cea ("intel_idle: Add ibrs_off
module parameter to force-disable IBRS") + min_deadline fix
commit 8dafa9d0eb1a ("sched/eevdf: Fix min_deadline heap
integrity") from tip:sched/urgent
- timer-pull: tip + this series as is
o Benchmark Results
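(A note on the format used in the tables that follow: each cell is the
normalized mean, the bracketed value is the percentage improvement over the
tip baseline, and the parenthesized value is the coefficient of variation
across runs. The small helper below is a purely illustrative sketch of that
convention, not the exact script used to produce this report.)

#include <math.h>
#include <stdio.h>

/*
 * Print one "norm [pct imp](CV)" cell. Assumes the usual definitions:
 * normalization against the baseline (tip) mean, percentage improvement
 * relative to the baseline (sign flipped when lower is better), and
 * CV = stddev / mean expressed as a percentage.
 */
static void print_cell(const double *samples, int n, double baseline_mean,
		       int higher_is_better)
{
	double mean = 0.0, var = 0.0, pct_imp, cv;
	int i;

	for (i = 0; i < n; i++)
		mean += samples[i];
	mean /= n;

	for (i = 0; i < n; i++)
		var += (samples[i] - mean) * (samples[i] - mean);
	var /= n;

	pct_imp = (mean - baseline_mean) / baseline_mean * 100.0;
	if (!higher_is_better)
		pct_imp = -pct_imp;
	cv = sqrt(var) / mean * 100.0;

	printf("%.2f [%6.2f](%5.2f)\n", mean / baseline_mean, pct_imp, cv);
}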
==================================================================
Test : hackbench
Units : Normalized time in seconds
Interpretation: Lower is better
Statistic : AMean
==================================================================
Case: tip[pct imp](CV) timer-pull[pct imp](CV)
1-groups 1.00 [ -0.00]( 2.11) 0.99 [ 1.44]( 3.34)
2-groups 1.00 [ -0.00]( 1.31) 1.01 [ -0.93]( 1.57)
4-groups 1.00 [ -0.00]( 1.04) 1.00 [ 0.44]( 1.11)
8-groups 1.00 [ -0.00]( 1.34) 0.99 [ 1.29]( 1.34)
16-groups 1.00 [ -0.00]( 2.45) 1.00 [ -0.40]( 2.78)
==================================================================
Test : tbench
Units : Normalized throughput
Interpretation: Higher is better
Statistic : AMean
==================================================================
Clients: tip[pct imp](CV) timer-pull[pct imp](CV)
1 1.00 [ 0.00]( 0.46) 1.01 [ 0.52]( 0.66)
2 1.00 [ 0.00]( 0.64) 0.99 [ -0.60]( 0.88)
4 1.00 [ 0.00]( 0.59) 0.99 [ -0.92]( 1.82)
8 1.00 [ 0.00]( 0.34) 1.00 [ -0.06]( 0.33)
16 1.00 [ 0.00]( 0.72) 0.99 [ -1.25]( 1.52)
32 1.00 [ 0.00]( 0.65) 0.98 [ -1.59]( 1.29)
64 1.00 [ 0.00]( 0.59) 0.99 [ -0.84]( 3.87)
128 1.00 [ 0.00]( 1.19) 1.00 [ 0.11]( 0.33)
256 1.00 [ 0.00]( 0.16) 1.01 [ 0.61]( 0.52)
512 1.00 [ 0.00]( 0.20) 1.01 [ 0.80]( 0.29)
1024 1.00 [ 0.00]( 0.06) 1.01 [ 1.06]( 0.59)
==================================================================
Test : stream-10
Units : Normalized Bandwidth, MB/s
Interpretation: Higher is better
Statistic : HMean
==================================================================
Test: tip[pct imp](CV) timer-pull[pct imp](CV)
Copy 1.00 [ 0.00]( 6.04) 1.04 [ 4.31]( 3.71)
Scale 1.00 [ 0.00]( 5.44) 1.01 [ 0.57]( 5.63)
Add 1.00 [ 0.00]( 5.44) 1.01 [ 0.99]( 5.46)
Triad 1.00 [ 0.00]( 7.82) 1.04 [ 4.14]( 5.68)
==================================================================
Test : stream-100
Units : Normalized Bandwidth, MB/s
Interpretation: Higher is better
Statistic : HMean
==================================================================
Test: tip[pct imp](CV) timer-pull[pct imp](CV)
Copy 1.00 [ 0.00]( 1.14) 1.00 [ 0.29]( 0.49)
Scale 1.00 [ 0.00]( 4.60) 1.03 [ 2.87]( 0.62)
Add 1.00 [ 0.00]( 4.91) 1.01 [ 1.36]( 1.34)
Triad 1.00 [ 0.00]( 0.60) 0.98 [ -1.50]( 4.24)
==================================================================
Test : netperf
Units : Normalized throughput
Interpretation: Higher is better
Statistic : AMean
==================================================================
Clients: tip[pct imp](CV) timer-pull[pct imp](CV)
1-clients 1.00 [ 0.00]( 0.61) 1.01 [ 1.25]( 0.48)
2-clients 1.00 [ 0.00]( 0.44) 1.00 [ 0.34]( 0.65)
4-clients 1.00 [ 0.00]( 0.75) 1.01 [ 0.98]( 1.26)
8-clients 1.00 [ 0.00]( 0.65) 1.01 [ 0.82]( 0.73)
16-clients 1.00 [ 0.00]( 0.49) 1.00 [ 0.37]( 0.99)
32-clients 1.00 [ 0.00]( 0.57) 0.98 [ -2.05]( 3.44)
64-clients 1.00 [ 0.00]( 1.67) 1.00 [ 0.00]( 1.74)
128-clients 1.00 [ 0.00]( 1.11) 1.01 [ 0.69]( 1.11)
256-clients 1.00 [ 0.00]( 2.64) 1.00 [ 0.00]( 3.79)
512-clients 1.00 [ 0.00](52.49) 1.00 [ 0.26](54.13)
==================================================================
Test : schbench
Units : Normalized 99th percentile latency in us
Interpretation: Lower is better
Statistic : Median
==================================================================
#workers: tip[pct imp](CV) timer-pull[pct imp](CV)
1 1.00 [ -0.00]( 8.41) 0.59 [ 40.54](40.25)
2 1.00 [ -0.00]( 5.29) 0.93 [ 7.50]( 9.01)
4 1.00 [ -0.00]( 1.32) 0.91 [ 9.09](12.33)
8 1.00 [ -0.00]( 9.52) 1.00 [ -0.00](15.02)
16 1.00 [ -0.00]( 1.61) 1.03 [ -3.23]( 2.37)
32 1.00 [ -0.00]( 7.27) 0.92 [ 7.69]( 1.59)
64 1.00 [ -0.00]( 6.96) 1.12 [-11.56]( 1.20)
128 1.00 [ -0.00]( 3.41) 1.06 [ -6.49]( 3.73)
256 1.00 [ -0.00](32.95) 1.02 [ -2.48](28.66)
512 1.00 [ -0.00]( 3.20) 0.99 [ 0.71]( 3.22)
==================================================================
Test : ycsb-cassandra
Units : Normalized throughput
Interpretation: Higher is better
Statistic : Mean
==================================================================
metric tip timer-pull (%diff)
throughput 1.00 1.01 (%diff: 0.75%)
==================================================================
Test : ycsb-mongodb
Units : Normalized throughput
Interpretation: Higher is better
Statistic : Mean
==================================================================
metric tip timer-pull (%diff)
throughput 1.00 1.00 (%diff: -0.49%)
==================================================================
Test : DeathStarBench
Units : Normalized throughput
Interpretation: Higher is better
Statistic : Mean
==================================================================
Pinning scaling tip timer-pull (%diff)
1CCD 1 1.00 1.01 (%diff: 0.75%)
2CCD 2 1.00 1.03 (%diff: 2.72%)
4CCD 4 1.00 1.00 (%diff: -0.28%)
8CCD 8 1.00 1.00 (%diff: 0.20%)
--
Thank you for debugging and helping fix the tbench regression.
If the series does not change drastically, feel free to add:
Tested-by: K Prateek Nayak <kprateek.nayak@....com>
>
>
>
> Ping Pong Observation
> ^^^^^^^^^^^^^^^^^^^^^
>
> During testing on a mostly idle machine, a ping pong game could be observed:
> a process_timeout timer is expired remotely on a non-idle CPU. Then the CPU
> where schedule_timeout() was executed to enqueue the timer comes out of
> idle, restarts the timer using schedule_timeout(), and goes back to idle
> again. This is due to the fair scheduler, which tries to keep the task on
> the CPU on which it previously executed.
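For readers following along: the scenario above boils down to a task that
repeatedly sleeps via schedule_timeout(). A minimal, purely illustrative
kthread body (my own sketch, not code from the series) that would show this
ping pong on a mostly idle system:

#include <linux/kthread.h>
#include <linux/sched.h>
#include <linux/jiffies.h>

static int pingpong_fn(void *data)
{
	while (!kthread_should_stop()) {
		/*
		 * schedule_timeout() arms a process_timeout timer for the
		 * current task and sleeps. With the pull model, the timer
		 * may be expired remotely on a non-idle CPU, but the wakeup
		 * brings the task back to the CPU it previously ran on,
		 * which leaves idle, re-arms the timer on the next loop
		 * iteration and goes idle again.
		 */
		set_current_state(TASK_INTERRUPTIBLE);
		schedule_timeout(msecs_to_jiffies(100));
	}
	return 0;
}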
>
>
>
>
> Possible Next Steps
> ~~~~~~~~~~~~~~~~~~~
>
> Simple deferrable timers are no longer required as they can be converted to
> global timers. If a CPU goes idle, a formerly deferrable timer will not
> prevent the CPU from sleeping as long as possible. Only the last migrator
> CPU has to take care of them. Deferrable timers with the timer pinned flag
> need to be expired on the specified CPU but must not prevent the CPU from
> going idle. They require their own timer base which is never taken into
> account when calculating the next expiry time. This conversion and the
> required cleanup will be done in a follow-up series.
>
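To make the distinction above concrete, here is a minimal, purely
illustrative snippet using the existing timer API (my own sketch, not code
from the series): a plain deferrable timer, which the follow-up would treat
as a global timer handled by the last migrator, and a deferrable + pinned
timer, which still has to expire on its CPU but should not wake it from
idle.

#include <linux/timer.h>
#include <linux/jiffies.h>

static struct timer_list housekeeping_timer;	/* deferrable, not pinned */
static struct timer_list percpu_stats_timer;	/* deferrable + pinned    */

static void housekeeping_fn(struct timer_list *t)
{
	/* Non-critical work; may run on whichever CPU handles global timers. */
}

static void percpu_stats_fn(struct timer_list *t)
{
	/* Per-CPU work; must run on the CPU the timer was armed on. */
}

static void example_setup(void)
{
	timer_setup(&housekeeping_timer, housekeeping_fn, TIMER_DEFERRABLE);
	timer_setup(&percpu_stats_timer, percpu_stats_fn,
		    TIMER_DEFERRABLE | TIMER_PINNED);

	mod_timer(&housekeeping_timer, jiffies + HZ);
	mod_timer(&percpu_stats_timer, jiffies + HZ);
}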
I'll keep an eye out for future versions of the series and will continue testing.
>
> [..snip..]
>
--
Thanks and Regards,
Prateek