Message-ID: <1891aa6c-037f-46a1-9584-17aaa63e4e74@arm.com>
Date: Fri, 13 Oct 2023 12:35:42 +0100
From: Lukasz Luba <lukasz.luba@....com>
To: Anna-Maria Behnsen <anna-maria@...utronix.de>
Cc: Peter Zijlstra <peterz@...radead.org>,
linux-kernel@...r.kernel.org, John Stultz <jstultz@...gle.com>,
Thomas Gleixner <tglx@...utronix.de>,
Eric Dumazet <edumazet@...gle.com>,
"Rafael J . Wysocki" <rafael.j.wysocki@...el.com>,
Arjan van de Ven <arjan@...radead.org>,
"Paul E . McKenney" <paulmck@...nel.org>,
Frederic Weisbecker <frederic@...nel.org>,
Rik van Riel <riel@...riel.com>,
Steven Rostedt <rostedt@...dmis.org>,
Sebastian Siewior <bigeasy@...utronix.de>,
Giovanni Gherdovich <ggherdovich@...e.cz>,
"Gautham R . Shenoy" <gautham.shenoy@....com>,
Srinivas Pandruvada <srinivas.pandruvada@...el.com>,
K Prateek Nayak <kprateek.nayak@....com>
Subject: Re: [PATCH v8 00/25] timer: Move from a push remote at enqueue to a
pull at expiry model
Hi Anna-Maria,
On 10/4/23 13:34, Anna-Maria Behnsen wrote:
> Hi,
>
[snip]
>
>
> Testing
> ~~~~~~~
>
> Enqueue
> ^^^^^^^
>
> The impact of wasting cycles during enqueue by using the heuristic, in
> contrast to always queueing the timer on the local CPU, was measured
> with a micro benchmark: a timer is enqueued and dequeued in a loop with
> 1000 repetitions on an isolated CPU, and the time the loop takes is
> measured. A quarter of the remaining CPUs was kept busy. This
> measurement was repeated several times. With the patch queue the
> average duration was reduced by approximately 25%.
>
> 145ns plain v6
> 109ns v6 with patch queue
>
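(For other readers: I assume the measurement loop is roughly like the
sketch below. This is only my reconstruction from the description
above, not the actual test code; the timer period, loop count and
module boilerplate are illustrative.)

#include <linux/module.h>
#include <linux/timer.h>
#include <linux/ktime.h>

static struct timer_list bench_timer;

static void bench_timer_fn(struct timer_list *t)
{
	/* Never expected to fire: the timer is removed while pending. */
}

static int __init timer_bench_init(void)
{
	ktime_t start;
	s64 delta_ns;
	int i;

	timer_setup(&bench_timer, bench_timer_fn, 0);

	start = ktime_get();
	for (i = 0; i < 1000; i++) {
		/* Enqueue far in the future so the callback never runs. */
		mod_timer(&bench_timer, jiffies + 10 * HZ);
		del_timer(&bench_timer);
	}
	delta_ns = ktime_to_ns(ktime_sub(ktime_get(), start));

	pr_info("timer_bench: %lld ns per enqueue/dequeue\n",
		delta_ns / 1000);
	return 0;
}

static void __exit timer_bench_exit(void)
{
	del_timer_sync(&bench_timer);
}

module_init(timer_bench_init);
module_exit(timer_bench_exit);
MODULE_LICENSE("GPL");
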
>
> Furthermore, the impact on residence in deep idle states on an idle
> system was investigated. The patch queue doesn't degrade this behavior.
>
> dbench test
> ^^^^^^^^^^^
>
> A dbench test starting X pairs of clients and servers is used to create
> load on the system. The measured value is the throughput. The tests
> were executed on a Zen3 machine. The baseline is the tip tree branch
> timers/core, which is based on v6.6-rc1.
>
> governor menu
>
> X pairs    timers/core       pull-model        impact
> ------------------------------------------------------
>       1     353.19 (0.19)     353.45 (0.30)     0.07%
>       2     700.10 (0.96)     687.00 (0.20)    -1.87%
>       4    1329.37 (0.63)    1282.91 (0.64)    -3.49%
>       8    2561.16 (1.28)    2493.56 (1.76)    -2.64%
>      16    4959.96 (0.80)    4914.59 (0.64)    -0.91%
>      32    9741.92 (3.44)    8979.83 (1.13)    -7.82%
>      64   16535.40 (2.84)   16388.47 (4.02)    -0.89%
>     128   22136.83 (2.42)   23174.50 (1.43)     4.69%
>     256   39256.77 (4.48)   38994.00 (0.39)    -0.67%
>     512   36799.03 (1.83)   38091.10 (0.63)     3.51%
>    1024   32903.03 (0.86)   35370.70 (0.89)     7.50%
>
>
> governor teo
>
> X pairs    timers/core       pull-model        impact
> ------------------------------------------------------
>       1     350.83 (1.27)     352.45 (0.96)     0.46%
>       2     699.52 (0.85)     690.10 (0.54)    -1.35%
>       4    1339.53 (1.99)    1294.71 (2.71)    -3.35%
>       8    2574.10 (0.76)    2495.46 (1.97)    -3.06%
>      16    4898.50 (1.74)    4783.06 (1.64)    -2.36%
>      32    9115.50 (4.63)    9037.83 (1.58)    -0.85%
>      64   16663.90 (3.80)   16042.00 (1.72)    -3.73%
>     128   25044.93 (1.11)   23250.03 (1.08)    -7.17%
>     256   38059.53 (1.70)   39658.57 (2.98)     4.20%
>     512   36369.30 (0.39)   38890.13 (0.36)     6.93%
>    1024   33956.83 (1.14)   35514.83 (0.29)     4.59%
>
>
>
> Ping Pong Observation
> ^^^^^^^^^^^^^^^^^^^^^
>
> During testing on a mostly idle machine a ping pong game could be
> observed: a process_timeout timer expires remotely on a non-idle CPU.
> The CPU where schedule_timeout() was executed to enqueue the timer then
> comes out of idle, restarts the timer using schedule_timeout(), and
> goes back to idle again. This is due to the fair scheduler, which tries
> to keep a task on the CPU it previously executed on.
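(For reference, a simplified sketch of the mechanism behind this, as I
understand it from kernel/time/timer.c -- signal handling and the
remaining-jiffies accounting are elided, so this is not the verbatim
kernel code:)

struct process_timer {
	struct timer_list timer;
	struct task_struct *task;
};

static void process_timeout(struct timer_list *t)
{
	struct process_timer *timeout = from_timer(timeout, t, timer);

	/* Wakes the sleeping task; CFS prefers its previous CPU. */
	wake_up_process(timeout->task);
}

signed long schedule_timeout(signed long timeout)
{
	struct process_timer timer;

	timer.task = current;
	timer_setup_on_stack(&timer.timer, process_timeout, 0);
	mod_timer(&timer.timer, jiffies + timeout);
	schedule();		/* sleep; the timer may expire on a remote CPU */
	del_timer_sync(&timer.timer);
	destroy_timer_on_stack(&timer.timer);

	return 0;		/* remaining-jiffies accounting elided */
}
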
>
>
I have tested this on my 2 Arm boards, one with a mainline kernel and
one with an almost-mainline kernel. On both platforms the patchset
looks stable. The results w/ your patchset look better.

1. rockpi4b - mainline kernel (but no UI)

The cpumask was limited to only the 4 little CPUs, with the performance
governor for cpufreq and the menu governor for cpuidle.
1.1. perf bench sched pipe
     w/o patchset vs. w/ patchset

     avg [ops/sec] (more is better):
     23012.33 vs. 23154.33 (+0.6%)

     avg [usecs/op] (less is better):
     43.453 vs. 43.187 (-0.6%)

1.2. perf bench sched messaging
     w/o patchset vs. w/ patchset

     avg total time [s] (less is better):
     2.7855 vs. 2.7005 (-3.1%)
2. pixel6 (kernel v5.18 with backported patchset)

2.1. Speedometer 2.0 (JS test running in Chrome browser)
     w/o patchset vs. w/ patchset

     score (more is better):
     149 vs. 146 (-2%)

2.2. Geekbench 5 (more is better)
     w/o patchset vs. w/ patchset

     Single core:
     1025 vs. 1017 (-0.7%)

     Multi core:
     2756 vs. 2813 (+2%)
The performance looks good. Only one test, 'Speedometer', shows a
somewhat lower score, which is interesting.

Feel free to add:
Tested-by: Lukasz Luba <lukasz.luba@....com>
Regards,
Lukasz