Message-ID: <49e98b00-80c7-b3a4-30fd-bccb382d002b@oracle.com>
Date: Tue, 21 Nov 2017 23:23:18 -0600
From: Atish Patra <atish.patra@...cle.com>
To: Peter Zijlstra <peterz@...radead.org>
Cc: Joel Fernandes <joelaf@...gle.com>,
LKML <linux-kernel@...r.kernel.org>,
Brendan Jackman <brendan.jackman@....com>,
Josef Bacik <jbacik@...com>, Ingo Molnar <mingo@...hat.com>
Subject: Re: [PATCH RFC 1/2] sched: Minimize the idle cpu selection race
window.
Here are the results of schbench (a scheduler latency benchmark) and uperf
(a networking benchmark).

Hardware config: 20-core (40 hyperthreaded cpus) x86 box.
schbench config: message threads = 2; time = 180s; worker threads = variable
uperf config: ping-pong test on the loopback interface with message size = 8k

Overall, both benchmarks seem happiest when the number of threads is close
to the number of cpus.
---------------------------------------------------------------------------
schbench Maximum Latency (lower is better)

                Base (4.14)            Base+pcpu
Num
Worker        Mean      Stdev       Mean      Stdev   Improvement (%)
10          3026.8    4987.12      523       474.35      82.7210
18         13854.6    1841.61    12945.6     125.19       6.5609
19         16457      2046.51    12985.4      48.46      21.0949
20         14995      2368.84    15838      2038.82      -5.6218
25         29952.2     107.72    29673.6     337.57       0.9301
30         30084       19.768   30096.2        7.782     -0.0405
---------------------------------------------------------------------------
The proposed fix seems to improve maximum latency for lower numbers of
threads, and it also reduces the variation (lower stdev). When the number
of threads is equal to or higher than the number of cpus, the benchmark
produces significantly higher latencies by its nature; results for the
higher-thread cases are included for completeness, but it is difficult to
conclude anything from them.

Next, per-percentile results are presented for each use case. The proposed
fix also improves latency across all percentiles for the configuration
(19 worker threads) that should saturate the system.
---------------------------------------------------------------------------
schbench Latency in usec (lower is better)

                Baseline (4.14)        Base+pcpu
Num
Worker        Mean      stdev       Mean      stdev   Improvement (%)

50th percentile
10            64.2      2.039       63.6      1.743       0.934
18            57.6      5.388       57        4.939       1.041
19            63        4.774       58        4           7.936
20            59.6      4.127       60.2      5.153      -1.006
25            78.4      0.489       78.2      0.748       0.255
30            96.2      0.748       96.4      1.019      -0.207

75th percentile
10            72        3.033       71.6      2.939       0.555
18            78        2.097       77.2      2.135       1.025
19            81.6      1.2         79.4      0.8         2.696
20            81        1.264       80.4      2.332       0.740
25           109.6      1.019      110        0          -0.364
30           781.4     50.902      731.8     70.6382      6.3475

90th percentile
10            80.4      3.666       80.6      2.576      -0.248
18            87.8      1.469       88        1.673      -0.227
19            92.8      0.979       90.6      0.489       2.370
20            92.6      1.019       92        2           0.647
25          8977.6   1277.160     9014.4    467.857      -0.409
30          9558.4    334.641     9507.2    320.383       0.5356

95th percentile
10            86.8      3.867       87.6      4.409      -0.921
18            95.4      1.496       95.2      2.039       0.209
19           102.6      1.624       99        0.894       3.508
20           103.2      1.326      102.2      2.481       0.968
25         12400       78.383    12406.4     37.318      -0.051
30         12336       40.477    12310.4     12.8         0.207

99th percentile
10            99.2      5.418      103.4      6.887      -4.233
18           115.2      2.561      114.6      3.611       0.5208
19           126.25     4.573      120.4      3.872       4.6336
20           145.4      3.09       133        1.41        8.5281
25         12988.8     15.676    12924.8     25.6         0.4927
30         12988.8     15.676    12956.8     32.633       0.2463

99.5th percentile
10           104.4      5.161      109.8      7.909      -5.172
18           127.6      7.391      124.2      4.214       2.6645
19          2712.2   4772.883      133.6      5.571      95.074
20          3707.8   2831.954     2844.2   4708.345      23.291
25         14032    1283.834     13008       0            7.2976
30         16550.4    886.382    13840     1218.355      16.376
---------------------------------------------------------------------------
Results from uperf (higher is better)
uperf config: loopback ping-pong test with message size = 8k

                Baseline (4.14)        Baseline+pcpu
Num
Threads       Mean      stdev       Mean      stdev   Improvement (%)
1             9.056     0.02        8.966     0.083      -0.993
2            17.664     0.13       17.448     0.303      -1.222
4            32.03      0.22       31.972     0.129      -0.181
8            58.198     0.31       58.588     0.198       0.670
16          101.018     0.67      100.056     0.455      -0.952
32          148.1      15.41      164.494     2.312      11.069
64          203.66      1.16      203.042     1.348      -0.3073
128         197.12      1.04      194.722     1.174      -1.2165
---------------------------------------------------------------------------
The race-window fix seems to help uperf as well at 32 threads (the count
closest to the number of cpus).
Regards,
Atish
On 11/04/2017 07:58 PM, Joel Fernandes wrote:
> Hi Peter,
>
> On Tue, Oct 31, 2017 at 1:20 AM, Peter Zijlstra <peterz@...radead.org> wrote:
>> On Tue, Oct 31, 2017 at 12:27:41AM -0500, Atish Patra wrote:
>>> Currently, multiple tasks can wake up on the same cpu from the
>>> select_idle_sibling() path in case they wake up simultaneously
>>> and last ran on the same llc. This happens because an idle cpu
>>> is not updated until the idle task is scheduled out. Any task waking
>>> during that period may potentially select that cpu as a wakeup
>>> candidate.
>>>
>>> Introduce a per-cpu variable that is set as soon as a cpu is
>>> selected for wakeup for any task. This prevents other tasks
>>> from selecting the same cpu again. Note: this does not close the
>>> race window but minimizes it to accessing the per-cpu variable.
>>> If two wakee tasks access the per-cpu variable at the same time,
>>> they may still select the same cpu. But it narrows the race
>>> window considerably.
>> The very most important question; does it actually help? What
>> benchmarks, give what numbers?
> I collected some numbers with an Android benchmark called Jankbench.
> Most tests didn't show an improvement or degradation with the patch.
> However, one of the tests, called "list view", consistently shows an
> improvement. Particularly striking is the improvement at the mean and
> the 25th percentile.
>
> For the list_view test, Jankbench pulls up a list of text and scrolls
> the list; this exercises the display pipeline in Android to render and
> display the animation as the scroll happens. For Android, lower frame
> times are considered quite important, as that means we are less likely
> to drop frames and can give the user a good experience rather than a
> perceivably poor one.
>
> For each frame, Jankbench measures the total time a frame takes and
> stores it in a DB (the time from which the app starts drawing, to when
> the rendering completes and the frame is submitted for display).
> Following is the distribution of frame times in ms.
>
> count 16304 (@60 fps, 4.5 minutes)
>
> Without patch With patch
> mean 5.196633 4.429641 (+14.75%)
> std 2.030054 2.310025
> 25% 5.606810 1.991017 (+64.48%)
> 50% 5.824013 5.716631 (+1.84%)
> 75% 5.987102 5.932751 (+0.90%)
> 95% 6.461230 6.301318 (+2.47%)
> 99% 9.828959 9.697076 (+1.34%)
>
> Note that although Android uses energy aware scheduling patches, I
> turned those off to bring the test as close to mainline as possible. I
> also backported Vincent's and Brendan's slow path fixes to the 4.4
> kernel that the Pixel 2 uses.
>
> Personally I am in favor of this patch considering this test data,
> but also because in the past I remember our teams had to deal with
> the same race issue and used cpusets to avoid it (although they
> probably tested with "energy aware" CPU selection kept on).
>
> thanks,
>
> - Joel