Message-ID: <eb45778d-3302-2ece-8d2e-319b1fcd071d@linux.ibm.com>
Date: Thu, 19 Oct 2023 01:02:16 +0530
From: Madadi Vineeth Reddy <vineethr@...ux.ibm.com>
To: Chen Yu <yu.c.chen@...el.com>,
cover.1695704179.git.yu.c.chen@...el.com
Cc: Peter Zijlstra <peterz@...radead.org>,
Mathieu Desnoyers <mathieu.desnoyers@...icios.com>,
Ingo Molnar <mingo@...hat.com>,
Vincent Guittot <vincent.guittot@...aro.org>,
Juri Lelli <juri.lelli@...hat.com>,
Tim Chen <tim.c.chen@...el.com>, Aaron Lu <aaron.lu@...el.com>,
Dietmar Eggemann <dietmar.eggemann@....com>,
Steven Rostedt <rostedt@...dmis.org>,
Ben Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>,
Daniel Bristot de Oliveira <bristot@...hat.com>,
Valentin Schneider <vschneid@...hat.com>,
K Prateek Nayak <kprateek.nayak@....com>,
"Gautham R . Shenoy" <gautham.shenoy@....com>,
linux-kernel@...r.kernel.org, Chen Yu <yu.chen.surf@...il.com>,
Madadi Vineeth Reddy <vineethr@...ux.ibm.com>
Subject: Re: [PATCH 0/2] Introduce SIS_CACHE to choose previous CPU during
task wakeup
Hi Chen Yu,
On 17/10/23 16:39, Chen Yu wrote:
> Hi Madadi,
>
> On 2023-10-17 at 15:19:24 +0530, Madadi Vineeth Reddy wrote:
>> Hi Chen Yu,
>>
>> On 26/09/23 10:40, Chen Yu wrote:
>>> RFC -> v1:
>>> - drop RFC
>>> - Only record the short sleeping time for each task, to better honor the
>>> burst sleeping tasks. (Mathieu Desnoyers)
>>> - Keep the forward movement monotonic for runqueue's cache-hot timeout value.
>>> (Mathieu Desnoyers, Aaron Lu)
>>> - Introduce a new helper function cache_hot_cpu() that considers
>>> rq->cache_hot_timeout. (Aaron Lu)
>>> - Add analysis of why inhibiting task migration could bring better throughput
>>> for some benchmarks. (Gautham R. Shenoy)
>>> - Choose the first cache-hot CPU, if all idle CPUs are cache-hot in
>>> select_idle_cpu(). To avoid possible task stacking on the waker's CPU.
>>> (K Prateek Nayak)
>>>
>>> Thanks for your comments and review!
>>>
>>> ----------------------------------------------------------------------
>>
>> Regarding making the scan for finding an idle cpu longer vs cache benefits,
>> I ran some benchmarks.
>>
>
> Thanks very much for your interest and your time on the patch.
>
>> Tested the patch on a Power system with 12 cores and 96 CPUs in total.
>> The system has two NUMA nodes.
>>
>> Below are some of the benchmark results
>>
>> schbench 99.0th latency (lower is better)
>> ========
>> case     load         baseline[pct imp](std%)     SIS_CACHE[pct imp](std%)
>> normal   1-mthreads   1.00 [   0.00]( 3.66)       1.00 [   0.00]( 1.71)
>> normal   2-mthreads   1.00 [   0.00]( 4.55)       1.02 [  -2.00]( 3.00)
>> normal   4-mthreads   1.00 [   0.00]( 4.77)       0.96 [  +4.00]( 4.27)
>> normal   6-mthreads   1.00 [   0.00](60.37)       2.66 [-166.00](23.67)
>>
>>
>> The schbench results show little impact on wakeup latency from the extra iterations spent
>> searching for an idle CPU in the select_idle_cpu() code path. Interestingly, the numbers are
>> slightly better for SIS_CACHE in the 4-mthreads case.
>
> The 4% improvement is within std%, so I suppose we did not see much difference in 4 mthreads case.
>
>> I think we can ignore the last case due to huge run to run variations.
>
> Although the run-to-run variation is large, it seems that the decrease is within that range.
> Prateek has also reported that when the system is overloaded there could be some regression
> from schbench:
> https://lore.kernel.org/lkml/27651e14-f441-c1e2-9b5b-b958d6aadc79@amd.com/
> Could you also post the raw data printed by schbench? And maybe using the latest schbench could get the
> latency in detail.
>
Raw data from schbench (old) with 6-mthreads
======================
Baseline (5 runs)
========
Latency percentiles (usec)
50.0000th: 22
75.0000th: 29
90.0000th: 34
95.0000th: 37
*99.0000th: 981
99.5000th: 4424
99.9000th: 9200
min=0, max=29497
Latency percentiles (usec)
50.0000th: 23
75.0000th: 29
90.0000th: 35
95.0000th: 38
*99.0000th: 495
99.5000th: 3924
99.9000th: 9872
min=0, max=29997
Latency percentiles (usec)
50.0000th: 23
75.0000th: 30
90.0000th: 36
95.0000th: 39
*99.0000th: 1326
99.5000th: 4744
99.9000th: 10000
min=0, max=23394
Latency percentiles (usec)
50.0000th: 23
75.0000th: 29
90.0000th: 34
95.0000th: 37
*99.0000th: 55
99.5000th: 3292
99.9000th: 9104
min=0, max=25196
Latency percentiles (usec)
50.0000th: 23
75.0000th: 29
90.0000th: 34
95.0000th: 37
*99.0000th: 711
99.5000th: 4600
99.9000th: 9424
min=0, max=19997
SIS_CACHE (5 runs)
=========
Latency percentiles (usec)
50.0000th: 23
75.0000th: 30
90.0000th: 35
95.0000th: 38
*99.0000th: 1894
99.5000th: 5464
99.9000th: 10000
min=0, max=19157
Latency percentiles (usec)
50.0000th: 22
75.0000th: 29
90.0000th: 34
95.0000th: 37
*99.0000th: 2396
99.5000th: 6664
99.9000th: 10000
min=0, max=24029
Latency percentiles (usec)
50.0000th: 22
75.0000th: 29
90.0000th: 34
95.0000th: 37
*99.0000th: 2132
99.5000th: 6296
99.9000th: 10000
min=0, max=25313
Latency percentiles (usec)
50.0000th: 22
75.0000th: 29
90.0000th: 34
95.0000th: 37
*99.0000th: 1090
99.5000th: 6232
99.9000th: 9744
min=0, max=27264
Latency percentiles (usec)
50.0000th: 22
75.0000th: 29
90.0000th: 34
95.0000th: 38
*99.0000th: 1786
99.5000th: 5240
99.9000th: 9968
min=0, max=24754
As noted earlier, the above data has large run-to-run variation. In general, the 99th
percentile latency is higher with SIS_CACHE than with the baseline.
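As a rough cross-check, the starred 99.0000th values from the ten runs above can be summarized with a short script. This is only a sketch: the use of the population standard deviation (pstdev) and the helper name summarize() are my assumptions, not necessarily how the earlier summary table was produced.

```python
from statistics import mean, pstdev

# Starred 99.0000th percentile latencies (usec) copied from the runs above.
baseline_p99 = [981, 495, 1326, 55, 711]
sis_cache_p99 = [1894, 2396, 2132, 1090, 1786]


def summarize(samples):
    """Return (mean, std%) where std% is the population stddev over the mean.

    Using pstdev rather than the sample stddev is an assumption.
    """
    avg = mean(samples)
    return avg, 100.0 * pstdev(samples) / avg


base_avg, base_std_pct = summarize(baseline_p99)
sis_avg, sis_std_pct = summarize(sis_cache_p99)

# Normalized value in the style of the summary tables: baseline is 1.00,
# and lower is better for latency.
ratio = sis_avg / base_avg

print(f"baseline : mean={base_avg:.1f} usec  std%={base_std_pct:.2f}")
print(f"SIS_CACHE: mean={sis_avg:.1f} usec  std%={sis_std_pct:.2f}")
print(f"SIS_CACHE normalized to baseline: {ratio:.2f}")
```

The single 55 usec baseline run contributes a large share of the baseline spread, which is worth keeping in mind when reading the normalized value.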
schbench(new) with 6-mthreads
=============
Baseline
========
Wakeup Latencies percentiles (usec) runtime 30 (s) (209403 total samples)
50.0th: 8 (43672 samples)
90.0th: 13 (83908 samples)
* 99.0th: 20 (18323 samples)
99.9th: 775 (1785 samples)
min=1, max=8400
Request Latencies percentiles (usec) runtime 30 (s) (209543 total samples)
50.0th: 13648 (59873 samples)
90.0th: 14000 (82767 samples)
* 99.0th: 14320 (16342 samples)
99.9th: 18720 (1670 samples)
min=5130, max=38334
RPS percentiles (requests) runtime 30 (s) (31 total samples)
20.0th: 6968 (8 samples)
* 50.0th: 6984 (23 samples)
90.0th: 6984 (0 samples)
min=6835, max=6991
average rps: 6984.77
SIS_CACHE
=========
Wakeup Latencies percentiles (usec) runtime 30 (s) (209295 total samples)
50.0th: 9 (49267 samples)
90.0th: 14 (86522 samples)
* 99.0th: 21 (14091 samples)
99.9th: 1146 (1722 samples)
min=1, max=10427
Request Latencies percentiles (usec) runtime 30 (s) (209432 total samples)
50.0th: 13616 (62838 samples)
90.0th: 14000 (85301 samples)
* 99.0th: 14352 (16149 samples)
99.9th: 21408 (1660 samples)
min=5070, max=41866
RPS percentiles (requests) runtime 30 (s) (31 total samples)
20.0th: 6968 (7 samples)
* 50.0th: 6984 (21 samples)
90.0th: 6984 (0 samples)
min=6672, max=6996
average rps: 6981.07
With the new schbench, I did not observe run-to-run variation, and there was no regression
with SIS_CACHE at the 99th percentile (wakeup latency 20 vs 21 usec, request latency
14320 vs 14352 usec).
>> producer_consumer avg time/access (lower is better)
>> ========
>> loads per consumer iteration   baseline[pct imp](std%)   SIS_CACHE[pct imp](std%)
>> 5                              1.00 [ 0.00]( 0.00)       0.87 [ +13.0]( 1.92)
>> 20                             1.00 [ 0.00]( 0.00)       0.92 [ +8.00]( 0.00)
>> 50                             1.00 [ 0.00]( 0.00)       1.00 [  0.00]( 0.00)
>> 100                            1.00 [ 0.00]( 0.00)       1.00 [  0.00]( 0.00)
>>
>> The patch's main goal of improving cache locality shows up here: SIS_CACHE improves this workload
>> without regressing it, mainly when the number of loads per consumer iteration is lower.
>>
>> hackbench normalized time in seconds (lower is better)
>> ========
>> case              load       baseline[pct imp](std%)   SIS_CACHE[pct imp](std%)
>> process-pipe      1-groups   1.00 [ 0.00]( 1.50)       1.02 [ -2.00]( 3.36)
>> process-pipe      2-groups   1.00 [ 0.00]( 4.76)       0.99 [ +1.00]( 5.68)
>> process-sockets   1-groups   1.00 [ 0.00]( 2.56)       1.00 [  0.00]( 0.86)
>> process-sockets   2-groups   1.00 [ 0.00]( 0.50)       0.99 [ +1.00]( 0.96)
>> threads-pipe      1-groups   1.00 [ 0.00]( 3.87)       0.71 [ +29.0]( 3.56)
>> threads-pipe      2-groups   1.00 [ 0.00]( 1.60)       0.97 [ +3.00]( 3.44)
>> threads-sockets   1-groups   1.00 [ 0.00]( 7.65)       0.99 [ +1.00]( 1.05)
>> threads-sockets   2-groups   1.00 [ 0.00]( 3.12)       1.03 [ -3.00]( 1.70)
>>
>> hackbench results are similar on both kernels, except for a 29% improvement in the
>> threads-pipe case with 1 group.
>>
>> Daytrader throughput (higher is better)
>> ========
>>
>> As per Ingo's suggestion, I ran a real-life workload, daytrader.
>>
>> baseline:
>> ===================================================================================
>> Instance 1
>> Throughputs        Ave. Resp. Time   Min. Resp. Time   Max. Resp. Time
>> ================   ===============   ===============   ===============
>> 10124.5            2                 0                 3970
>>
>> SIS_CACHE:
>> ===================================================================================
>> Instance 1
>> Throughputs        Ave. Resp. Time   Min. Resp. Time   Max. Resp. Time
>> ================   ===============   ===============   ===============
>> 10319.5            2                 0                 5771
>>
>> In the above run, daytrader performance was 2% better with SIS_CACHE.
>>
>
> Thanks for bringing this good news, a real life workload benefits from this change.
> I'll tune this patch a little bit to address the regression from schbench. Also to mention
> that, I'm working with Mathieu on his proposal to make the wakee choosing its previous
> CPU easier(similar to SIS_CACHE, but a little simpler), and we'll check how to make more
> platform benefit from this change.
> https://lore.kernel.org/lkml/20231012203626.1298944-1-mathieu.desnoyers@efficios.com/
Oh..ok. Thanks for the pointer!
>
> thanks,
> Chenyu
>
Thanks and Regards
Madadi Vineeth Reddy