linux-kernel - Re: [PATCH 0/2] Introduce SIS_CACHE to choose previous CPU during task wakeup

Message-ID: <ZTELfNyUDAL0s36C@chenyu5-mobl2.ccr.corp.intel.com>
Date:   Thu, 19 Oct 2023 18:57:00 +0800
From:   Chen Yu <yu.c.chen@...el.com>
To:     Madadi Vineeth Reddy <vineethr@...ux.ibm.com>
CC:     <cover.1695704179.git.yu.c.chen@...el.com>,
        Peter Zijlstra <peterz@...radead.org>,
        Mathieu Desnoyers <mathieu.desnoyers@...icios.com>,
        Ingo Molnar <mingo@...hat.com>,
        Vincent Guittot <vincent.guittot@...aro.org>,
        Juri Lelli <juri.lelli@...hat.com>,
        Tim Chen <tim.c.chen@...el.com>, Aaron Lu <aaron.lu@...el.com>,
        Dietmar Eggemann <dietmar.eggemann@....com>,
        "Steven Rostedt" <rostedt@...dmis.org>,
        Ben Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>,
        Daniel Bristot de Oliveira <bristot@...hat.com>,
        "Valentin Schneider" <vschneid@...hat.com>,
        K Prateek Nayak <kprateek.nayak@....com>,
        "Gautham R . Shenoy" <gautham.shenoy@....com>,
        <linux-kernel@...r.kernel.org>, Chen Yu <yu.chen.surf@...il.com>,
        "Madadi Vineeth Reddy" <vineethr@...ux.ibm.com>
Subject: Re: [PATCH 0/2] Introduce SIS_CACHE to choose previous CPU during
 task wakeup

On 2023-10-19 at 01:02:16 +0530, Madadi Vineeth Reddy wrote:
> Hi Chen Yu,
> On 17/10/23 16:39, Chen Yu wrote:
> > Hi Madadi,
> > 
> > On 2023-10-17 at 15:19:24 +0530, Madadi Vineeth Reddy wrote:
> >> Hi Chen Yu,
> >>
> >> On 26/09/23 10:40, Chen Yu wrote:
> >>> RFC -> v1:
> >>> - drop RFC
> >>> - Only record the short sleeping time for each task, to better honor the
> >>>   burst sleeping tasks. (Mathieu Desnoyers)
> >>> - Keep the forward movement monotonic for runqueue's cache-hot timeout value.
> >>>   (Mathieu Desnoyers, Aaron Lu)
> >>> - Introduce a new helper function cache_hot_cpu() that considers
> >>>   rq->cache_hot_timeout. (Aaron Lu)
> >>> - Add analysis of why inhibiting task migration could bring better throughput
> >>>   for some benchmarks. (Gautham R. Shenoy)
> >>> - Choose the first cache-hot CPU, if all idle CPUs are cache-hot in
> >>>   select_idle_cpu(). To avoid possible task stacking on the waker's CPU.
> >>>   (K Prateek Nayak)
> >>>
> >>> Thanks for your comments and review!
> >>>
> >>> ----------------------------------------------------------------------
> >>
> >> Regarding making the scan for finding an idle cpu longer vs cache benefits, 
> >> I ran some benchmarks.
> >>
> > 
> > Thanks very much for your interest and your time on the patch.
> > 
> >> Tested the patch on a Power system with 12 cores and a total of 96 CPUs.
> >> System has two NUMA nodes.
> >>
> >> Below are some of the benchmark results
> >>
> >> schbench 99.0th latency (lower is better)
> >> ========
> >> case            load        	baseline[pct imp](std%)       SIS_CACHE[pct imp]( std%)
> >> normal          1-mthreads      1.00 [ 0.00]( 3.66)            1.00 [  0.00]( 1.71)
> >> normal          2-mthreads      1.00 [ 0.00]( 4.55)            1.02 [ -2.00]( 3.00)
> >> normal          4-mthreads      1.00 [ 0.00]( 4.77)            0.96 [ +4.00]( 4.27)
> >> normal          6-mthreads      1.00 [ 0.00]( 60.37)           2.66 [ -166.00]( 23.67)
> >>
> >>
> >> schbench results show that the extra iterations in the search for an idle CPU in the
> >> select_idle_cpu() code path have little impact on wakeup latency. Interestingly, the numbers
> >> are slightly better for SIS_CACHE in the 4-mthreads case.
> > 
> > The 4% improvement is within std%, so I suppose we did not see much difference in 4 mthreads case.
> > 
> >> I think we can ignore the last case due to huge run to run variations.
> > 
> > Although the run-to-run variation is large, it seems that the decrease is within that range.
> > Prateek has also reported that when the system is overloaded there could be some regression
> > from schbench:
> > https://lore.kernel.org/lkml/27651e14-f441-c1e2-9b5b-b958d6aadc79@amd.com/
> > Could you also post the raw data printed by schbench? And maybe using the latest schbench could get the
> > latency in detail.
> >  
> 
> raw data by schbench(old) with 6-mthreads
> ======================
> 
> Baseline (5 runs)
> ========
> Latency percentiles (usec)
>         50.0000th: 22
>         75.0000th: 29
>         90.0000th: 34
>         95.0000th: 37
>         *99.0000th: 981 
>         99.5000th: 4424
>         99.9000th: 9200
>         min=0, max=29497
> 
> Latency percentiles (usec)
>         50.0000th: 23
>         75.0000th: 29
>         90.0000th: 35
>         95.0000th: 38
>         *99.0000th: 495 
>         99.5000th: 3924
>         99.9000th: 9872
>         min=0, max=29997
> 
> Latency percentiles (usec)
>         50.0000th: 23
>         75.0000th: 30
>         90.0000th: 36
>         95.0000th: 39
>         *99.0000th: 1326
>         99.5000th: 4744
>         99.9000th: 10000
>         min=0, max=23394
> 
> Latency percentiles (usec)
>         50.0000th: 23
>         75.0000th: 29
>         90.0000th: 34
>         95.0000th: 37
>         *99.0000th: 55
>         99.5000th: 3292
>         99.9000th: 9104
>         min=0, max=25196
> 
> Latency percentiles (usec)
>         50.0000th: 23
>         75.0000th: 29
>         90.0000th: 34
>         95.0000th: 37
>         *99.0000th: 711 
>         99.5000th: 4600
>         99.9000th: 9424
>         min=0, max=19997
> 
> SIS_CACHE (5 runs)
> =========
> Latency percentiles (usec)
>         50.0000th: 23
>         75.0000th: 30
>         90.0000th: 35
>         95.0000th: 38
>         *99.0000th: 1894
>         99.5000th: 5464
>         99.9000th: 10000
>         min=0, max=19157
> 
> Latency percentiles (usec)
>         50.0000th: 22
>         75.0000th: 29
>         90.0000th: 34
>         95.0000th: 37
>         *99.0000th: 2396
>         99.5000th: 6664
>         99.9000th: 10000
>         min=0, max=24029
> 
> Latency percentiles (usec)
>         50.0000th: 22
>         75.0000th: 29
>         90.0000th: 34
>         95.0000th: 37
>         *99.0000th: 2132
>         99.5000th: 6296
>         99.9000th: 10000
>         min=0, max=25313
> 
> Latency percentiles (usec)
>         50.0000th: 22
>         75.0000th: 29
>         90.0000th: 34
>         95.0000th: 37
>         *99.0000th: 1090
>         99.5000th: 6232
>         99.9000th: 9744
>         min=0, max=27264
> 
> Latency percentiles (usec)
>         50.0000th: 22
>         75.0000th: 29
>         90.0000th: 34
>         95.0000th: 38
>         *99.0000th: 1786
>         99.5000th: 5240
>         99.9000th: 9968
>         min=0, max=24754
> 
> As indicated, the above data has large run-to-run variation, and in general the
> 99th percentile latency is higher with SIS_CACHE.
> 
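
To put a number on that variation, here is a minimal sketch that summarizes the ten
*99.0000th values quoted above. It assumes std% means the population standard
deviation divided by the mean, which is a guess about how the earlier table was
produced and not something stated in this thread. The helper name summarize() is
hypothetical.

```python
from statistics import mean, pstdev

# 99.0th percentile latency (usec) from the five old-schbench runs quoted above.
baseline = [981, 495, 1326, 55, 711]
sis_cache = [1894, 2396, 2132, 1090, 1786]


def summarize(samples):
    """Return (mean, std%) for a list of latency samples.

    std% is the population standard deviation as a percentage of the mean.
    Using the population form (not the sample form) is an assumption.
    """
    m = mean(samples)
    return m, 100.0 * pstdev(samples) / m


base_mean, base_cv = summarize(baseline)
sis_mean, sis_cv = summarize(sis_cache)

# Latency normalized to the baseline mean; values above 1.0 mean SIS_CACHE is slower.
ratio = sis_mean / base_mean

print(f"baseline : mean={base_mean:.1f} usec, std%={base_cv:.2f}")
print(f"SIS_CACHE: mean={sis_mean:.1f} usec, std%={sis_cv:.2f}")
print(f"normalized SIS_CACHE latency: {ratio:.2f}")
```

With these inputs the baseline spread comes out at roughly 60% of its mean and the
SIS_CACHE spread at roughly 24%, so the baseline runs are the noisier of the two.
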
> 
> schbench(new) with 6-mthreads
> =============
> 
> Baseline
> ========
> Wakeup Latencies percentiles (usec) runtime 30 (s) (209403 total samples)
> 	  50.0th: 8          (43672 samples)
> 	  90.0th: 13         (83908 samples)
> 	* 99.0th: 20         (18323 samples)
> 	  99.9th: 775        (1785 samples)
> 	  min=1, max=8400
> Request Latencies percentiles (usec) runtime 30 (s) (209543 total samples)
> 	  50.0th: 13648      (59873 samples)
> 	  90.0th: 14000      (82767 samples)
> 	* 99.0th: 14320      (16342 samples)
> 	  99.9th: 18720      (1670 samples)
> 	  min=5130, max=38334
> RPS percentiles (requests) runtime 30 (s) (31 total samples)
> 	  20.0th: 6968       (8 samples)
> 	* 50.0th: 6984       (23 samples)
> 	  90.0th: 6984       (0 samples)
> 	  min=6835, max=6991
> average rps: 6984.77
> 
> 
> SIS_CACHE
> =========
> Wakeup Latencies percentiles (usec) runtime 30 (s) (209295 total samples)
> 	  50.0th: 9          (49267 samples)
> 	  90.0th: 14         (86522 samples)
> 	* 99.0th: 21         (14091 samples)
> 	  99.9th: 1146       (1722 samples)
> 	  min=1, max=10427
> Request Latencies percentiles (usec) runtime 30 (s) (209432 total samples)
> 	  50.0th: 13616      (62838 samples)
> 	  90.0th: 14000      (85301 samples)
> 	* 99.0th: 14352      (16149 samples)
> 	  99.9th: 21408      (1660 samples)
> 	  min=5070, max=41866
> RPS percentiles (requests) runtime 30 (s) (31 total samples)
> 	  20.0th: 6968       (7 samples)
> 	* 50.0th: 6984       (21 samples)
> 	  90.0th: 6984       (0 samples)
> 	  min=6672, max=6996
> average rps: 6981.07
> 
> With the new schbench, I didn't observe run-to-run variation, and there was no regression
> for SIS_CACHE at the 99th percentile.
>

Thanks for the test, Madadi. In my opinion, we can stick with the new schbench
in the future. I'll double-check on my test machine.

thanks,
Chenyu