Message-ID: <ZS5rhO5XysGOUn4M@chenyu5-mobl2.ccr.corp.intel.com>
Date: Tue, 17 Oct 2023 19:09:56 +0800
From: Chen Yu <yu.c.chen@...el.com>
To: <cover.1695704179.git.yu.c.chen@...el.com>
CC: Peter Zijlstra <peterz@...radead.org>,
Mathieu Desnoyers <mathieu.desnoyers@...icios.com>,
Ingo Molnar <mingo@...hat.com>,
"Vincent Guittot" <vincent.guittot@...aro.org>,
Juri Lelli <juri.lelli@...hat.com>,
Tim Chen <tim.c.chen@...el.com>, Aaron Lu <aaron.lu@...el.com>,
"Dietmar Eggemann" <dietmar.eggemann@....com>,
Steven Rostedt <rostedt@...dmis.org>,
Ben Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>,
"Daniel Bristot de Oliveira" <bristot@...hat.com>,
Valentin Schneider <vschneid@...hat.com>,
K Prateek Nayak <kprateek.nayak@....com>,
"Gautham R . Shenoy" <gautham.shenoy@....com>,
Madadi Vineeth Reddy <vineethr@...ux.ibm.com>,
<linux-kernel@...r.kernel.org>, Chen Yu <yu.chen.surf@...il.com>
Subject: Re: [PATCH 0/2] Introduce SIS_CACHE to choose previous CPU during
task wakeup
Hi Madadi,
On 2023-10-17 at 15:19:24 +0530, Madadi Vineeth Reddy wrote:
> Hi Chen Yu,
>
> On 26/09/23 10:40, Chen Yu wrote:
> > RFC -> v1:
> > - drop RFC
> > - Only record the short sleeping time for each task, to better honor the
> > burst sleeping tasks. (Mathieu Desnoyers)
> > - Keep the forward movement monotonic for runqueue's cache-hot timeout value.
> > (Mathieu Desnoyers, Aaron Lu)
> > - Introduce a new helper function cache_hot_cpu() that considers
> > rq->cache_hot_timeout. (Aaron Lu)
> > - Add analysis of why inhibiting task migration could bring better throughput
> > for some benchmarks. (Gautham R. Shenoy)
> > - Choose the first cache-hot CPU, if all idle CPUs are cache-hot in
> > select_idle_cpu(). To avoid possible task stacking on the waker's CPU.
> > (K Prateek Nayak)
> >
> > Thanks for your comments and review!
> >
> > ----------------------------------------------------------------------
>
> Regarding making the scan for finding an idle cpu longer vs cache benefits,
> I ran some benchmarks.
>
Thanks very much for your interest and your time on the patch.
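
To recap the mechanism being benchmarked here, the sketch below shows roughly how it works.
cache_hot_cpu() and rq->cache_hot_timeout are the names used in the patch; the surrounding code
(including the wrapper name sis_cache_scan()) is only a simplified illustration, not the exact
implementation:

static bool cache_hot_cpu(int cpu, int *hot_cpu)
{
        struct rq *rq = cpu_rq(cpu);

        /*
         * rq->cache_hot_timeout is moved forward (monotonically) at dequeue
         * time, based on the departing task's recent short sleep duration,
         * so the CPU stays reserved for that task for a short while.
         */
        if (sched_clock_cpu(cpu) >= rq->cache_hot_timeout)
                return false;           /* reservation expired: a normal idle CPU */

        if (*hot_cpu == -1)
                *hot_cpu = cpu;         /* remember the first cache-hot idle CPU */

        return true;
}

/* Simplified stand-in for the scan done in select_idle_cpu(): */
static int sis_cache_scan(struct cpumask *cpus, int target)
{
        int cpu, hot_cpu = -1;

        for_each_cpu_wrap(cpu, cpus, target + 1) {
                if (!available_idle_cpu(cpu))
                        continue;
                /* Skip idle CPUs reserved for a task that recently slept. */
                if (cache_hot_cpu(cpu, &hot_cpu))
                        continue;
                return cpu;
        }

        /*
         * If every idle CPU is cache-hot, pick the first one found rather
         * than falling back to the waker's CPU, to avoid task stacking.
         */
        return hot_cpu;
}
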
> Tested the patch on a Power system with 12 cores, 96 CPUs in total.
> The system has two NUMA nodes.
>
> Below are some of the benchmark results
>
> schbench 99.0th latency (lower is better)
> ========
> case load baseline[pct imp](std%) SIS_CACHE[pct imp]( std%)
> normal 1-mthreads 1.00 [ 0.00]( 3.66) 1.00 [ 0.00]( 1.71)
> normal 2-mthreads 1.00 [ 0.00]( 4.55) 1.02 [ -2.00]( 3.00)
> normal 4-mthreads 1.00 [ 0.00]( 4.77) 0.96 [ +4.00]( 4.27)
> normal 6-mthreads 1.00 [ 0.00]( 60.37) 2.66 [ -166.00]( 23.67)
>
>
> schbench results show that there is not much impact on wakeup latency from the extra iterations spent
> searching for an idle CPU in the select_idle_cpu() code path, and interestingly the numbers are slightly
> better for SIS_CACHE in the 4-mthreads case.
The 4% improvement is within the std%, so I suppose we did not see much difference in the 4-mthreads case.
> I think we can ignore the last case due to huge run-to-run variations.
Although the run-to-run variation is large, it seems that the decrease is within that range.
Prateek has also reported that when the system is overloaded there could be some regression in schbench:
https://lore.kernel.org/lkml/27651e14-f441-c1e2-9b5b-b958d6aadc79@amd.com/
Could you also post the raw data printed by schbench? Using the latest schbench might also provide the
latency distribution in more detail.
> producer_consumer avg time/access (lower is better)
> ========
> loads per consumer iteration baseline[pct imp](std%) SIS_CACHE[pct imp]( std%)
> 5 1.00 [ 0.00]( 0.00) 0.87 [ +13.0]( 1.92)
> 20 1.00 [ 0.00]( 0.00) 0.92 [ +8.00]( 0.00)
> 50 1.00 [ 0.00]( 0.00) 1.00 [ 0.00]( 0.00)
> 100 1.00 [ 0.00]( 0.00) 1.00 [ 0.00]( 0.00)
>
> The patch's main goal of improving cache locality is reflected in the fact that SIS_CACHE shows an
> improvement only in this workload, mainly when the number of loads per consumer iteration is lower.
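
To illustrate what "loads per consumer iteration" means here, this kind of workload is roughly a
ping-pong like the sketch below (illustrative only, not the actual benchmark source). The producer
dirties a buffer and wakes the consumer; the consumer then performs N loads from that buffer. With a
small N, the cost per access is dominated by the wakeup itself and by whether the consumer lands on a
CPU where the freshly written data is still cache-hot, which is exactly what SIS_CACHE tries to
preserve:

#include <pthread.h>
#include <string.h>

#define BUF_SIZE        (64 * 1024)
#define ITERATIONS      100000

static char buf[BUF_SIZE];
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
static int ready;
static long loads_per_iter = 5; /* the "loads per consumer iteration" knob */

static void *consumer(void *arg)
{
        volatile char sink = 0;

        (void)arg;
        for (long it = 0; it < ITERATIONS; it++) {
                pthread_mutex_lock(&lock);
                while (!ready)
                        pthread_cond_wait(&cond, &lock);        /* short sleep until woken */

                /* Load what the producer just wrote: cheap if the cache is still warm. */
                for (long i = 0; i < loads_per_iter; i++)
                        sink += buf[(i * 64) % BUF_SIZE];

                ready = 0;
                pthread_cond_signal(&cond);     /* let the producer start the next round */
                pthread_mutex_unlock(&lock);
        }
        (void)sink;
        return NULL;
}

int main(void)
{
        pthread_t tid;

        pthread_create(&tid, NULL, consumer, NULL);

        for (long it = 0; it < ITERATIONS; it++) {
                pthread_mutex_lock(&lock);
                while (ready)
                        pthread_cond_wait(&cond, &lock);        /* wait until the last round was consumed */
                memset(buf, (int)it, sizeof(buf));              /* producer dirties the buffer */
                ready = 1;
                pthread_cond_signal(&cond);                     /* wake the consumer */
                pthread_mutex_unlock(&lock);
        }

        pthread_join(tid, NULL);
        return 0;
}
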
>
> hackbench normalized time in seconds (lower is better)
> ========
> case load baseline[pct imp](std%) SIS_CACHE[pct imp]( std%)
> process-pipe 1-groups 1.00 [ 0.00]( 1.50) 1.02 [ -2.00]( 3.36)
> process-pipe 2-groups 1.00 [ 0.00]( 4.76) 0.99 [ +1.00]( 5.68)
> process-sockets 1-groups 1.00 [ 0.00]( 2.56) 1.00 [ 0.00]( 0.86)
> process-sockets 2-groups 1.00 [ 0.00]( 0.50) 0.99 [ +1.00]( 0.96)
> threads-pipe 1-groups 1.00 [ 0.00]( 3.87) 0.71 [ +29.0]( 3.56)
> threads-pipe 2-groups 1.00 [ 0.00]( 1.60) 0.97 [ +3.00]( 3.44)
> threads-sockets 1-groups 1.00 [ 0.00]( 7.65) 0.99 [ +1.00]( 1.05)
> threads-sockets 2-groups 1.00 [ 0.00]( 3.12) 1.03 [ -3.00]( 1.70)
>
> hackbench results are similar on both kernels, except for the threads-pipe case with 1 group, which
> shows an improvement of 29%.
>
> Daytrader throughput (higher is better)
> ========
>
> As per Ingo's suggestion, I ran a real-life workload, daytrader.
>
> baseline:
> ===================================================================================
> Instance 1
> Throughputs Ave. Resp. Time Min. Resp. Time Max. Resp. Time
> ================ =============== =============== ===============
> 10124.5 2 0 3970
>
> SIS_CACHE:
> ===================================================================================
> Instance 1
> Throughputs Ave. Resp. Time Min. Resp. Time Max. Resp. Time
> ================ =============== =============== ===============
> 10319.5 2 0 5771
>
> In the above run, daytrader performance was 2% better with SIS_CACHE.
>
Thanks for bringing this good news: a real-life workload benefits from this change.
I'll tune this patch a little to address the regression seen in schbench. I'd also like to mention
that I'm working with Mathieu on his proposal to make it easier for the wakee to choose its previous
CPU (similar to SIS_CACHE, but a little simpler), and we'll check how to make more platforms benefit
from this change:
https://lore.kernel.org/lkml/20231012203626.1298944-1-mathieu.desnoyers@efficios.com/
thanks,
Chenyu