linux-kernel - Re: [PATCH 0/2] Introduce SIS_CACHE to choose previous CPU during task wakeup

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <3f98806b-fd74-cfba-b48c-2526109d10a3@linux.ibm.com>
Date:   Tue, 17 Oct 2023 15:19:24 +0530
From:   Madadi Vineeth Reddy <vineethr@...ux.ibm.com>
To:     Chen Yu <yu.c.chen@...el.com>,
        Peter Zijlstra <peterz@...radead.org>,
        Mathieu Desnoyers <mathieu.desnoyers@...icios.com>,
        Ingo Molnar <mingo@...hat.com>,
        Vincent Guittot <vincent.guittot@...aro.org>,
        Juri Lelli <juri.lelli@...hat.com>
Cc:     Tim Chen <tim.c.chen@...el.com>, Aaron Lu <aaron.lu@...el.com>,
        Dietmar Eggemann <dietmar.eggemann@....com>,
        Steven Rostedt <rostedt@...dmis.org>,
        Ben Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>,
        Daniel Bristot de Oliveira <bristot@...hat.com>,
        Valentin Schneider <vschneid@...hat.com>,
        K Prateek Nayak <kprateek.nayak@....com>,
        "Gautham R . Shenoy" <gautham.shenoy@....com>,
        linux-kernel@...r.kernel.org, Chen Yu <yu.chen.surf@...il.com>
Subject: Re: [PATCH 0/2] Introduce SIS_CACHE to choose previous CPU during
 task wakeup

Hi Chen Yu,

On 26/09/23 10:40, Chen Yu wrote:
> RFC -> v1:
> - drop RFC
> - Only record the short sleeping time for each task, to better honor the
>   burst sleeping tasks. (Mathieu Desnoyers)
> - Keep the forward movement monotonic for runqueue's cache-hot timeout value.
>   (Mathieu Desnoyers, Aaron Lu)
> - Introduce a new helper function cache_hot_cpu() that considers
>   rq->cache_hot_timeout. (Aaron Lu)
> - Add analysis of why inhibiting task migration could bring better throughput
>   for some benchmarks. (Gautham R. Shenoy)
> - Choose the first cache-hot CPU, if all idle CPUs are cache-hot in
>   select_idle_cpu(). To avoid possible task stacking on the waker's CPU.
>   (K Prateek Nayak)
> 
> Thanks for your comments and review!
> 
> ----------------------------------------------------------------------

Regarding making the scan for finding an idle cpu longer vs cache benefits, 
I ran some benchmarks.

Tested the patch on power system with 12 cores. Total of 96 CPU's.
System has two NUMA nodes.

Below are some of the benchmark results

schbench 99.0th latency (lower is better)
========
case            load        	baseline[pct imp](std%)       SIS_CACHE[pct imp]( std%)
normal          1-mthreads      1.00 [ 0.00]( 3.66)            1.00 [  0.00]( 1.71)
normal          2-mthreads      1.00 [ 0.00]( 4.55)            1.02 [ -2.00]( 3.00)
normal          4-mthreads      1.00 [ 0.00]( 4.77)            0.96 [ +4.00]( 4.27)
normal          6-mthreads      1.00 [ 0.00]( 60.37)           2.66 [ -166.00]( 23.67)


schbench results are showing that there is not much impact in wakeup latencies due to more iterations 
in search for an idle cpu in the select_idle_cpu code path and interestingly numbers are slightly better 
for SIS_CACHE in case of 4-mthreads. I think we can ignore the last case due to huge run to run variations.

producer_consumer avg time/access (lower is better)
========
loads per consumer iteration   baseline[pct imp](std%)         SIS_CACHE[pct imp]( std%)
5                  		1.00 [ 0.00]( 0.00)            0.87 [ +13.0]( 1.92)
20                   		1.00 [ 0.00]( 0.00)            0.92 [ +8.00]( 0.00)
50                    		1.00 [ 0.00]( 0.00)            1.00 [  0.00]( 0.00)
100                    		1.00 [ 0.00]( 0.00)            1.00 [  0.00]( 0.00)

The main goal of the patch of improving cache locality is reflected as SIS_CACHE only improves in this workload, 
mainly when loads per consumer iteration is lower.

hackbench normalized time in seconds (lower is better)
========
case            load        baseline[pct imp](std%)         SIS_CACHE[pct imp]( std%)
process-pipe    1-groups     1.00 [ 0.00]( 1.50)            1.02 [ -2.00]( 3.36)
process-pipe    2-groups     1.00 [ 0.00]( 4.76)            0.99 [ +1.00]( 5.68)
process-sockets 1-groups     1.00 [ 0.00]( 2.56)            1.00 [  0.00]( 0.86)
process-sockets 2-groups     1.00 [ 0.00]( 0.50)            0.99 [ +1.00]( 0.96)
threads-pipe    1-groups     1.00 [ 0.00]( 3.87)            0.71 [ +29.0]( 3.56)
threads-pipe    2-groups     1.00 [ 0.00]( 1.60)            0.97 [ +3.00]( 3.44)
threads-sockets 1-groups     1.00 [ 0.00]( 7.65)            0.99 [ +1.00]( 1.05)
threads-sockets 2-groups     1.00 [ 0.00]( 3.12)            1.03 [ -3.00]( 1.70)

hackbench results are similar in both kernels except the case where there is an improvement of
29% in case of threads-pipe case with 1 groups.

Daytrader throughput (higher is better)
========

As per Ingo suggestion, ran a real life workload daytrader

baseline:
=================================================================================== 
 Instance      1
     Throughputs         Ave. Resp. Time   Min. Resp. Time   Max. Resp. Time
  ================       ===============   ===============   ===============
       10124.5 			    2 		    0 		   3970

SIS_CACHE:
===================================================================================
 Instance      1
     Throughputs         Ave. Resp. Time   Min. Resp. Time   Max. Resp. Time
  ================       ===============   ===============   ===============
       10319.5                       2               0              5771

In the above run, daytrader perfomance was 2% better in case of SIS_CACHE.

Thanks and Regards 
Madadi Vineeth Reddy