Message-ID: <78508c06-e552-4022-8a4e-f777c15c7a90@intel.com>
Date: Wed, 26 Mar 2025 17:15:24 +0800
From: "Chen, Yu C" <yu.c.chen@...el.com>
To: K Prateek Nayak <kprateek.nayak@....com>, Peter Zijlstra
<peterz@...radead.org>
CC: <juri.lelli@...hat.com>, <vincent.guittot@...aro.org>,
<dietmar.eggemann@....com>, <rostedt@...dmis.org>, <bsegall@...gle.com>,
<mgorman@...e.de>, <vschneid@...hat.com>, <linux-kernel@...r.kernel.org>,
<tim.c.chen@...ux.intel.com>, <tglx@...utronix.de>, <len.brown@...el.com>,
<gautham.shenoy@....com>, <mingo@...nel.org>, <yu.chen.surf@...mail.com>
Subject: Re: [RFC][PATCH] sched: Cache aware load-balancing
Hi Prateek,
On 3/26/2025 2:18 PM, K Prateek Nayak wrote:
> Hello Peter, Chenyu,
>
> On 3/26/2025 12:14 AM, Peter Zijlstra wrote:
>> On Tue, Mar 25, 2025 at 11:19:52PM +0800, Chen, Yu C wrote:
>>>
>>> Hi Peter,
>>>
>>> Thanks for sending this out,
>>>
>>> On 3/25/2025 8:09 PM, Peter Zijlstra wrote:
>>>> Hi all,
>>>>
>>>> One of the many things on the eternal todo list has been finishing the
>>>> below hackery.
>>>>
>>>> It is an attempt at modelling cache affinity -- and while the patch
>>>> really only targets LLC, it could very well be extended to also
>>>> apply to
>>>> clusters (L2). Specifically any case of multiple cache domains inside a
>>>> node.
>>>>
>>>> Anyway, I wrote this about a year ago, and I mentioned this at the
>>>> recent OSPM conf where Gautham and Prateek expressed interest in
>>>> playing
>>>> with this code.
>>>>
>>>> So here goes, very rough and largely unproven code ahead :-)
>>>>
>>>> It applies to current tip/master, but I know it will fail the __percpu
>>>> validation that sits in -next, although that shouldn't be terribly hard
>>>> to fix up.
>>>>
>>>> As is, it only computes a CPU inside the LLC that has the highest
>>>> recent
>>>> runtime, this CPU is then used in the wake-up path to steer towards
>>>> this
>>>> LLC and in task_hot() to limit migrations away from it.
>>>>
>>>> More elaborate things could be done, notably there is an XXX in there
>>>> somewhere about finding the best LLC inside a NODE (interaction with
>>>> NUMA_BALANCING).
>>>>
>>>
>>> Besides the control provided by CONFIG_SCHED_CACHE, could we also
>>> introduce
>>> sched_feat(SCHED_CACHE) to manage this feature, facilitating dynamic
>>> adjustments? Similarly we can also introduce other sched_feats for load
>>> balancing and NUMA balancing for fine-grain control.
>>
>> We can do all sorts, but the very first thing is determining if this is
>> worth it at all. Because if we can't make this work at all, all those
>> things are a waste of time.
>>
>> This patch is not meant to be merged, it is meant for testing and
>> development. We need to first make it actually improve workloads. If it
>> then turns out it regresses workloads (likely, things always do), then
>> we can look at how to best do that.
>>
>
> Thank you for sharing the patch and the initial review from Chenyu
> pointing to issues that need fixing. I'll try to take a good look at it
> this week and see if I can improve some trivial benchmarks that regress
> currently with RFC as is.
>
> In its current form I think this suffers from the same problem as
> SIS_NODE where wakeups redirect to the same set of CPUs and a good deal
> of additional work is done without any benefit.
>
> I'll leave the results from my initial testing on the 3rd Generation
> EPYC platform below and will evaluate what is making the benchmarks
> unhappy. I'll return with more data when some of these benchmarks
> are not as unhappy as they are now.
>
> Thank you both for the RFC and the initial feedback. Following are
> the initial results for the RFC as is:
>
> ==================================================================
> Test : hackbench
> Units : Normalized time in seconds
> Interpretation: Lower is better
> Statistic : AMean
> ==================================================================
> Case: tip[pct imp](CV) sched_cache[pct imp](CV)
> 1-groups 1.00 [ -0.00](10.12) 1.01 [ -0.89]( 2.84)
> 2-groups 1.00 [ -0.00]( 6.92) 1.83 [-83.15]( 1.61)
> 4-groups 1.00 [ -0.00]( 3.14) 3.00 [-200.21]( 3.13)
> 8-groups 1.00 [ -0.00]( 1.35) 3.44 [-243.75]( 2.20)
> 16-groups 1.00 [ -0.00]( 1.32) 2.59 [-158.98]( 4.29)
>
>
> ==================================================================
> Test : tbench
> Units : Normalized throughput
> Interpretation: Higher is better
> Statistic : AMean
> ==================================================================
> Clients: tip[pct imp](CV) sched_cache[pct imp](CV)
> 1 1.00 [ 0.00]( 0.43) 0.96 [ -3.54]( 0.56)
> 2 1.00 [ 0.00]( 0.58) 0.99 [ -1.32]( 1.40)
> 4 1.00 [ 0.00]( 0.54) 0.98 [ -2.34]( 0.78)
> 8 1.00 [ 0.00]( 0.49) 0.96 [ -3.91]( 0.54)
> 16 1.00 [ 0.00]( 1.06) 0.97 [ -3.22]( 1.82)
> 32 1.00 [ 0.00]( 1.27) 0.95 [ -4.74]( 2.05)
> 64 1.00 [ 0.00]( 1.54) 0.93 [ -6.65]( 0.63)
> 128 1.00 [ 0.00]( 0.38) 0.93 [ -6.91]( 1.18)
> 256 1.00 [ 0.00]( 1.85) 0.99 [ -0.50]( 1.34)
> 512 1.00 [ 0.00]( 0.31) 0.98 [ -2.47]( 0.14)
> 1024 1.00 [ 0.00]( 0.19) 0.97 [ -3.06]( 0.39)
>
>
> ==================================================================
> Test : stream-10
> Units : Normalized Bandwidth, MB/s
> Interpretation: Higher is better
> Statistic : HMean
> ==================================================================
> Test: tip[pct imp](CV) sched_cache[pct imp](CV)
> Copy 1.00 [ 0.00](11.31) 0.34 [-65.89](72.77)
> Scale 1.00 [ 0.00]( 6.62) 0.32 [-68.09](72.49)
> Add 1.00 [ 0.00]( 7.06) 0.34 [-65.56](70.56)
> Triad 1.00 [ 0.00]( 8.91) 0.34 [-66.47](72.70)
>
>
> ==================================================================
> Test : stream-100
> Units : Normalized Bandwidth, MB/s
> Interpretation: Higher is better
> Statistic : HMean
> ==================================================================
> Test: tip[pct imp](CV) sched_cache[pct imp](CV)
> Copy 1.00 [ 0.00]( 2.01) 0.83 [-16.96](24.55)
> Scale 1.00 [ 0.00]( 1.49) 0.79 [-21.40](24.10)
> Add 1.00 [ 0.00]( 2.67) 0.79 [-21.33](25.39)
> Triad 1.00 [ 0.00]( 2.19) 0.81 [-19.19](25.55)
>
>
> ==================================================================
> Test : netperf
> Units : Normalized Throughput
> Interpretation: Higher is better
> Statistic : AMean
> ==================================================================
> Clients: tip[pct imp](CV) sched_cache[pct imp](CV)
> 1-clients 1.00 [ 0.00]( 1.43) 0.98 [ -2.22]( 0.26)
> 2-clients 1.00 [ 0.00]( 1.02) 0.97 [ -2.55]( 0.89)
> 4-clients 1.00 [ 0.00]( 0.83) 0.98 [ -2.27]( 0.46)
> 8-clients 1.00 [ 0.00]( 0.73) 0.98 [ -2.45]( 0.80)
> 16-clients 1.00 [ 0.00]( 0.97) 0.97 [ -2.90]( 0.88)
> 32-clients 1.00 [ 0.00]( 0.88) 0.95 [ -5.29]( 1.69)
> 64-clients 1.00 [ 0.00]( 1.49) 0.91 [ -8.70]( 1.95)
> 128-clients 1.00 [ 0.00]( 1.05) 0.92 [ -8.39]( 4.25)
> 256-clients 1.00 [ 0.00]( 3.85) 0.92 [ -8.33]( 2.45)
> 512-clients 1.00 [ 0.00](59.63) 0.92 [ -7.83](51.19)
>
>
> ==================================================================
> Test : schbench
> Units : Normalized 99th percentile latency in us
> Interpretation: Lower is better
> Statistic : Median
> ==================================================================
> #workers: tip[pct imp](CV) sched_cache[pct imp](CV)
> 1 1.00 [ -0.00]( 6.67) 0.38 [ 62.22] ( 5.88)
> 2 1.00 [ -0.00](10.18) 0.43 [ 56.52] ( 2.94)
> 4 1.00 [ -0.00]( 4.49) 0.60 [ 40.43] ( 5.52)
> 8 1.00 [ -0.00]( 6.68) 113.96 [-11296.23] (12.91)
> 16 1.00 [ -0.00]( 1.87) 359.34 [-35834.43] (20.02)
> 32 1.00 [ -0.00]( 4.01) 217.67 [-21667.03] ( 5.48)
> 64 1.00 [ -0.00]( 3.21) 97.43 [-9643.02] ( 4.61)
> 128 1.00 [ -0.00](44.13) 41.36 [-4036.10] ( 6.92)
> 256 1.00 [ -0.00](14.46) 2.69 [-169.31] ( 1.86)
> 512 1.00 [ -0.00]( 1.95) 1.89 [-89.22] ( 2.24)
>
>
> ==================================================================
> Test : new-schbench-requests-per-second
> Units : Normalized Requests per second
> Interpretation: Higher is better
> Statistic : Median
> ==================================================================
> #workers: tip[pct imp](CV) sched_cache[pct imp](CV)
> 1 1.00 [ 0.00]( 0.46) 0.96 [ -4.14]( 0.00)
> 2 1.00 [ 0.00]( 0.15) 0.95 [ -5.27]( 2.29)
> 4 1.00 [ 0.00]( 0.15) 0.88 [-12.01]( 0.46)
> 8 1.00 [ 0.00]( 0.15) 0.55 [-45.47]( 1.23)
> 16 1.00 [ 0.00]( 0.00) 0.54 [-45.62]( 0.50)
> 32 1.00 [ 0.00]( 3.40) 0.63 [-37.48]( 6.37)
> 64 1.00 [ 0.00]( 7.09) 0.67 [-32.73]( 0.59)
> 128 1.00 [ 0.00]( 0.00) 0.99 [ -0.76]( 0.34)
> 256 1.00 [ 0.00]( 1.12) 1.06 [ 6.32]( 1.55)
> 512 1.00 [ 0.00]( 0.22) 1.06 [ 6.08]( 0.92)
>
>
> ==================================================================
> Test : new-schbench-wakeup-latency
> Units : Normalized 99th percentile latency in us
> Interpretation: Lower is better
> Statistic : Median
> ==================================================================
> #workers: tip[pct imp](CV) sched_cache[pct imp](CV)
> 1 1.00 [ -0.00](19.72) 0.85 [ 15.38] ( 8.13)
> 2 1.00 [ -0.00](15.96) 1.09 [ -9.09] (18.20)
> 4 1.00 [ -0.00]( 3.87) 1.00 [ -0.00] ( 0.00)
> 8 1.00 [ -0.00]( 8.15) 118.17 [-11716.67] ( 0.58)
> 16 1.00 [ -0.00]( 3.87) 146.62 [-14561.54] ( 4.64)
> 32 1.00 [ -0.00](12.99) 141.60 [-14060.00] ( 5.64)
> 64 1.00 [ -0.00]( 6.20) 78.62 [-7762.50] ( 1.79)
> 128 1.00 [ -0.00]( 0.96) 11.36 [-1036.08] ( 3.41)
> 256 1.00 [ -0.00]( 2.76) 1.11 [-11.22] ( 3.28)
> 512 1.00 [ -0.00]( 0.20) 1.21 [-20.81] ( 0.91)
>
>
> ==================================================================
> Test : new-schbench-request-latency
> Units : Normalized 99th percentile latency in us
> Interpretation: Lower is better
> Statistic : Median
> ==================================================================
> #workers: tip[pct imp](CV) sched_cache[pct imp](CV)
> 1 1.00 [ -0.00]( 1.07) 1.11 [-10.66] ( 2.76)
> 2 1.00 [ -0.00]( 0.14) 1.20 [-20.40] ( 1.73)
> 4 1.00 [ -0.00]( 1.39) 2.04 [-104.20] ( 0.96)
> 8 1.00 [ -0.00]( 0.36) 3.94 [-294.20] ( 2.85)
> 16 1.00 [ -0.00]( 1.18) 4.56 [-356.16] ( 1.19)
> 32 1.00 [ -0.00]( 8.42) 3.02 [-201.67] ( 8.93)
> 64 1.00 [ -0.00]( 4.85) 1.51 [-51.38] ( 0.80)
> 128 1.00 [ -0.00]( 0.28) 1.83 [-82.77] ( 1.21)
> 256 1.00 [ -0.00](10.52) 1.43 [-43.11] (10.67)
> 512 1.00 [ -0.00]( 0.69) 1.25 [-24.96] ( 6.24)
>
>
> ==================================================================
> Test : Various longer running benchmarks
> Units : %diff in throughput reported
> Interpretation: Higher is better
> Statistic : Median
> ==================================================================
> Benchmarks: %diff
> ycsb-cassandra -10.70%
> ycsb-mongodb -13.66%
>
> deathstarbench-1x 13.87%
> deathstarbench-2x 1.70%
> deathstarbench-3x -8.44%
> deathstarbench-6x -3.12%
>
> hammerdb+mysql 16VU -33.50%
> hammerdb+mysql 64VU -33.22%
>
> ---
>
> I'm planning on taking hackbench and schbench as two extreme cases for
> throughput and tail latency and later look at Stream from a "high
> bandwidth, don't consolidate" standpoint. I hope once those cases
> aren't as much in the reds, the larger benchmarks will be happier too.
>
Thanks for running the tests. I think hackbench/schbench would be good
benchmarks to start with. I remember that you and Gautham mentioned at
LPC 2021 or 2022 that schbench prefers to be aggregated in a single LLC.
I ran a schbench test using mmtests on a Xeon server that has 4 NUMA
nodes, each with 80 cores (SMT disabled). The numa=off option was
appended to the boot command line, so the 4 "LLCs" all end up within a
single node.
                                    BASELINE             SCHED_CACHE
Lat 50.0th-qrtle-1 8.00 ( 0.00%) 5.00 ( 37.50%)
Lat 90.0th-qrtle-1 9.00 ( 0.00%) 5.00 ( 44.44%)
Lat 99.0th-qrtle-1 13.00 ( 0.00%) 10.00 ( 23.08%)
Lat 99.9th-qrtle-1 21.00 ( 0.00%) 19.00 ( 9.52%)*
Lat 20.0th-qrtle-1 404.00 ( 0.00%) 411.00 ( -1.73%)
Lat 50.0th-qrtle-2 8.00 ( 0.00%) 5.00 ( 37.50%)
Lat 90.0th-qrtle-2 11.00 ( 0.00%) 8.00 ( 27.27%)
Lat 99.0th-qrtle-2 16.00 ( 0.00%) 11.00 ( 31.25%)
Lat 99.9th-qrtle-2 27.00 ( 0.00%) 17.00 ( 37.04%)*
Lat 20.0th-qrtle-2 823.00 ( 0.00%) 821.00 ( 0.24%)
Lat 50.0th-qrtle-4 10.00 ( 0.00%) 5.00 ( 50.00%)
Lat 90.0th-qrtle-4 12.00 ( 0.00%) 6.00 ( 50.00%)
Lat 99.0th-qrtle-4 18.00 ( 0.00%) 9.00 ( 50.00%)
Lat 99.9th-qrtle-4 29.00 ( 0.00%) 16.00 ( 44.83%)*
Lat 20.0th-qrtle-4 1650.00 ( 0.00%) 1598.00 ( 3.15%)
Lat 50.0th-qrtle-8 9.00 ( 0.00%) 4.00 ( 55.56%)
Lat 90.0th-qrtle-8 11.00 ( 0.00%) 6.00 ( 45.45%)
Lat 99.0th-qrtle-8 16.00 ( 0.00%) 9.00 ( 43.75%)
Lat 99.9th-qrtle-8 28.00 ( 0.00%) 188.00 (-571.43%)*
Lat 20.0th-qrtle-8 3316.00 ( 0.00%) 3100.00 ( 6.51%)
Lat 50.0th-qrtle-16 10.00 ( 0.00%) 5.00 ( 50.00%)
Lat 90.0th-qrtle-16 13.00 ( 0.00%) 7.00 ( 46.15%)
Lat 99.0th-qrtle-16 19.00 ( 0.00%) 12.00 ( 36.84%)
Lat 99.9th-qrtle-16 28.00 ( 0.00%) 2034.00 (-7164.29%)*
Lat 20.0th-qrtle-16 6632.00 ( 0.00%) 5800.00 ( 12.55%)
Lat 50.0th-qrtle-32 7.00 ( 0.00%) 12.00 ( -71.43%)
Lat 90.0th-qrtle-32 10.00 ( 0.00%) 62.00 (-520.00%)
Lat 99.0th-qrtle-32 14.00 ( 0.00%) 841.00 (-5907.14%)
Lat 99.9th-qrtle-32 23.00 ( 0.00%) 1862.00 (-7995.65%)*
Lat 20.0th-qrtle-32 13264.00 ( 0.00%) 10608.00 ( 20.02%)
Lat 50.0th-qrtle-64 7.00 ( 0.00%) 64.00 (-814.29%)
Lat 90.0th-qrtle-64 12.00 ( 0.00%) 709.00 (-5808.33%)
Lat 99.0th-qrtle-64 18.00 ( 0.00%) 2260.00 (-12455.56%)
Lat 99.9th-qrtle-64 26.00 ( 0.00%) 3572.00 (-13638.46%)*
Lat 20.0th-qrtle-64 26528.00 ( 0.00%) 14064.00 ( 46.98%)
Lat 50.0th-qrtle-128 7.00 ( 0.00%) 115.00 (-1542.86%)
Lat 90.0th-qrtle-128 11.00 ( 0.00%) 1626.00 (-14681.82%)
Lat 99.0th-qrtle-128 17.00 ( 0.00%) 4472.00 (-26205.88%)
Lat 99.9th-qrtle-128 27.00 ( 0.00%) 8088.00 (-29855.56%)*
Lat 20.0th-qrtle-128 53184.00 ( 0.00%) 17312.00 ( 67.45%)
Lat 50.0th-qrtle-256 172.00 ( 0.00%) 255.00 ( -48.26%)
Lat 90.0th-qrtle-256 2092.00 ( 0.00%) 1482.00 ( 29.16%)
Lat 99.0th-qrtle-256 2684.00 ( 0.00%) 3148.00 ( -17.29%)
Lat 99.9th-qrtle-256 4504.00 ( 0.00%) 6008.00 ( -33.39%)*
Lat 20.0th-qrtle-256 53056.00 ( 0.00%) 48064.00 ( 9.41%)
Lat 50.0th-qrtle-319 375.00 ( 0.00%) 478.00 ( -27.47%)
Lat 90.0th-qrtle-319 2420.00 ( 0.00%) 2244.00 ( 7.27%)
Lat 99.0th-qrtle-319 4552.00 ( 0.00%) 4456.00 ( 2.11%)
Lat 99.9th-qrtle-319 6072.00 ( 0.00%) 7656.00 ( -26.09%)*
Lat 20.0th-qrtle-319 47936.00 ( 0.00%) 47808.00 ( 0.27%)
We can see that when the system is under-loaded, the 99.9th percentile
wakeup latency improves. But when the system gets busier, say from 8 to
319 threads, the wakeup latency suffers.
The following change, intended to avoid task migration/stacking, could
mitigate the issue:
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index cddd67100a91..a492463aed71 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8801,6 +8801,7 @@ static long __migrate_degrades_locality(struct task_struct *p, int src_cpu, int
 static int select_cache_cpu(struct task_struct *p, int prev_cpu)
 {
 	struct mm_struct *mm = p->mm;
+	struct sched_domain *sd;
 	int cpu;
 
 	if (!sched_feat(SCHED_CACHE))
@@ -8813,6 +8814,8 @@ static int select_cache_cpu(struct task_struct *p, int prev_cpu)
 	if (cpu < 0)
 		return prev_cpu;
 
+	if (cpus_share_cache(prev_cpu, cpu))
+		return prev_cpu;
 
 	if (static_branch_likely(&sched_numa_balancing) &&
 	    __migrate_degrades_locality(p, prev_cpu, cpu, false) > 0) {
@@ -8822,6 +8825,10 @@ static int select_cache_cpu(struct task_struct *p, int prev_cpu)
 		return prev_cpu;
 	}
 
+	sd = rcu_dereference(per_cpu(sd_llc, cpu));
+	if (likely(sd))
+		return cpumask_any(sched_domain_span(sd));
+
 	return cpu;
 }
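
The first hunk keeps the task on prev_cpu whenever it already shares the
cache with the preferred CPU, so no cross-LLC migration is triggered; the
second hunk returns a CPU from the LLC's sched-domain span rather than
always the single preferred CPU, to reduce stacking on that one CPU. With
this change applied, schbench reports: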
                                 BASELINE_sc          SCHED_CACHE_sc
Lat 50.0th-qrtle-1 5.00 ( 0.00%) 5.00 ( 0.00%)
Lat 90.0th-qrtle-1 8.00 ( 0.00%) 5.00 ( 37.50%)
Lat 99.0th-qrtle-1 10.00 ( 0.00%) 10.00 ( 0.00%)
Lat 99.9th-qrtle-1 20.00 ( 0.00%) 20.00 ( 0.00%)*
Lat 20.0th-qrtle-1 409.00 ( 0.00%) 406.00 ( 0.73%)
Lat 50.0th-qrtle-2 8.00 ( 0.00%) 4.00 ( 50.00%)
Lat 90.0th-qrtle-2 11.00 ( 0.00%) 5.00 ( 54.55%)
Lat 99.0th-qrtle-2 16.00 ( 0.00%) 11.00 ( 31.25%)
Lat 99.9th-qrtle-2 29.00 ( 0.00%) 16.00 ( 44.83%)*
Lat 20.0th-qrtle-2 819.00 ( 0.00%) 825.00 ( -0.73%)
Lat 50.0th-qrtle-4 10.00 ( 0.00%) 4.00 ( 60.00%)
Lat 90.0th-qrtle-4 12.00 ( 0.00%) 4.00 ( 66.67%)
Lat 99.0th-qrtle-4 18.00 ( 0.00%) 6.00 ( 66.67%)
Lat 99.9th-qrtle-4 30.00 ( 0.00%) 15.00 ( 50.00%)*
Lat 20.0th-qrtle-4 1658.00 ( 0.00%) 1670.00 ( -0.72%)
Lat 50.0th-qrtle-8 9.00 ( 0.00%) 3.00 ( 66.67%)
Lat 90.0th-qrtle-8 11.00 ( 0.00%) 4.00 ( 63.64%)
Lat 99.0th-qrtle-8 16.00 ( 0.00%) 6.00 ( 62.50%)
Lat 99.9th-qrtle-8 29.00 ( 0.00%) 13.00 ( 55.17%)*
Lat 20.0th-qrtle-8 3308.00 ( 0.00%) 3340.00 ( -0.97%)
Lat 50.0th-qrtle-16 9.00 ( 0.00%) 4.00 ( 55.56%)
Lat 90.0th-qrtle-16 12.00 ( 0.00%) 4.00 ( 66.67%)
Lat 99.0th-qrtle-16 18.00 ( 0.00%) 6.00 ( 66.67%)
Lat 99.9th-qrtle-16 31.00 ( 0.00%) 12.00 ( 61.29%)*
Lat 20.0th-qrtle-16 6616.00 ( 0.00%) 6680.00 ( -0.97%)
Lat 50.0th-qrtle-32 8.00 ( 0.00%) 4.00 ( 50.00%)
Lat 90.0th-qrtle-32 11.00 ( 0.00%) 5.00 ( 54.55%)
Lat 99.0th-qrtle-32 17.00 ( 0.00%) 8.00 ( 52.94%)
Lat 99.9th-qrtle-32 27.00 ( 0.00%) 11.00 ( 59.26%)*
Lat 20.0th-qrtle-32 13296.00 ( 0.00%) 13328.00 ( -0.24%)
Lat 50.0th-qrtle-64 9.00 ( 0.00%) 46.00 (-411.11%)
Lat 90.0th-qrtle-64 14.00 ( 0.00%) 1198.00 (-8457.14%)
Lat 99.0th-qrtle-64 20.00 ( 0.00%) 2252.00 (-11160.00%)
Lat 99.9th-qrtle-64 31.00 ( 0.00%) 2844.00 (-9074.19%)*
Lat 20.0th-qrtle-64 26528.00 ( 0.00%) 15504.00 ( 41.56%)
Lat 50.0th-qrtle-128 7.00 ( 0.00%) 26.00 (-271.43%)
Lat 90.0th-qrtle-128 11.00 ( 0.00%) 2244.00 (-20300.00%)
Lat 99.0th-qrtle-128 17.00 ( 0.00%) 4488.00 (-26300.00%)
Lat 99.9th-qrtle-128 27.00 ( 0.00%) 5752.00 (-21203.70%)*
Lat 20.0th-qrtle-128 53184.00 ( 0.00%) 24544.00 ( 53.85%)
Lat 50.0th-qrtle-256 172.00 ( 0.00%) 135.00 ( 21.51%)
Lat 90.0th-qrtle-256 2084.00 ( 0.00%) 2022.00 ( 2.98%)
Lat 99.0th-qrtle-256 2780.00 ( 0.00%) 3908.00 ( -40.58%)
Lat 99.9th-qrtle-256 4536.00 ( 0.00%) 5832.00 ( -28.57%)*
Lat 20.0th-qrtle-256 53568.00 ( 0.00%) 51904.00 ( 3.11%)
Lat 50.0th-qrtle-319 369.00 ( 0.00%) 358.00 ( 2.98%)
Lat 90.0th-qrtle-319 2428.00 ( 0.00%) 2436.00 ( -0.33%)
Lat 99.0th-qrtle-319 4552.00 ( 0.00%) 4664.00 ( -2.46%)
Lat 99.9th-qrtle-319 6104.00 ( 0.00%) 6632.00 ( -8.65%)*
Lat 20.0th-qrtle-319 48192.00 ( 0.00%) 48832.00 ( -1.33%)
We can see wakeup latency improvements over a wider range of thread
counts. But there is still a regression starting from 64 threads -
perhaps the benefit of LLC locality is offset by task stacking on a
single LLC. One possible direction I'm thinking of is to take a snapshot
of the LLC status during load balance and check whether the LLC is
overloaded; if it is, do not enable this LLC aggregation during task
wakeup - that is, do the check in the load balancer, which runs less
frequently. A rough sketch of that idea is below.
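
As a very rough, untested sketch of that direction (the llc_overloaded
field in struct sched_domain_shared and the update_llc_overloaded() /
llc_overloaded() helpers are made-up names here, and the call site in the
periodic load balancer is only indicative):

/* Hypothetical new field in struct sched_domain_shared: int llc_overloaded; */

/*
 * Take a snapshot of how busy this LLC is. Intended to be called from
 * the periodic load balancer, so the cost is paid at a lower frequency
 * than task wakeup.
 */
static void update_llc_overloaded(int cpu)
{
	struct sched_domain *sd;
	struct sched_domain_shared *sds;
	unsigned int nr_running = 0, nr_cpus = 0;
	int i;

	sd = rcu_dereference(per_cpu(sd_llc, cpu));
	sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
	if (!sd || !sds)
		return;

	for_each_cpu(i, sched_domain_span(sd)) {
		nr_running += cpu_rq(i)->nr_running;
		nr_cpus++;
	}

	/* Consider the LLC overloaded once runnable tasks exceed CPUs. */
	WRITE_ONCE(sds->llc_overloaded, nr_running > nr_cpus);
}

/*
 * In select_cache_cpu(), bail out of the LLC aggregation when the
 * snapshot says the preferred LLC is already overloaded.
 */
static bool llc_overloaded(int cpu)
{
	struct sched_domain_shared *sds;

	sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));

	return sds && READ_ONCE(sds->llc_overloaded);
}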
thanks,
Chenyu