Message-ID: <78508c06-e552-4022-8a4e-f777c15c7a90@intel.com>
Date: Wed, 26 Mar 2025 17:15:24 +0800
From: "Chen, Yu C" <yu.c.chen@...el.com>
To: K Prateek Nayak <kprateek.nayak@....com>, Peter Zijlstra
<peterz@...radead.org>
CC: <juri.lelli@...hat.com>, <vincent.guittot@...aro.org>,
<dietmar.eggemann@....com>, <rostedt@...dmis.org>, <bsegall@...gle.com>,
<mgorman@...e.de>, <vschneid@...hat.com>, <linux-kernel@...r.kernel.org>,
<tim.c.chen@...ux.intel.com>, <tglx@...utronix.de>, <len.brown@...el.com>,
<gautham.shenoy@....com>, <mingo@...nel.org>, <yu.chen.surf@...mail.com>
Subject: Re: [RFC][PATCH] sched: Cache aware load-balancing
Hi Prateek,
On 3/26/2025 2:18 PM, K Prateek Nayak wrote:
> Hello Peter, Chenyu,
>
> On 3/26/2025 12:14 AM, Peter Zijlstra wrote:
>> On Tue, Mar 25, 2025 at 11:19:52PM +0800, Chen, Yu C wrote:
>>>
>>> Hi Peter,
>>>
>>> Thanks for sending this out,
>>>
>>> On 3/25/2025 8:09 PM, Peter Zijlstra wrote:
>>>> Hi all,
>>>>
>>>> One of the many things on the eternal todo list has been finishing the
>>>> below hackery.
>>>>
>>>> It is an attempt at modelling cache affinity -- and while the patch
>>>> really only targets LLC, it could very well be extended to also
>>>> apply to
>>>> clusters (L2). Specifically any case of multiple cache domains inside a
>>>> node.
>>>>
>>>> Anyway, I wrote this about a year ago, and I mentioned this at the
>>>> recent OSPM conf where Gautham and Prateek expressed interest in
>>>> playing
>>>> with this code.
>>>>
>>>> So here goes, very rough and largely unproven code ahead :-)
>>>>
>>>> It applies to current tip/master, but I know it will fail the __percpu
>>>> validation that sits in -next, although that shouldn't be terribly hard
>>>> to fix up.
>>>>
>>>> As is, it only computes a CPU inside the LLC that has the highest
>>>> recent
>>>> runtime, this CPU is then used in the wake-up path to steer towards
>>>> this
>>>> LLC and in task_hot() to limit migrations away from it.
>>>>
>>>> More elaborate things could be done, notably there is an XXX in there
>>>> somewhere about finding the best LLC inside a NODE (interaction with
>>>> NUMA_BALANCING).
>>>>
>>>
>>> Besides the control provided by CONFIG_SCHED_CACHE, could we also
>>> introduce
>>> sched_feat(SCHED_CACHE) to manage this feature, facilitating dynamic
>>> adjustments? Similarly we can also introduce other sched_feats for load
>>> balancing and NUMA balancing for fine-grain control.
>>
>> We can do all sorts, but the very first thing is determining if this is
>> worth it at all. Because if we can't make this work at all, all those
>> things are a waste of time.
>>
>> This patch is not meant to be merged, it is meant for testing and
>> development. We need to first make it actually improve workloads. If it
>> then turns out it regresses workloads (likely, things always do), then
>> we can look at how to best do that.
>>
>
> Thank you for sharing the patch and the initial review from Chenyu
> pointing to issues that need fixing. I'll try to take a good look at it
> this week and see if I can improve some trivial benchmarks that regress
> currently with RFC as is.
>
> In its current form I think this suffers from the same problem as
> SIS_NODE where wakeups redirect to the same set of CPUs and a good deal
> of additional work is done without any benefit.
>
> I'll leave the results from my initial testing on the 3rd Generation
> EPYC platform below and will evaluate what is making the benchmarks
> unhappy. I'll return with more data when some of these benchmarks
> are not as unhappy as they are now.
>
> Thank you both for the RFC and the initial feedback. Following are
> the initial results for the RFC as is:
>
> ==================================================================
> Test : hackbench
> Units : Normalized time in seconds
> Interpretation: Lower is better
> Statistic : AMean
> ==================================================================
> Case: tip[pct imp](CV) sched_cache[pct imp](CV)
> 1-groups 1.00 [ -0.00](10.12) 1.01 [ -0.89]( 2.84)
> 2-groups 1.00 [ -0.00]( 6.92) 1.83 [-83.15]( 1.61)
> 4-groups 1.00 [ -0.00]( 3.14) 3.00 [-200.21]( 3.13)
> 8-groups 1.00 [ -0.00]( 1.35) 3.44 [-243.75]( 2.20)
> 16-groups 1.00 [ -0.00]( 1.32) 2.59 [-158.98]( 4.29)
>
>
> ==================================================================
> Test : tbench
> Units : Normalized throughput
> Interpretation: Higher is better
> Statistic : AMean
> ==================================================================
> Clients: tip[pct imp](CV) sched_cache[pct imp](CV)
> 1 1.00 [ 0.00]( 0.43) 0.96 [ -3.54]( 0.56)
> 2 1.00 [ 0.00]( 0.58) 0.99 [ -1.32]( 1.40)
> 4 1.00 [ 0.00]( 0.54) 0.98 [ -2.34]( 0.78)
> 8 1.00 [ 0.00]( 0.49) 0.96 [ -3.91]( 0.54)
> 16 1.00 [ 0.00]( 1.06) 0.97 [ -3.22]( 1.82)
> 32 1.00 [ 0.00]( 1.27) 0.95 [ -4.74]( 2.05)
> 64 1.00 [ 0.00]( 1.54) 0.93 [ -6.65]( 0.63)
> 128 1.00 [ 0.00]( 0.38) 0.93 [ -6.91]( 1.18)
> 256 1.00 [ 0.00]( 1.85) 0.99 [ -0.50]( 1.34)
> 512 1.00 [ 0.00]( 0.31) 0.98 [ -2.47]( 0.14)
> 1024 1.00 [ 0.00]( 0.19) 0.97 [ -3.06]( 0.39)
>
>
> ==================================================================
> Test : stream-10
> Units : Normalized Bandwidth, MB/s
> Interpretation: Higher is better
> Statistic : HMean
> ==================================================================
> Test: tip[pct imp](CV) sched_cache[pct imp](CV)
> Copy 1.00 [ 0.00](11.31) 0.34 [-65.89](72.77)
> Scale 1.00 [ 0.00]( 6.62) 0.32 [-68.09](72.49)
> Add 1.00 [ 0.00]( 7.06) 0.34 [-65.56](70.56)
> Triad 1.00 [ 0.00]( 8.91) 0.34 [-66.47](72.70)
>
>
> ==================================================================
> Test : stream-100
> Units : Normalized Bandwidth, MB/s
> Interpretation: Higher is better
> Statistic : HMean
> ==================================================================
> Test: tip[pct imp](CV) sched_cache[pct imp](CV)
> Copy 1.00 [ 0.00]( 2.01) 0.83 [-16.96](24.55)
> Scale 1.00 [ 0.00]( 1.49) 0.79 [-21.40](24.10)
> Add 1.00 [ 0.00]( 2.67) 0.79 [-21.33](25.39)
> Triad 1.00 [ 0.00]( 2.19) 0.81 [-19.19](25.55)
>
>
> ==================================================================
> Test : netperf
> Units : Normalized Throughput
> Interpretation: Higher is better
> Statistic : AMean
> ==================================================================
> Clients: tip[pct imp](CV) sched_cache[pct imp](CV)
> 1-clients 1.00 [ 0.00]( 1.43) 0.98 [ -2.22]( 0.26)
> 2-clients 1.00 [ 0.00]( 1.02) 0.97 [ -2.55]( 0.89)
> 4-clients 1.00 [ 0.00]( 0.83) 0.98 [ -2.27]( 0.46)
> 8-clients 1.00 [ 0.00]( 0.73) 0.98 [ -2.45]( 0.80)
> 16-clients 1.00 [ 0.00]( 0.97) 0.97 [ -2.90]( 0.88)
> 32-clients 1.00 [ 0.00]( 0.88) 0.95 [ -5.29]( 1.69)
> 64-clients 1.00 [ 0.00]( 1.49) 0.91 [ -8.70]( 1.95)
> 128-clients 1.00 [ 0.00]( 1.05) 0.92 [ -8.39]( 4.25)
> 256-clients 1.00 [ 0.00]( 3.85) 0.92 [ -8.33]( 2.45)
> 512-clients 1.00 [ 0.00](59.63) 0.92 [ -7.83](51.19)
>
>
> ==================================================================
> Test : schbench
> Units : Normalized 99th percentile latency in us
> Interpretation: Lower is better
> Statistic : Median
> ==================================================================
> #workers: tip[pct imp](CV) sched_cache[pct imp](CV)
> 1 1.00 [ -0.00]( 6.67) 0.38 [ 62.22] ( 5.88)
> 2 1.00 [ -0.00](10.18) 0.43 [ 56.52] ( 2.94)
> 4 1.00 [ -0.00]( 4.49) 0.60 [ 40.43] ( 5.52)
> 8 1.00 [ -0.00]( 6.68) 113.96 [-11296.23] (12.91)
> 16 1.00 [ -0.00]( 1.87) 359.34 [-35834.43] (20.02)
> 32 1.00 [ -0.00]( 4.01) 217.67 [-21667.03] ( 5.48)
> 64 1.00 [ -0.00]( 3.21) 97.43 [-9643.02] ( 4.61)
> 128 1.00 [ -0.00](44.13) 41.36 [-4036.10] ( 6.92)
> 256 1.00 [ -0.00](14.46) 2.69 [-169.31] ( 1.86)
> 512 1.00 [ -0.00]( 1.95) 1.89 [-89.22] ( 2.24)
>
>
> ==================================================================
> Test : new-schbench-requests-per-second
> Units : Normalized Requests per second
> Interpretation: Higher is better
> Statistic : Median
> ==================================================================
> #workers: tip[pct imp](CV) sched_cache[pct imp](CV)
> 1 1.00 [ 0.00]( 0.46) 0.96 [ -4.14]( 0.00)
> 2 1.00 [ 0.00]( 0.15) 0.95 [ -5.27]( 2.29)
> 4 1.00 [ 0.00]( 0.15) 0.88 [-12.01]( 0.46)
> 8 1.00 [ 0.00]( 0.15) 0.55 [-45.47]( 1.23)
> 16 1.00 [ 0.00]( 0.00) 0.54 [-45.62]( 0.50)
> 32 1.00 [ 0.00]( 3.40) 0.63 [-37.48]( 6.37)
> 64 1.00 [ 0.00]( 7.09) 0.67 [-32.73]( 0.59)
> 128 1.00 [ 0.00]( 0.00) 0.99 [ -0.76]( 0.34)
> 256 1.00 [ 0.00]( 1.12) 1.06 [ 6.32]( 1.55)
> 512 1.00 [ 0.00]( 0.22) 1.06 [ 6.08]( 0.92)
>
>
> ==================================================================
> Test : new-schbench-wakeup-latency
> Units : Normalized 99th percentile latency in us
> Interpretation: Lower is better
> Statistic : Median
> ==================================================================
> #workers: tip[pct imp](CV) sched_cache[pct imp](CV)
> 1 1.00 [ -0.00](19.72) 0.85 [ 15.38] ( 8.13)
> 2 1.00 [ -0.00](15.96) 1.09 [ -9.09] (18.20)
> 4 1.00 [ -0.00]( 3.87) 1.00 [ -0.00] ( 0.00)
> 8 1.00 [ -0.00]( 8.15) 118.17 [-11716.67] ( 0.58)
> 16 1.00 [ -0.00]( 3.87) 146.62 [-14561.54] ( 4.64)
> 32 1.00 [ -0.00](12.99) 141.60 [-14060.00] ( 5.64)
> 64 1.00 [ -0.00]( 6.20) 78.62 [-7762.50] ( 1.79)
> 128 1.00 [ -0.00]( 0.96) 11.36 [-1036.08] ( 3.41)
> 256 1.00 [ -0.00]( 2.76) 1.11 [-11.22] ( 3.28)
> 512 1.00 [ -0.00]( 0.20) 1.21 [-20.81] ( 0.91)
>
>
> ==================================================================
> Test : new-schbench-request-latency
> Units : Normalized 99th percentile latency in us
> Interpretation: Lower is better
> Statistic : Median
> ==================================================================
> #workers: tip[pct imp](CV) sched_cache[pct imp](CV)
> 1 1.00 [ -0.00]( 1.07) 1.11 [-10.66] ( 2.76)
> 2 1.00 [ -0.00]( 0.14) 1.20 [-20.40] ( 1.73)
> 4 1.00 [ -0.00]( 1.39) 2.04 [-104.20] ( 0.96)
> 8 1.00 [ -0.00]( 0.36) 3.94 [-294.20] ( 2.85)
> 16 1.00 [ -0.00]( 1.18) 4.56 [-356.16] ( 1.19)
> 32 1.00 [ -0.00]( 8.42) 3.02 [-201.67] ( 8.93)
> 64 1.00 [ -0.00]( 4.85) 1.51 [-51.38] ( 0.80)
> 128 1.00 [ -0.00]( 0.28) 1.83 [-82.77] ( 1.21)
> 256 1.00 [ -0.00](10.52) 1.43 [-43.11] (10.67)
> 512 1.00 [ -0.00]( 0.69) 1.25 [-24.96] ( 6.24)
>
>
> ==================================================================
> Test : Various longer running benchmarks
> Units : %diff in throughput reported
> Interpretation: Higher is better
> Statistic : Median
> ==================================================================
> Benchmarks: %diff
> ycsb-cassandra -10.70%
> ycsb-mongodb -13.66%
>
> deathstarbench-1x 13.87%
> deathstarbench-2x 1.70%
> deathstarbench-3x -8.44%
> deathstarbench-6x -3.12%
>
> hammerdb+mysql 16VU -33.50%
> hammerdb+mysql 64VU -33.22%
>
> ---
>
> I'm planning on taking hackbench and schbench as two extreme cases for
> throughput and tail latency and later look at Stream from a "high
> bandwidth, don't consolidate" standpoint. I hope once those cases
> aren't as much in the reds, the larger benchmarks will be happier too.
>
Thanks for running the tests. I think hackbench/schbench would be good
benchmarks to start with. I remember that you and Gautham mentioned at
LPC 2021 or 2022 that schbench prefers to be aggregated in a single LLC.
I ran a schbench test using mmtests on a Xeon server that has 4 NUMA
nodes, each with 80 cores (SMT disabled). The numa=off option was
appended to the boot command line, so the 4 "LLCs" all end up within a
single node.
                                    BASELINE             SCHED_CACHE
Lat 50.0th-qrtle-1 8.00 ( 0.00%) 5.00 ( 37.50%)
Lat 90.0th-qrtle-1 9.00 ( 0.00%) 5.00 ( 44.44%)
Lat 99.0th-qrtle-1 13.00 ( 0.00%) 10.00 ( 23.08%)
Lat 99.9th-qrtle-1 21.00 ( 0.00%) 19.00 ( 9.52%)*
Lat 20.0th-qrtle-1 404.00 ( 0.00%) 411.00 ( -1.73%)
Lat 50.0th-qrtle-2 8.00 ( 0.00%) 5.00 ( 37.50%)
Lat 90.0th-qrtle-2 11.00 ( 0.00%) 8.00 ( 27.27%)
Lat 99.0th-qrtle-2 16.00 ( 0.00%) 11.00 ( 31.25%)
Lat 99.9th-qrtle-2 27.00 ( 0.00%) 17.00 ( 37.04%)*
Lat 20.0th-qrtle-2 823.00 ( 0.00%) 821.00 ( 0.24%)
Lat 50.0th-qrtle-4 10.00 ( 0.00%) 5.00 ( 50.00%)
Lat 90.0th-qrtle-4 12.00 ( 0.00%) 6.00 ( 50.00%)
Lat 99.0th-qrtle-4 18.00 ( 0.00%) 9.00 ( 50.00%)
Lat 99.9th-qrtle-4 29.00 ( 0.00%) 16.00 ( 44.83%)*
Lat 20.0th-qrtle-4 1650.00 ( 0.00%) 1598.00 ( 3.15%)
Lat 50.0th-qrtle-8 9.00 ( 0.00%) 4.00 ( 55.56%)
Lat 90.0th-qrtle-8 11.00 ( 0.00%) 6.00 ( 45.45%)
Lat 99.0th-qrtle-8 16.00 ( 0.00%) 9.00 ( 43.75%)
Lat 99.9th-qrtle-8 28.00 ( 0.00%) 188.00 (-571.43%)*
Lat 20.0th-qrtle-8 3316.00 ( 0.00%) 3100.00 ( 6.51%)
Lat 50.0th-qrtle-16 10.00 ( 0.00%) 5.00 ( 50.00%)
Lat 90.0th-qrtle-16 13.00 ( 0.00%) 7.00 ( 46.15%)
Lat 99.0th-qrtle-16 19.00 ( 0.00%) 12.00 ( 36.84%)
Lat 99.9th-qrtle-16 28.00 ( 0.00%) 2034.00 (-7164.29%)*
Lat 20.0th-qrtle-16 6632.00 ( 0.00%) 5800.00 ( 12.55%)
Lat 50.0th-qrtle-32 7.00 ( 0.00%) 12.00 ( -71.43%)
Lat 90.0th-qrtle-32 10.00 ( 0.00%) 62.00 (-520.00%)
Lat 99.0th-qrtle-32 14.00 ( 0.00%) 841.00 (-5907.14%)
Lat 99.9th-qrtle-32 23.00 ( 0.00%) 1862.00 (-7995.65%)*
Lat 20.0th-qrtle-32 13264.00 ( 0.00%) 10608.00 ( 20.02%)
Lat 50.0th-qrtle-64 7.00 ( 0.00%) 64.00 (-814.29%)
Lat 90.0th-qrtle-64 12.00 ( 0.00%) 709.00 (-5808.33%)
Lat 99.0th-qrtle-64 18.00 ( 0.00%) 2260.00 (-12455.56%)
Lat 99.9th-qrtle-64 26.00 ( 0.00%) 3572.00 (-13638.46%)*
Lat 20.0th-qrtle-64 26528.00 ( 0.00%) 14064.00 ( 46.98%)
Lat 50.0th-qrtle-128 7.00 ( 0.00%) 115.00 (-1542.86%)
Lat 90.0th-qrtle-128 11.00 ( 0.00%) 1626.00 (-14681.82%)
Lat 99.0th-qrtle-128 17.00 ( 0.00%) 4472.00 (-26205.88%)
Lat 99.9th-qrtle-128 27.00 ( 0.00%) 8088.00 (-29855.56%)*
Lat 20.0th-qrtle-128 53184.00 ( 0.00%) 17312.00 ( 67.45%)
Lat 50.0th-qrtle-256 172.00 ( 0.00%) 255.00 ( -48.26%)
Lat 90.0th-qrtle-256 2092.00 ( 0.00%) 1482.00 ( 29.16%)
Lat 99.0th-qrtle-256 2684.00 ( 0.00%) 3148.00 ( -17.29%)
Lat 99.9th-qrtle-256 4504.00 ( 0.00%) 6008.00 ( -33.39%)*
Lat 20.0th-qrtle-256 53056.00 ( 0.00%) 48064.00 ( 9.41%)
Lat 50.0th-qrtle-319 375.00 ( 0.00%) 478.00 ( -27.47%)
Lat 90.0th-qrtle-319 2420.00 ( 0.00%) 2244.00 ( 7.27%)
Lat 99.0th-qrtle-319 4552.00 ( 0.00%) 4456.00 ( 2.11%)
Lat 99.9th-qrtle-319 6072.00 ( 0.00%) 7656.00 ( -26.09%)*
Lat 20.0th-qrtle-319 47936.00 ( 0.00%) 47808.00 ( 0.27%)
We can see that when the system is under-loaded, the 99.9th percentile
wakeup latency improves. But when the system gets busier, say from 8 to
319 threads, the wakeup latency suffers.
The following change, intended to avoid task migration/stacking, could
mitigate the issue:
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index cddd67100a91..a492463aed71 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8801,6 +8801,7 @@ static long __migrate_degrades_locality(struct task_struct *p, int src_cpu, int
 static int select_cache_cpu(struct task_struct *p, int prev_cpu)
 {
 	struct mm_struct *mm = p->mm;
+	struct sched_domain *sd;
 	int cpu;
 
 	if (!sched_feat(SCHED_CACHE))
@@ -8813,6 +8814,8 @@ static int select_cache_cpu(struct task_struct *p, int prev_cpu)
 	if (cpu < 0)
 		return prev_cpu;
 
+	if (cpus_share_cache(prev_cpu, cpu))
+		return prev_cpu;
 
 	if (static_branch_likely(&sched_numa_balancing) &&
 	    __migrate_degrades_locality(p, prev_cpu, cpu, false) > 0) {
@@ -8822,6 +8825,10 @@ static int select_cache_cpu(struct task_struct *p, int prev_cpu)
 		return prev_cpu;
 	}
 
+	sd = rcu_dereference(per_cpu(sd_llc, cpu));
+	if (likely(sd))
+		return cpumask_any(sched_domain_span(sd));
+
 	return cpu;
 }
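
The first hunk keeps the task on prev_cpu whenever it already shares the
cache with the preferred CPU, so no cross-LLC migration is triggered; the
second hunk returns a CPU from the LLC's sched-domain span rather than
always the single preferred CPU, to reduce stacking on that one CPU. With
this change applied, schbench reports: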
                                 BASELINE_sc          SCHED_CACHE_sc
Lat 50.0th-qrtle-1 5.00 ( 0.00%) 5.00 ( 0.00%)
Lat 90.0th-qrtle-1 8.00 ( 0.00%) 5.00 ( 37.50%)
Lat 99.0th-qrtle-1 10.00 ( 0.00%) 10.00 ( 0.00%)
Lat 99.9th-qrtle-1 20.00 ( 0.00%) 20.00 ( 0.00%)*
Lat 20.0th-qrtle-1 409.00 ( 0.00%) 406.00 ( 0.73%)
Lat 50.0th-qrtle-2 8.00 ( 0.00%) 4.00 ( 50.00%)
Lat 90.0th-qrtle-2 11.00 ( 0.00%) 5.00 ( 54.55%)
Lat 99.0th-qrtle-2 16.00 ( 0.00%) 11.00 ( 31.25%)
Lat 99.9th-qrtle-2 29.00 ( 0.00%) 16.00 ( 44.83%)*
Lat 20.0th-qrtle-2 819.00 ( 0.00%) 825.00 ( -0.73%)
Lat 50.0th-qrtle-4 10.00 ( 0.00%) 4.00 ( 60.00%)
Lat 90.0th-qrtle-4 12.00 ( 0.00%) 4.00 ( 66.67%)
Lat 99.0th-qrtle-4 18.00 ( 0.00%) 6.00 ( 66.67%)
Lat 99.9th-qrtle-4 30.00 ( 0.00%) 15.00 ( 50.00%)*
Lat 20.0th-qrtle-4 1658.00 ( 0.00%) 1670.00 ( -0.72%)
Lat 50.0th-qrtle-8 9.00 ( 0.00%) 3.00 ( 66.67%)
Lat 90.0th-qrtle-8 11.00 ( 0.00%) 4.00 ( 63.64%)
Lat 99.0th-qrtle-8 16.00 ( 0.00%) 6.00 ( 62.50%)
Lat 99.9th-qrtle-8 29.00 ( 0.00%) 13.00 ( 55.17%)*
Lat 20.0th-qrtle-8 3308.00 ( 0.00%) 3340.00 ( -0.97%)
Lat 50.0th-qrtle-16 9.00 ( 0.00%) 4.00 ( 55.56%)
Lat 90.0th-qrtle-16 12.00 ( 0.00%) 4.00 ( 66.67%)
Lat 99.0th-qrtle-16 18.00 ( 0.00%) 6.00 ( 66.67%)
Lat 99.9th-qrtle-16 31.00 ( 0.00%) 12.00 ( 61.29%)*
Lat 20.0th-qrtle-16 6616.00 ( 0.00%) 6680.00 ( -0.97%)
Lat 50.0th-qrtle-32 8.00 ( 0.00%) 4.00 ( 50.00%)
Lat 90.0th-qrtle-32 11.00 ( 0.00%) 5.00 ( 54.55%)
Lat 99.0th-qrtle-32 17.00 ( 0.00%) 8.00 ( 52.94%)
Lat 99.9th-qrtle-32 27.00 ( 0.00%) 11.00 ( 59.26%)*
Lat 20.0th-qrtle-32 13296.00 ( 0.00%) 13328.00 ( -0.24%)
Lat 50.0th-qrtle-64 9.00 ( 0.00%) 46.00 (-411.11%)
Lat 90.0th-qrtle-64 14.00 ( 0.00%) 1198.00 (-8457.14%)
Lat 99.0th-qrtle-64 20.00 ( 0.00%) 2252.00 (-11160.00%)
Lat 99.9th-qrtle-64 31.00 ( 0.00%) 2844.00 (-9074.19%)*
Lat 20.0th-qrtle-64 26528.00 ( 0.00%) 15504.00 ( 41.56%)
Lat 50.0th-qrtle-128 7.00 ( 0.00%) 26.00 (-271.43%)
Lat 90.0th-qrtle-128 11.00 ( 0.00%) 2244.00 (-20300.00%)
Lat 99.0th-qrtle-128 17.00 ( 0.00%) 4488.00 (-26300.00%)
Lat 99.9th-qrtle-128 27.00 ( 0.00%) 5752.00 (-21203.70%)*
Lat 20.0th-qrtle-128 53184.00 ( 0.00%) 24544.00 ( 53.85%)
Lat 50.0th-qrtle-256 172.00 ( 0.00%) 135.00 ( 21.51%)
Lat 90.0th-qrtle-256 2084.00 ( 0.00%) 2022.00 ( 2.98%)
Lat 99.0th-qrtle-256 2780.00 ( 0.00%) 3908.00 ( -40.58%)
Lat 99.9th-qrtle-256 4536.00 ( 0.00%) 5832.00 ( -28.57%)*
Lat 20.0th-qrtle-256 53568.00 ( 0.00%) 51904.00 ( 3.11%)
Lat 50.0th-qrtle-319 369.00 ( 0.00%) 358.00 ( 2.98%)
Lat 90.0th-qrtle-319 2428.00 ( 0.00%) 2436.00 ( -0.33%)
Lat 99.0th-qrtle-319 4552.00 ( 0.00%) 4664.00 ( -2.46%)
Lat 99.9th-qrtle-319 6104.00 ( 0.00%) 6632.00 ( -8.65%)*
Lat 20.0th-qrtle-319 48192.00 ( 0.00%) 48832.00 ( -1.33%)
We can see wakeup latency improvements over a wider range of thread
counts. But there is still a regression starting from 64 threads -
perhaps the benefit of LLC locality is offset by task stacking on a
single LLC. One possible direction I'm thinking of is to take a snapshot
of the LLC status during load balance and check whether the LLC is
overloaded; if it is, do not enable this LLC aggregation during task
wakeup - that is, do the check in the load balancer, which runs less
frequently. A rough sketch of that idea is below.
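
As a very rough, untested sketch of that direction (the llc_overloaded
field in struct sched_domain_shared and the update_llc_overloaded() /
llc_overloaded() helpers are made-up names here, and the call site in the
periodic load balancer is only indicative):

/* Hypothetical new field in struct sched_domain_shared: int llc_overloaded; */

/*
 * Take a snapshot of how busy this LLC is. Intended to be called from
 * the periodic load balancer, so the cost is paid at a lower frequency
 * than task wakeup.
 */
static void update_llc_overloaded(int cpu)
{
	struct sched_domain *sd;
	struct sched_domain_shared *sds;
	unsigned int nr_running = 0, nr_cpus = 0;
	int i;

	sd = rcu_dereference(per_cpu(sd_llc, cpu));
	sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
	if (!sd || !sds)
		return;

	for_each_cpu(i, sched_domain_span(sd)) {
		nr_running += cpu_rq(i)->nr_running;
		nr_cpus++;
	}

	/* Consider the LLC overloaded once runnable tasks exceed CPUs. */
	WRITE_ONCE(sds->llc_overloaded, nr_running > nr_cpus);
}

/*
 * In select_cache_cpu(), bail out of the LLC aggregation when the
 * snapshot says the preferred LLC is already overloaded.
 */
static bool llc_overloaded(int cpu)
{
	struct sched_domain_shared *sds;

	sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));

	return sds && READ_ONCE(sds->llc_overloaded);
}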
thanks,
Chenyu