[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <e000b124-afd4-28e1-fde2-393b0e38ce19@amd.com>
Date: Wed, 15 Jun 2022 19:49:22 +0530
From: K Prateek Nayak <kprateek.nayak@....com>
To: Yicong Yang <yangyicong@...ilicon.com>, peterz@...radead.org,
mingo@...hat.com, juri.lelli@...hat.com,
vincent.guittot@...aro.org, tim.c.chen@...ux.intel.com,
gautham.shenoy@....com, linux-kernel@...r.kernel.org,
linux-arm-kernel@...ts.infradead.org
Cc: dietmar.eggemann@....com, rostedt@...dmis.org, bsegall@...gle.com,
bristot@...hat.com, prime.zeng@...wei.com,
jonathan.cameron@...wei.com, ego@...ux.vnet.ibm.com,
srikar@...ux.vnet.ibm.com, linuxarm@...wei.com, 21cnbao@...il.com,
guodong.xu@...aro.org, hesham.almatary@...wei.com,
john.garry@...wei.com, shenyang39@...wei.com
Subject: Re: [PATCH v4 1/2] sched: Add per_cpu cluster domain info and
cpus_share_resources API
Hello Yicong,
I replied to the v3 of the series by mistake!
(https://lore.kernel.org/lkml/0b065646-5b05-cbdb-b20c-e1dfef3f4d79@amd.com/)
But rest assured all the analysis discussed there was done with
the v4 patch series. I'll add the same observations below so we
can continue discussion on v4.
We are observing some serious regression with tbench with this patch
series applied. The issue doesn't seem to be related to the actual
functionality of the patch but how the patch changes the per CPU
variable layout.
Discussed below are the results from running tbench on a dual
socket Zen3 (2 x 64C/128T) configured in different NPS modes.
NPS Modes are used to logically divide single socket into
multiple NUMA region.
Following is the NUMA configuration for each NPS mode on the system:
NPS1: Each socket is a NUMA node.
Total 2 NUMA nodes in the dual socket machine.
Node 0: 0-63, 128-191
Node 1: 64-127, 192-255
NPS2: Each socket is further logically divided into 2 NUMA regions.
Total 4 NUMA nodes exist over 2 socket.
Node 0: 0-31, 128-159
Node 1: 32-63, 160-191
Node 2: 64-95, 192-223
Node 3: 96-127, 223-255
NPS4: Each socket is logically divided into 4 NUMA regions.
Total 8 NUMA nodes exist over 2 socket.
Node 0: 0-15, 128-143
Node 1: 16-31, 144-159
Node 2: 32-47, 160-175
Node 3: 48-63, 176-191
Node 4: 64-79, 192-207
Node 5: 80-95, 208-223
Node 6: 96-111, 223-231
Node 7: 112-127, 232-255
Benchmark Results:
Kernel versions:
- tip: 5.19.0-rc2 tip sched/core
- cluster: 5.19.0-rc2 tip sched/core + both the patches of the series
When we started testing, the tip was at:
commit: f3dd3f674555 "sched: Remove the limitation of WF_ON_CPU on wakelist if wakee cpu is idle"
* - Data points of concern
~~~~~~
tbench
~~~~~~
NPS1
Clients: tip cluster
1 444.41 (0.00 pct) 439.27 (-1.15 pct)
2 879.23 (0.00 pct) 831.49 (-5.42 pct) *
4 1648.83 (0.00 pct) 1608.07 (-2.47 pct)
8 3263.81 (0.00 pct) 3086.81 (-5.42 pct) *
16 6011.19 (0.00 pct) 5360.28 (-10.82 pct) *
32 12058.31 (0.00 pct) 8769.08 (-27.27 pct) *
64 21258.21 (0.00 pct) 19021.09 (-10.52 pct) *
128 30795.27 (0.00 pct) 30861.34 (0.21 pct)
256 25138.43 (0.00 pct) 24711.90 (-1.69 pct)
512 51287.93 (0.00 pct) 51855.55 (1.10 pct)
1024 53176.97 (0.00 pct) 52554.55 (-1.17 pct)
NPS2
Clients: tip cluster
1 445.45 (0.00 pct) 441.75 (-0.83 pct)
2 869.24 (0.00 pct) 845.61 (-2.71 pct)
4 1644.28 (0.00 pct) 1586.49 (-3.51 pct)
8 3120.83 (0.00 pct) 2967.01 (-4.92 pct) *
16 5972.29 (0.00 pct) 5208.58 (-12.78 pct) *
32 11776.38 (0.00 pct) 10229.53 (-13.13 pct) *
64 20933.15 (0.00 pct) 17033.45 (-18.62 pct) *
128 32195.00 (0.00 pct) 29507.85 (-8.34 pct) *
256 24641.52 (0.00 pct) 27225.00 (10.48 pct)
512 50806.96 (0.00 pct) 51377.50 (1.12 pct)
1024 51993.96 (0.00 pct) 50773.35 (-2.34 pct)
NPS4
Clients: tip cluster
1 442.10 (0.00 pct) 435.06 (-1.59 pct)
2 870.94 (0.00 pct) 858.64 (-1.41 pct)
4 1615.30 (0.00 pct) 1607.27 (-0.49 pct)
8 3195.95 (0.00 pct) 3020.63 (-5.48 pct) *
16 5937.41 (0.00 pct) 5719.87 (-3.66 pct)
32 11800.41 (0.00 pct) 11229.65 (-4.83 pct) *
64 20844.71 (0.00 pct) 20432.79 (-1.97 pct)
128 31003.62 (0.00 pct) 29441.20 (-5.03 pct) *
256 27476.37 (0.00 pct) 25857.30 (-5.89 pct) * [Know to have run to run variance]
512 52276.72 (0.00 pct) 51659.16 (-1.18 pct)
1024 51372.10 (0.00 pct) 51026.87 (-0.67 pct)
Note: tbench results for 256 workers are known to have
run to run variation on the test machine. Any regression
seen for the data point can be safely ignored.
The behavior is consistent for both tip and patched kernel
across multiple runs of tbench.
~~~~~~~~~~~~~~~~~~~~
Analysis done so far
~~~~~~~~~~~~~~~~~~~~
To root cause this issue quicker, we have focused on 8 to 64 clients
data points with the machine running in NPS1 mode.
- Even on disabling HW prefetcher, the behavior remains consistent
signifying HW prefetcher is not the cause of the problem.
- Bisecting:
When we ran the tests with only Patch 1 of the series, the
regression was visible and the numbers were worse.
Clients: tip cluster Patch 1 Only
8 3263.81 (0.00 pct) 3086.81 (-5.42 pct) 3018.63 (-7.51 pct)
16 6011.19 (0.00 pct) 5360.28 (-10.82 pct) 4869.26 (-18.99 pct)
32 12058.31 (0.00 pct) 8769.08 (-27.27 pct) 8159.60 (-32.33 pct)
64 21258.21 (0.00 pct) 19021.09 (-10.52 pct) 13161.92 (-38.08 pct)
We further bisected the hunks to narrow down the cause to the per CPU
variable declarations.
On 6/9/2022 5:36 PM, Yicong Yang wrote:
>
> [..snip..]
>
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 01259611beb9..b9bcfcf8d14d 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1753,7 +1753,9 @@ static inline struct sched_domain *lowest_flag_domain(int cpu, int flag)
> DECLARE_PER_CPU(struct sched_domain __rcu *, sd_llc);
> DECLARE_PER_CPU(int, sd_llc_size);
> DECLARE_PER_CPU(int, sd_llc_id);
> +DECLARE_PER_CPU(int, sd_share_id);
> DECLARE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
> +DECLARE_PER_CPU(struct sched_domain __rcu *, sd_cluster);
The main reason for the regression seems to be the above declarations.
The regression seem to go away if we do one of the following:
- Declare sd_share_id and sd_cluster using DECLARE_PER_CPU_READ_MOSTLY()
instead of DECLARE_PER_CPU() and change the corresponding definition
below to DEFINE_PER_CPU_READ_MOSTLY().
Clients: tip Patch 1 Patch 1 (READ_MOSTLY)
8 3255.69 (0.00 pct) 3018.63 (-7.28 pct) 3237.33 (-0.56 pct)
16 6092.67 (0.00 pct) 4869.26 (-20.08 pct) 5914.53 (-2.92 pct)
32 11156.56 (0.00 pct) 8159.60 (-26.86 pct) 11536.05 (3.40 pct)
64 21019.97 (0.00 pct) 13161.92 (-37.38 pct) 21162.33 (0.67 pct)
- Convert sd_share_id and sd_cluster to static arrays.
Clients: tip Patch 1 Patch 1 (Static Array)
8 3255.69 (0.00 pct) 3018.63 (-7.28 pct) 3203.27 (-1.61 pct)
16 6092.67 (0.00 pct) 4869.26 (-20.08 pct) 6198.35 (1.73 pct)
32 11156.56 (0.00 pct) 8159.60 (-26.86 pct) 11385.76 (2.05 pct)
64 21019.97 (0.00 pct) 13161.92 (-37.38 pct) 21919.80 (4.28 pct)
- Move the declarations of sd_share_id and sd_cluster to the top
Clients: tip Patch 1 Patch 1 (Declarion on Top)
8 3255.69 (0.00 pct) 3018.63 (-7.28 pct) 3072.30 (-5.63 pct)
16 6092.67 (0.00 pct) 4869.26 (-20.08 pct) 5586.59 (-8.30 pct)
32 11156.56 (0.00 pct) 8159.60 (-26.86 pct) 11184.17 (0.24 pct)
64 21019.97 (0.00 pct) 13161.92 (-37.38 pct) 20289.70 (-3.47 pct)
Unfortunately, none of these are complete solutions. For example, using
DECLARE_PER_CPU_READ_MOSTLY() with both patches applied reduces the regression
but doesn't eliminate it entirely:
Clients: tip cluster cluster (READ_MOSTLY)
1 444.41 (0.00 pct) 439.27 (-1.15 pct) 435.95 (-1.90 pct)
2 879.23 (0.00 pct) 831.49 (-5.42 pct) 842.09 (-4.22 pct)
4 1648.83 (0.00 pct) 1608.07 (-2.47 pct) 1598.77 (-3.03 pct)
8 3263.81 (0.00 pct) 3086.81 (-5.42 pct) 3090.86 (-5.29 pct) *
16 6011.19 (0.00 pct) 5360.28 (-10.82 pct) 5360.28 (-10.82 pct) *
32 12058.31 (0.00 pct) 8769.08 (-27.27 pct) 11083.66 (-8.08 pct) *
64 21258.21 (0.00 pct) 19021.09 (-10.52 pct) 20984.30 (-1.28 pct)
128 30795.27 (0.00 pct) 30861.34 (0.21 pct) 30735.20 (-0.19 pct)
256 25138.43 (0.00 pct) 24711.90 (-1.69 pct) 24021.21 (-4.44 pct)
512 51287.93 (0.00 pct) 51855.55 (1.10 pct) 51672.73 (0.75 pct)
1024 53176.97 (0.00 pct) 52554.55 (-1.17 pct) 52620.02 (-1.04 pct)
We are still trying to root cause the underlying issue that
brought about such drastic regression in tbench performance.
> DECLARE_PER_CPU(struct sched_domain __rcu *, sd_numa);
> DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);
> DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_cpucapacity);
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index 05b6c2ad90b9..0595827d481d 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -664,6 +664,8 @@ static void destroy_sched_domains(struct sched_domain *sd)
> DEFINE_PER_CPU(struct sched_domain __rcu *, sd_llc);
> DEFINE_PER_CPU(int, sd_llc_size);
> DEFINE_PER_CPU(int, sd_llc_id);
> +DEFINE_PER_CPU(int, sd_share_id);
> +DEFINE_PER_CPU(struct sched_domain __rcu *, sd_cluster);
> DEFINE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
> DEFINE_PER_CPU(struct sched_domain __rcu *, sd_numa);
> DEFINE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);
>
> [..snip..]
>
We would like some time to investigate this issue and root cause
the reason for this regression.
--
Thanks and Regards,
Prateek
Powered by blists - more mailing lists