Message-Id: <1617341874-1205-1-git-send-email-ego@linux.vnet.ibm.com>
Date: Fri, 2 Apr 2021 11:07:54 +0530
From: "Gautham R. Shenoy" <ego@...ux.vnet.ibm.com>
To: Michael Ellerman <mpe@...erman.id.au>,
Michael Neuling <mikey@...ling.org>,
Mel Gorman <mgorman@...hsingularity.net>,
Rik van Riel <riel@...riel.com>,
Srikar Dronamraju <srikar@...ux.vnet.ibm.com>,
Valentin Schneider <valentin.schneider@....com>,
Vincent Guittot <vincent.guittot@...aro.org>,
Dietmar Eggemann <dietmar.eggemann@....com>,
Nicholas Piggin <npiggin@...il.com>,
Anton Blanchard <anton@...abs.org>,
Parth Shah <parth@...ux.ibm.com>,
Vaidyanathan Srinivasan <svaidy@...ux.vnet.ibm.com>
Cc: LKML <linux-kernel@...r.kernel.org>, linuxppc-dev@...ts.ozlabs.org,
"Gautham R. Shenoy" <ego@...ux.vnet.ibm.com>
Subject: [RFC/PATCH] powerpc/smp: Add SD_SHARE_PKG_RESOURCES flag to MC sched-domain
From: "Gautham R. Shenoy" <ego@...ux.vnet.ibm.com>
On POWER10 systems, the L2 cache is at the SMT4 small-core level. The
following commits ensure that the L2 cache gets correctly discovered and
that the Last-Level-Cache (LLC) domain is set to the SMT sched-domain.
790a166 powerpc/smp: Parse ibm,thread-groups with multiple properties
1fdc1d6 powerpc/smp: Rename cpu_l1_cache_map as thread_group_l1_cache_map
fbd2b67 powerpc/smp: Rename init_thread_group_l1_cache_map() to make
it generic
538abe powerpc/smp: Add support detecting thread-groups sharing L2 cache
0be4763 powerpc/cacheinfo: Print correct cache-sibling map/list for L2 cache
However, with the LLC now on the SMT sched-domain, we are seeing some
regressions in the performance of applications that require
single-threaded performance. The reason for this is as follows:
Prior to the change (we call this P9-sched below), the sched-domain
hierarchy was:
SMT (SMT4) --> CACHE (SMT8)[LLC] --> MC (Hemisphere) --> DIE
where the CACHE sched-domain is defined to be the Last Level Cache (LLC).
On the upstream kernel, with the aforementioned commits (P10-sched),
the sched-domain hierarchy is:
SMT (SMT4)[LLC] --> MC (Hemisphere) --> DIE
with the SMT sched-domain as the LLC.
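(For reference: the scheduler treats as the LLC the highest sched-domain
that has SD_SHARE_PKG_RESOURCES set (see update_top_cache_domain() and
highest_flag_domain() in kernel/sched/topology.c), which is why setting
the flag at a given level moves the LLC there. The stand-alone sketch
below models that walk with a toy domain list; the struct, the helper
and the CPU spans are made up for illustration and are not the kernel's.)

/*
 * Toy model of how the scheduler derives the LLC from the sched-domain
 * hierarchy: walk upwards and remember the highest level that still has
 * the shared-package-resources flag set.  Illustrative only; the real
 * logic lives in update_top_cache_domain() in kernel/sched/topology.c.
 */
#include <stdio.h>

#define TOY_SHARE_PKG_RESOURCES 0x1     /* stand-in for SD_SHARE_PKG_RESOURCES */

struct toy_domain {
        const char *name;
        int flags;
        int span;                       /* number of CPUs in this domain */
        struct toy_domain *parent;      /* next larger domain, NULL at the top */
};

static struct toy_domain *toy_llc(struct toy_domain *sd)
{
        struct toy_domain *llc = NULL;

        for (; sd; sd = sd->parent) {
                if (!(sd->flags & TOY_SHARE_PKG_RESOURCES))
                        break;
                llc = sd;
        }
        return llc;
}

int main(void)
{
        /* Spans are illustrative for a 10-core (80 CPU) P10. */
        struct toy_domain die = { "DIE", 0, 80, NULL };
        struct toy_domain mc  = { "MC",  0, 40, &die };
        struct toy_domain smt = { "SMT", TOY_SHARE_PKG_RESOURCES, 4, &mc };
        struct toy_domain *llc;

        llc = toy_llc(&smt);
        printf("P10-sched: LLC is %s (%d CPUs)\n", llc->name, llc->span);

        /* This patch additionally sets the flag at the MC level: */
        mc.flags |= TOY_SHARE_PKG_RESOURCES;
        llc = toy_llc(&smt);
        printf("MC-LLC   : LLC is %s (%d CPUs)\n", llc->name, llc->span);
        return 0;
}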
When the scheduler tries to wake up a task, it chooses between the
waker's CPU and the wakee's previous CPU. Call this choice the
"target"; within the target's LLC domain, the scheduler
a) tries to find an idle core in the LLC. This helps exploit the
   SMT folding that the wakee task can benefit from. If an idle
   core is found, the wakee is woken up on it.
b) Failing to find an idle core, the scheduler tries to find an idle
   CPU in the LLC. This helps minimise the wakeup latency for the
   wakee, since it gets to run on a CPU immediately.
c) Failing that, it wakes the task up on the target CPU (a toy
   sketch of this selection order follows below).
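To make this concrete, here is a stand-alone toy model of steps (a)-(c).
It is not the kernel's select_idle_sibling()/select_idle_core() code,
just an illustration (with made-up CPU states) of how a larger LLC
search space increases the chance of landing on an idle core.

/*
 * Toy model of the wakeup CPU selection in steps (a)-(c) above.
 * Not the kernel's select_idle_sibling(); illustrative only.
 */
#include <stdbool.h>
#include <stdio.h>

#define CPUS_PER_CORE 4                 /* SMT4 */

/* idle[] has one entry per CPU in the target's LLC: true if idle. */
static int pick_wakeup_cpu(const bool *idle, int llc_cpus, int target)
{
        /* (a) look for a fully idle core inside the LLC */
        for (int c = 0; c < llc_cpus; c += CPUS_PER_CORE) {
                bool core_idle = true;

                for (int t = 0; t < CPUS_PER_CORE; t++)
                        core_idle &= idle[c + t];
                if (core_idle)
                        return c;       /* first CPU of the idle core */
        }

        /* (b) failing that, any idle CPU inside the LLC */
        for (int cpu = 0; cpu < llc_cpus; cpu++)
                if (idle[cpu])
                        return cpu;

        /* (c) failing that, fall back to the target CPU */
        return target;
}

int main(void)
{
        /* SMT LLC (4 CPUs): the lone core is busy, one thread is idle */
        bool smt_llc[4]   = { false, true, false, false };
        /* CACHE LLC (8 CPUs): the second SMT4 core is fully idle */
        bool cache_llc[8] = { false, true, false, false,
                              true,  true, true,  true };

        printf("LLC=SMT4: wake on CPU %d\n", pick_wakeup_cpu(smt_llc, 4, 0));
        printf("LLC=SMT8: wake on CPU %d\n", pick_wakeup_cpu(cache_llc, 8, 0));
        return 0;
}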
Thus, with the P9-sched topology, since the CACHE domain comprises
two SMT4 cores, there is a decent chance of finding an idle core,
failing which there is a relatively higher probability of finding an
idle CPU among the 8 threads in the domain.
However, in the P10-sched topology, since the SMT domain is the LLC
and it contains only a single SMT4 core, the probability of finding
that core idle is lower. Furthermore, since there are only 4 CPUs to
search, there is also a lower probability of finding an idle CPU to
wake the task up on.
Thus applications which require single-threaded performance end up
being woken up on a potentially busy core, even though there are idle
cores elsewhere in the system.
To remedy this, the patch proposes moving the LLC to the MC level,
which is a group of cores in one half of the chip:
SMT (SMT4) --> MC (Hemisphere)[LLC] --> DIE
While no cache is actually shared at this level, it is still the
level at which some amount of cache-snooping takes place, and it is
relatively faster to access data from the caches of the other cores
within this domain. With this change, we no longer see regressions on
P10 for applications which require single-threaded performance.
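After applying the patch, the placement of the LLC can be cross-checked
from userspace: with CONFIG_SCHED_DEBUG, 5.12-era kernels expose the
per-CPU sched-domain names and flags under /proc/sys/kernel/sched_domain/
(newer kernels moved this to debugfs), and the highest level carrying
SD_SHARE_PKG_RESOURCES is the LLC. A minimal sketch, assuming that
procfs layout:

/*
 * Print each sched-domain's name and flags for CPU 0, so one can see
 * which levels carry SD_SHARE_PKG_RESOURCES (i.e. where the LLC sits).
 * Assumes CONFIG_SCHED_DEBUG and the 5.12-era procfs layout.
 */
#include <stdio.h>

int main(void)
{
        char path[128], buf[512];

        for (int d = 0; ; d++) {
                snprintf(path, sizeof(path),
                         "/proc/sys/kernel/sched_domain/cpu0/domain%d/name", d);
                FILE *f = fopen(path, "r");
                if (!f)
                        break;          /* no more domain levels */
                if (fgets(buf, sizeof(buf), f))
                        printf("domain%d: %s", d, buf);
                fclose(f);

                snprintf(path, sizeof(path),
                         "/proc/sys/kernel/sched_domain/cpu0/domain%d/flags", d);
                f = fopen(path, "r");
                if (f) {
                        if (fgets(buf, sizeof(buf), f))
                                printf("  flags: %s", buf);
                        fclose(f);
                }
        }
        return 0;
}

With the patch applied, the MC level should now report
SD_SHARE_PKG_RESOURCES in addition to the levels below it.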
The patch also improves the schbench tail latencies at higher
messenger-thread counts and the usecs/op of "perf bench sched pipe".
On a 10-core P10 system with 80 CPUs:
schbench
============
(https://git.kernel.org/pub/scm/linux/kernel/git/mason/schbench.git/)
Values: lower is better.
The 99th percentile is the tail latency.

                      99th percentile latency
No. messenger      5.12-rc4          5.12-rc4
threads            P10-sched         MC-LLC
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1                    70 us             85 us
2                    81 us            101 us
3                    92 us            107 us
4                    96 us            110 us
5                   103 us            123 us
6                  3412 us  ---->     122 us
7                  1490 us            136 us
8                  6200 us           3572 us
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Hackbench
============
(perf bench sched pipe)
Values: lower is better.

No. of parallel    5.12-rc4          5.12-rc4
instances          P10-sched         MC-LLC
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1                  24.04 us/op       18.72 us/op
2                  24.04 us/op       18.65 us/op
4                  24.01 us/op       18.76 us/op
8                  24.10 us/op       19.11 us/op
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Signed-off-by: Gautham R. Shenoy <ego@...ux.vnet.ibm.com>
---
arch/powerpc/kernel/smp.c | 9 ++++++++-
1 file changed, 8 insertions(+), 1 deletion(-)
diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index 5a4d59a..c75dbd4 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -976,6 +976,13 @@ static bool has_coregroup_support(void)
 	return coregroup_enabled;
 }
 
+static int powerpc_mc_flags(void)
+{
+	if (has_coregroup_support())
+		return SD_SHARE_PKG_RESOURCES;
+	return 0;
+}
+
 static const struct cpumask *cpu_mc_mask(int cpu)
 {
 	return cpu_coregroup_mask(cpu);
@@ -986,7 +993,7 @@ static const struct cpumask *cpu_mc_mask(int cpu)
 	{ cpu_smt_mask, powerpc_smt_flags, SD_INIT_NAME(SMT) },
 #endif
 	{ shared_cache_mask, powerpc_shared_cache_flags, SD_INIT_NAME(CACHE) },
-	{ cpu_mc_mask, SD_INIT_NAME(MC) },
+	{ cpu_mc_mask, powerpc_mc_flags, SD_INIT_NAME(MC) },
 	{ cpu_cpu_mask, SD_INIT_NAME(DIE) },
 	{ NULL, },
 };
--
1.9.4