Message-Id: <20230817153515.143932-1-mathieu.desnoyers@efficios.com>
Date: Thu, 17 Aug 2023 11:35:15 -0400
From: Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
To: Peter Zijlstra <peterz@...radead.org>
Cc: linux-kernel@...r.kernel.org,
Mathieu Desnoyers <mathieu.desnoyers@...icios.com>,
Ingo Molnar <mingo@...hat.com>,
Valentin Schneider <vschneid@...hat.com>,
Steven Rostedt <rostedt@...dmis.org>,
Ben Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>,
Daniel Bristot de Oliveira <bristot@...hat.com>,
Vincent Guittot <vincent.guittot@...aro.org>,
Juri Lelli <juri.lelli@...hat.com>,
Swapnil Sapkal <Swapnil.Sapkal@....com>,
Aaron Lu <aaron.lu@...el.com>, x86@...nel.org
Subject: [RFC PATCH 1/1] sched: ttwu_queue_cond: perform queued wakeups across different L2 caches
Skipping queued wakeups for all logical CPUs sharing an LLC means that
on a 192-core AMD EPYC 9654 system (two 96-core sockets), groups of
8 cores (16 hardware threads) end up grabbing the runqueue locks of
the other runqueues within the same group for each wakeup, causing
contention on the runqueue locks.

Improve this by considering only hardware threads sharing an L2 cache
as candidates for skipping the queued wakeup.
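In topology-mask terms, this narrows the wakeup fast path from the LLC
mask to the L2 (cluster) mask. Roughly, with the x86 helpers this patch
relies on (the thread counts given in the comments are for the EPYC
9654 system above):

	/* Hardware threads sharing the LLC (L3) with @cpu:
	 * 16 threads per group on the EPYC 9654 above. */
	const struct cpumask *llc_mask = cpu_llc_shared_mask(cpu);

	/* Hardware threads sharing an L2 with @cpu: only the
	 * 2 SMT siblings of the core on that same system. */
	const struct cpumask *l2_mask = cpu_clustergroup_mask(cpu);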
This results in the following benchmark improvements:

	hackbench -g 32 -f 20 --threads --pipe -l 480000 -s 100

goes from 49s to 34s (30% speedup), and similarly with perf bench:

	perf bench sched messaging -g 32 -p -t -l 100000

goes from 10.9s to 7.4s (32% speedup).
This was developed as part of the investigation into a weird regression
reported by AMD where adding a raw spinlock in the scheduler context
switch accelerated hackbench. It turned out that replacing this raw
spinlock with a loop of 10000 cpu_relax() calls within do_idle() had
similar benefits.
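For reference, a minimal sketch of that experiment (hypothetical
helper, not part of this patch; the 10000 iteration count is the one
quoted above and was found empirically):

	/*
	 * Arbitrary busy-wait inserted in do_idle() for the
	 * experiment described above.
	 */
	static inline void idle_relax_delay(void)
	{
		int i;

		for (i = 0; i < 10000; i++)
			cpu_relax();
	}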
This patch achieves a similar effect without busy-waiting and without
changing anything about runqueue selection on wakeup: only hardware
threads sharing an L2 cache may skip the queued try-to-wakeup and
directly grab the target runqueue lock, rather than all hardware
threads sharing an LLC.
I would be interested in feedback about the performance impact of this
patch (improvement or regression) on other workloads and hardware,
especially Intel CPUs. One thing we might want to figure out
empirically from the topology is whether there is a maximum number of
hardware threads within an LLC below which it would make sense to keep
using the LLC rather than the L2 as the group within which queued
wakeups can be skipped.
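A rough sketch of what such a heuristic could look like (hypothetical
function name and threshold value; sd_llc_size is the existing per-CPU
count of hardware threads in the LLC):

	/*
	 * Hypothetical heuristic, not part of this patch: keep
	 * using the LLC as the wakeup-skip group when it is small
	 * enough. SMALL_LLC_NR_THREADS is a made-up tunable.
	 */
	#define SMALL_LLC_NR_THREADS	16

	static bool ttwu_skip_group(int this_cpu, int that_cpu)
	{
		if (per_cpu(sd_llc_size, this_cpu) <= SMALL_LLC_NR_THREADS)
			return cpus_share_cache(this_cpu, that_cpu);
		return cpus_share_cluster(this_cpu, that_cpu);
	}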
[ Only tested on AMD CPUs so far. ]
Link: https://lore.kernel.org/r/09e0f469-a3f7-62ef-75a1-e64cec2dcfc5@amd.com
Link: https://lore.kernel.org/lkml/20230725193048.124796-1-mathieu.desnoyers@efficios.com/
Link: https://lore.kernel.org/lkml/20230810140635.75296-1-mathieu.desnoyers@efficios.com/
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
Cc: Ingo Molnar <mingo@...hat.com>
Cc: Peter Zijlstra <peterz@...radead.org>
Cc: Valentin Schneider <vschneid@...hat.com>
Cc: Steven Rostedt <rostedt@...dmis.org>
Cc: Ben Segall <bsegall@...gle.com>
Cc: Mel Gorman <mgorman@...e.de>
Cc: Daniel Bristot de Oliveira <bristot@...hat.com>
Cc: Vincent Guittot <vincent.guittot@...aro.org>
Cc: Juri Lelli <juri.lelli@...hat.com>
Cc: Swapnil Sapkal <Swapnil.Sapkal@....com>
Cc: Aaron Lu <aaron.lu@...el.com>
Cc: x86@...nel.org
---
 arch/Kconfig                   |  6 ++++++
 arch/x86/Kconfig               |  1 +
 drivers/base/Kconfig           |  1 +
 include/linux/sched/topology.h |  3 ++-
 kernel/sched/core.c            | 26 +++++++++++++++++++++++---
 5 files changed, 33 insertions(+), 4 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index 205fd23e0cad..e5aac1741712 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -340,6 +340,12 @@ config HAVE_ASM_MODVERSIONS
 	  <asm/asm-prototypes.h> to support the module versioning for symbols
 	  exported from assembly code.
 
+config HAVE_CLUSTERGROUP
+	bool
+	help
+	  This symbol should be selected by an architecture if it
+	  implements CPU clustergroup.
+
 config HAVE_REGS_AND_STACK_ACCESS_API
 	bool
 	help
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index cb1031018afa..07813a1a9a58 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -299,6 +299,7 @@ config X86
 	select FUNCTION_ALIGNMENT_4B
 	imply IMA_SECURE_AND_OR_TRUSTED_BOOT	if EFI
 	select HAVE_DYNAMIC_FTRACE_NO_PATCHABLE
+	select HAVE_CLUSTERGROUP
 
 config INSTRUCTION_DECODER
 	def_bool y
diff --git a/drivers/base/Kconfig b/drivers/base/Kconfig
index 2b8fd6bb7da0..408aaf7a4bd1 100644
--- a/drivers/base/Kconfig
+++ b/drivers/base/Kconfig
@@ -218,6 +218,7 @@ config DMA_FENCE_TRACE
 
 config GENERIC_ARCH_TOPOLOGY
 	bool
+	select HAVE_CLUSTERGROUP
 	help
 	  Enable support for architectures common topology code: e.g., parsing
 	  CPU capacity information from DT, usage of such information for
diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 816df6cc444e..714386070463 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -178,7 +178,8 @@ extern void partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
 cpumask_var_t *alloc_sched_domains(unsigned int ndoms);
 void free_sched_domains(cpumask_var_t doms[], unsigned int ndoms);
 
-bool cpus_share_cache(int this_cpu, int that_cpu);
+bool cpus_share_cluster(int this_cpu, int that_cpu); /* Share L2. */
+bool cpus_share_cache(int this_cpu, int that_cpu); /* Share LLC. */
 
 typedef const struct cpumask *(*sched_domain_mask_f)(int cpu);
 typedef int (*sched_domain_flags_f)(void);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index a68d1276bab0..ce3402b81e5e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3904,13 +3904,33 @@ void wake_up_if_idle(int cpu)
 	rcu_read_unlock();
 }
 
+/*
+ * Query whether CPUs share LLC.
+ */
 bool cpus_share_cache(int this_cpu, int that_cpu)
+{
+	return per_cpu(sd_llc_id, this_cpu) == per_cpu(sd_llc_id, that_cpu);
+}
+
+#ifdef CONFIG_HAVE_CLUSTERGROUP
+/*
+ * Query whether CPUs share L2 cache.
+ */
+bool cpus_share_cluster(int this_cpu, int that_cpu)
 {
 	if (this_cpu == that_cpu)
 		return true;
-
-	return per_cpu(sd_llc_id, this_cpu) == per_cpu(sd_llc_id, that_cpu);
+	return cpumask_test_cpu(that_cpu, cpu_clustergroup_mask(this_cpu));
+}
+#else
+/*
+ * Fall-back on querying whether CPUs share LLC.
+ */
+bool cpus_share_cluster(int this_cpu, int that_cpu)
+{
+	return cpus_share_cache(this_cpu, that_cpu);
 }
+#endif
 
 static inline bool ttwu_queue_cond(struct task_struct *p, int cpu)
 {
@@ -3929,7 +3949,7 @@ static inline bool ttwu_queue_cond(struct task_struct *p, int cpu)
 	 * If the CPU does not share cache, then queue the task on the
 	 * remote rqs wakelist to avoid accessing remote data.
 	 */
-	if (!cpus_share_cache(smp_processor_id(), cpu))
+	if (!cpus_share_cluster(smp_processor_id(), cpu))
 		return true;
 
 	if (cpu == smp_processor_id())
--
2.39.2