linux-kernel - Re: [RFC PATCH 1/1] sched: ttwu_queue_cond: perform queued wakeups across different L2 caches

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAKfTPtCavFCk+1cJe2=zFa7WfiX4XGMdc5AsA_2r4xqsk+4v7Q@mail.gmail.com>
Date:   Thu, 17 Aug 2023 18:01:05 +0200
From:   Vincent Guittot <vincent.guittot@...aro.org>
To:     Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
Cc:     Peter Zijlstra <peterz@...radead.org>,
        linux-kernel@...r.kernel.org, Ingo Molnar <mingo@...hat.com>,
        Valentin Schneider <vschneid@...hat.com>,
        Steven Rostedt <rostedt@...dmis.org>,
        Ben Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>,
        Daniel Bristot de Oliveira <bristot@...hat.com>,
        Juri Lelli <juri.lelli@...hat.com>,
        Swapnil Sapkal <Swapnil.Sapkal@....com>,
        Aaron Lu <aaron.lu@...el.com>, x86@...nel.org
Subject: Re: [RFC PATCH 1/1] sched: ttwu_queue_cond: perform queued wakeups
 across different L2 caches

On Thu, 17 Aug 2023 at 17:34, Mathieu Desnoyers
<mathieu.desnoyers@...icios.com> wrote:
>
> Skipping queued wakeups for all logical CPUs sharing an LLC means that
> on a 192 cores AMD EPYC 9654 96-Core Processor (over 2 sockets), groups
> of 8 cores (16 hardware threads) end up grabbing runqueue locks of other
> runqueues within the same group for each wakeup, causing contention on
> the runqueue locks.
>
> Improve this by only considering hardware threads sharing an L2 cache as
> candidates for skipping use of the queued wakeups.
>
> This results in the following benchmark improvements:
>
>     hackbench -g 32 -f 20 --threads --pipe -l 480000 -s 100
>
> from 49s to 34s. (30% speedup)
>
> And similarly with perf bench:
>
>     perf bench sched messaging -g 32 -p -t -l 100000
>
> from 10.9s to 7.4s (32% speedup)
>
> This was developed as part of the investigation into a weird regression
> reported by AMD where adding a raw spinlock in the scheduler context
> switch accelerated hackbench. It turned out that changing this raw
> spinlock for a loop of 10000x cpu_relax within do_idle() had similar
> benefits.
>
> This patch achieves a similar effect without busy waiting nor changing
> anything about runqueue selection on wakeup. It considers that only
> hardware threads sharing an L2 cache should skip the queued
> try-to-wakeup and directly grab the target runqueue lock, rather than
> allowing all hardware threads sharing an LLC to do so.
>
> I would be interested to hear feedback about performance impact of this
> patch (improvement or regression) on other workloads and hardware,
> especially for Intel CPUs. One thing that we might want to empirically
> figure out from the topology is whether there is a maximum number of
> hardware threads within an LLC below which it would make sense to use
> the LLC rather than L2 as group within which queued wakeups can be
> skipped.
>
> [ Only tested on AMD CPUs so far. ]
>
> Link: https://lore.kernel.org/r/09e0f469-a3f7-62ef-75a1-e64cec2dcfc5@amd.com
> Link: https://lore.kernel.org/lkml/20230725193048.124796-1-mathieu.desnoyers@efficios.com/
> Link: https://lore.kernel.org/lkml/20230810140635.75296-1-mathieu.desnoyers@efficios.com/
> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
> Cc: Ingo Molnar <mingo@...hat.com>
> Cc: Peter Zijlstra <peterz@...radead.org>
> Cc: Valentin Schneider <vschneid@...hat.com>
> Cc: Steven Rostedt <rostedt@...dmis.org>
> Cc: Ben Segall <bsegall@...gle.com>
> Cc: Mel Gorman <mgorman@...e.de>
> Cc: Daniel Bristot de Oliveira <bristot@...hat.com>
> Cc: Vincent Guittot <vincent.guittot@...aro.org>
> Cc: Juri Lelli <juri.lelli@...hat.com>
> Cc: Swapnil Sapkal <Swapnil.Sapkal@....com>
> Cc: Aaron Lu <aaron.lu@...el.com>
> Cc: x86@...nel.org
> ---
>  arch/Kconfig                   |  6 ++++++
>  arch/x86/Kconfig               |  1 +
>  drivers/base/Kconfig           |  1 +
>  include/linux/sched/topology.h |  3 ++-
>  kernel/sched/core.c            | 26 +++++++++++++++++++++++---
>  5 files changed, 33 insertions(+), 4 deletions(-)
>
> diff --git a/arch/Kconfig b/arch/Kconfig
> index 205fd23e0cad..e5aac1741712 100644
> --- a/arch/Kconfig
> +++ b/arch/Kconfig
> @@ -340,6 +340,12 @@ config HAVE_ASM_MODVERSIONS
>           <asm/asm-prototypes.h> to support the module versioning for symbols
>           exported from assembly code.
>
> +config HAVE_CLUSTERGROUP
> +       bool
> +       help
> +         This symbol should be selected by an architecture if it
> +         implements CPU clustergroup.
> +
>  config HAVE_REGS_AND_STACK_ACCESS_API
>         bool
>         help
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index cb1031018afa..07813a1a9a58 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -299,6 +299,7 @@ config X86
>         select FUNCTION_ALIGNMENT_4B
>         imply IMA_SECURE_AND_OR_TRUSTED_BOOT    if EFI
>         select HAVE_DYNAMIC_FTRACE_NO_PATCHABLE
> +       select HAVE_CLUSTERGROUP
>
>  config INSTRUCTION_DECODER
>         def_bool y
> diff --git a/drivers/base/Kconfig b/drivers/base/Kconfig
> index 2b8fd6bb7da0..408aaf7a4bd1 100644
> --- a/drivers/base/Kconfig
> +++ b/drivers/base/Kconfig
> @@ -218,6 +218,7 @@ config DMA_FENCE_TRACE
>
>  config GENERIC_ARCH_TOPOLOGY
>         bool
> +       select HAVE_CLUSTERGROUP
>         help
>           Enable support for architectures common topology code: e.g., parsing
>           CPU capacity information from DT, usage of such information for
> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
> index 816df6cc444e..714386070463 100644
> --- a/include/linux/sched/topology.h
> +++ b/include/linux/sched/topology.h
> @@ -178,7 +178,8 @@ extern void partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
>  cpumask_var_t *alloc_sched_domains(unsigned int ndoms);
>  void free_sched_domains(cpumask_var_t doms[], unsigned int ndoms);
>
> -bool cpus_share_cache(int this_cpu, int that_cpu);
> +bool cpus_share_cluster(int this_cpu, int that_cpu);   /* Share L2. */
> +bool cpus_share_cache(int this_cpu, int that_cpu);     /* Share LLC. */

I think that Yicong is doing what you want with
cpus_share_lowest_cache() which points to cluster when available or
LLC otherwise
https://lore.kernel.org/lkml/20220720081150.22167-1-yangyicong@hisilicon.com/t/#m0ab9fa0fe0c3779b9bbadcfbc1b643dce7cb7618

>
>  typedef const struct cpumask *(*sched_domain_mask_f)(int cpu);
>  typedef int (*sched_domain_flags_f)(void);
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index a68d1276bab0..ce3402b81e5e 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3904,13 +3904,33 @@ void wake_up_if_idle(int cpu)
>         rcu_read_unlock();
>  }
>
> +/*
> + * Query whether CPUs share LLC.
> + */
>  bool cpus_share_cache(int this_cpu, int that_cpu)
> +{
> +       return per_cpu(sd_llc_id, this_cpu) == per_cpu(sd_llc_id, that_cpu);
> +}
> +
> +#ifdef CONFIG_HAVE_CLUSTERGROUP
> +/*
> + * Query whether CPUs share L2 cache.
> + */
> +bool cpus_share_cluster(int this_cpu, int that_cpu)
>  {
>         if (this_cpu == that_cpu)
>                 return true;
> -
> -       return per_cpu(sd_llc_id, this_cpu) == per_cpu(sd_llc_id, that_cpu);
> +       return cpumask_test_cpu(that_cpu, cpu_clustergroup_mask(this_cpu));
> +}
> +#else
> +/*
> + * Fall-back on querying whether CPUs share LLC.
> + */
> +bool cpus_share_cluster(int this_cpu, int that_cpu)
> +{
> +       return cpus_share_cache(this_cpu, that_cpu);
>  }
> +#endif
>
>  static inline bool ttwu_queue_cond(struct task_struct *p, int cpu)
>  {
> @@ -3929,7 +3949,7 @@ static inline bool ttwu_queue_cond(struct task_struct *p, int cpu)
>          * If the CPU does not share cache, then queue the task on the
>          * remote rqs wakelist to avoid accessing remote data.
>          */
> -       if (!cpus_share_cache(smp_processor_id(), cpu))
> +       if (!cpus_share_cluster(smp_processor_id(), cpu))
>                 return true;
>
>         if (cpu == smp_processor_id())
> --
> 2.39.2
>