Message-ID: <eadede2d-227b-cf99-cf92-4781d73d1352@efficios.com>
Date: Fri, 18 Aug 2023 10:35:21 -0400
From: Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
To: K Prateek Nayak <kprateek.nayak@....com>,
Vincent Guittot <vincent.guittot@...aro.org>
Cc: Peter Zijlstra <peterz@...radead.org>,
linux-kernel@...r.kernel.org, Ingo Molnar <mingo@...hat.com>,
Valentin Schneider <vschneid@...hat.com>,
Steven Rostedt <rostedt@...dmis.org>,
Ben Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>,
Daniel Bristot de Oliveira <bristot@...hat.com>,
Juri Lelli <juri.lelli@...hat.com>,
Swapnil Sapkal <Swapnil.Sapkal@....com>,
Aaron Lu <aaron.lu@...el.com>, x86@...nel.org
Subject: Re: [RFC PATCH 1/1] sched: ttwu_queue_cond: perform queued wakeups
across different L2 caches
On 8/18/23 02:39, K Prateek Nayak wrote:
> Hello Vincent, Mathieu,
>
> On 8/17/2023 9:31 PM, Vincent Guittot wrote:
>> On Thu, 17 Aug 2023 at 17:34, Mathieu Desnoyers
>> <mathieu.desnoyers@...icios.com> wrote:
>>>
>>> Skipping queued wakeups for all logical CPUs sharing an LLC means that
>>> on a 2-socket AMD EPYC 9654 96-Core Processor system (192 cores total),
>>> each group of 8 cores (16 hardware threads) sharing an LLC ends up
>>> grabbing the runqueue locks of the other runqueues in the group on
>>> every wakeup, causing contention on those runqueue locks.
>>>
>>> Improve this by considering only hardware threads sharing an L2 cache
>>> as candidates for skipping the queued wakeups.
>>>
>>> This results in the following benchmark improvements:
>>>
>>> hackbench -g 32 -f 20 --threads --pipe -l 480000 -s 100
>>>
>>> from 49s to 34s. (30% speedup)
>>>
>>> And similarly with perf bench:
>>>
>>> perf bench sched messaging -g 32 -p -t -l 100000
>>>
>>> from 10.9s to 7.4s (32% speedup)
>>>
>>> This was developed as part of the investigation into a weird regression
>>> reported by AMD where adding a raw spinlock to the scheduler context
>>> switch accelerated hackbench. It turned out that replacing this raw
>>> spinlock with a loop of 10000x cpu_relax() within do_idle() had similar
>>> benefits.
>>>
>>> This patch achieves a similar effect without busy-waiting or changing
>>> anything about runqueue selection on wakeup. It considers that only
>>> hardware threads sharing an L2 cache should skip the queued
>>> try-to-wakeup and grab the target runqueue lock directly, rather than
>>> allowing all hardware threads sharing an LLC to do so.
>>>
>>> I would be interested to hear feedback about the performance impact of
>>> this patch (improvement or regression) on other workloads and hardware,
>>> especially on Intel CPUs. One thing we might want to figure out
>>> empirically from the topology is whether there is a maximum number of
>>> hardware threads within an LLC below which it would make sense to use
>>> the LLC rather than the L2 as the group within which queued wakeups can
>>> be skipped.
>>>
>>> [ Only tested on AMD CPUs so far. ]
>>>
>>> Link: https://lore.kernel.org/r/09e0f469-a3f7-62ef-75a1-e64cec2dcfc5@amd.com
>>> Link: https://lore.kernel.org/lkml/20230725193048.124796-1-mathieu.desnoyers@efficios.com/
>>> Link: https://lore.kernel.org/lkml/20230810140635.75296-1-mathieu.desnoyers@efficios.com/
>>> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
>>> Cc: Ingo Molnar <mingo@...hat.com>
>>> Cc: Peter Zijlstra <peterz@...radead.org>
>>> Cc: Valentin Schneider <vschneid@...hat.com>
>>> Cc: Steven Rostedt <rostedt@...dmis.org>
>>> Cc: Ben Segall <bsegall@...gle.com>
>>> Cc: Mel Gorman <mgorman@...e.de>
>>> Cc: Daniel Bristot de Oliveira <bristot@...hat.com>
>>> Cc: Vincent Guittot <vincent.guittot@...aro.org>
>>> Cc: Juri Lelli <juri.lelli@...hat.com>
>>> Cc: Swapnil Sapkal <Swapnil.Sapkal@....com>
>>> Cc: Aaron Lu <aaron.lu@...el.com>
>>> Cc: x86@...nel.org
>>> ---
>>>  arch/Kconfig                   |  6 ++++++
>>>  arch/x86/Kconfig               |  1 +
>>>  drivers/base/Kconfig           |  1 +
>>>  include/linux/sched/topology.h |  3 ++-
>>>  kernel/sched/core.c            | 26 +++++++++++++++++++++++---
>>>  5 files changed, 33 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/arch/Kconfig b/arch/Kconfig
>>> index 205fd23e0cad..e5aac1741712 100644
>>> --- a/arch/Kconfig
>>> +++ b/arch/Kconfig
>>> @@ -340,6 +340,12 @@ config HAVE_ASM_MODVERSIONS
>>>  	  <asm/asm-prototypes.h> to support the module versioning for symbols
>>>  	  exported from assembly code.
>>>
>>> +config HAVE_CLUSTERGROUP
>>> +	bool
>>> +	help
>>> +	  This symbol should be selected by an architecture if it
>>> +	  implements CPU cluster groups.
>>> +
>>>  config HAVE_REGS_AND_STACK_ACCESS_API
>>>  	bool
>>>  	help
>>> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
>>> index cb1031018afa..07813a1a9a58 100644
>>> --- a/arch/x86/Kconfig
>>> +++ b/arch/x86/Kconfig
>>> @@ -299,6 +299,7 @@ config X86
>>>  	select FUNCTION_ALIGNMENT_4B
>>>  	imply IMA_SECURE_AND_OR_TRUSTED_BOOT	if EFI
>>>  	select HAVE_DYNAMIC_FTRACE_NO_PATCHABLE
>>> +	select HAVE_CLUSTERGROUP
>>>
>>>  config INSTRUCTION_DECODER
>>>  	def_bool y
>>> diff --git a/drivers/base/Kconfig b/drivers/base/Kconfig
>>> index 2b8fd6bb7da0..408aaf7a4bd1 100644
>>> --- a/drivers/base/Kconfig
>>> +++ b/drivers/base/Kconfig
>>> @@ -218,6 +218,7 @@ config DMA_FENCE_TRACE
>>>
>>>  config GENERIC_ARCH_TOPOLOGY
>>>  	bool
>>> +	select HAVE_CLUSTERGROUP
>>>  	help
>>>  	  Enable support for architectures common topology code: e.g., parsing
>>>  	  CPU capacity information from DT, usage of such information for
>>> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
>>> index 816df6cc444e..714386070463 100644
>>> --- a/include/linux/sched/topology.h
>>> +++ b/include/linux/sched/topology.h
>>> @@ -178,7 +178,8 @@ extern void partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
>>>  cpumask_var_t *alloc_sched_domains(unsigned int ndoms);
>>>  void free_sched_domains(cpumask_var_t doms[], unsigned int ndoms);
>>>
>>> -bool cpus_share_cache(int this_cpu, int that_cpu);
>>> +bool cpus_share_cluster(int this_cpu, int that_cpu); /* Share L2. */
>>> +bool cpus_share_cache(int this_cpu, int that_cpu); /* Share LLC. */
>>
>> I think that Yicong is doing what you want with
>> cpus_share_lowest_cache(), which points to the cluster when available
>> or to the LLC otherwise:
>> https://lore.kernel.org/lkml/20220720081150.22167-1-yangyicong@hisilicon.com/t/#m0ab9fa0fe0c3779b9bbadcfbc1b643dce7cb7618
> Please correct me if I'm wrong, but with Yicong's latest version
> (https://lore.kernel.org/lkml/20230719092838.2302-2-yangyicong@huawei.com/)
> "sd_share_id" will follow "sd_llc_id" if the SD_CLUSTER domain is
> degenerated, which is the case on any system where the CLUSTER domain
> is the same as the SMT domain.
>
> Logging the CPU and "sd_share_id" on my 3rd Gen EPYC system, I see:
>
> CPU(0) sd_share_id(0)
> CPU(1) sd_share_id(0)
> CPU(2) sd_share_id(0)
> CPU(3) sd_share_id(0)
> CPU(4) sd_share_id(0)
> CPU(5) sd_share_id(0)
> CPU(6) sd_share_id(0)
> CPU(7) sd_share_id(0)
> CPU(8) sd_share_id(8)
> CPU(9) sd_share_id(8)
> ...
> CPU(127) sd_share_id(120)
> CPU(128) sd_share_id(0)
> CPU(129) sd_share_id(0)
> CPU(130) sd_share_id(0)
> CPU(131) sd_share_id(0)
> CPU(132) sd_share_id(0)
> CPU(133) sd_share_id(0)
> CPU(134) sd_share_id(0)
> CPU(135) sd_share_id(0)
> CPU(136) sd_share_id(8)
> ...
>
> "sd_share_id" follows the "sd_llc_id" since CLUSTER domain is
> degenerated.
>
> # echo Y > /sys/kernel/debug/sched/verbose
> # cat /sys/kernel/debug/sched/domains/cpu0/domain*/flags
> SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_SHARE_CPUCAPACITY SD_SHARE_PKG_RESOURCES SD_PREFER_SIBLING # SMT Domain
> SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_SHARE_PKG_RESOURCES SD_PREFER_SIBLING # MC Domain
> SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_PREFER_SIBLING # DIE Domain
> SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_SERIALIZE SD_OVERLAP SD_NUMA # NUMA Domain
>
> But I believe Mathieu's case would require falling back to "core_id" if
> the cluster domain has degenerated.
Yes, I've noticed the same thing on my system. I'm preparing an updated
patchset that renames cpus_share_cache() to cpus_share_llc() and
introduces a new cpus_share_l2c() based on a new sd_l2c_id per-cpu
variable derived from topology_cluster_cpumask(). It seems to work fine
so far.
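
For reference, here is roughly what I have in mind (sketch only, not
the final patch; update_l2c_id() below is just a stand-in for wherever
the topology code ends up computing the id, which is simply the first
CPU of the cluster mask):

DEFINE_PER_CPU(int, sd_l2c_id);

/* Called for each possible CPU when sched domains are (re)built. */
static void update_l2c_id(int cpu)
{
	per_cpu(sd_l2c_id, cpu) = cpumask_first(topology_cluster_cpumask(cpu));
}

/* Return true if both CPUs are hardware threads sharing an L2 cache. */
bool cpus_share_l2c(int this_cpu, int that_cpu)
{
	return per_cpu(sd_l2c_id, this_cpu) == per_cpu(sd_l2c_id, that_cpu);
}

The LLC test in ttwu_queue_cond() then becomes an L2 test, so only
hardware threads sharing an L2 cache skip the queued wakeup and grab
the target runqueue lock directly:

	/* Queue the wakeup remotely unless the target shares our L2. */
	if (!cpus_share_l2c(smp_processor_id(), cpu))
		return true;
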
Thanks,
Mathieu
>
> --
> Thanks and Regards,
> Prateek
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com