Message-ID: <c7fe33f9-51bd-80e8-cb0e-1cefb20a61b9@efficios.com>
Date: Wed, 23 Aug 2023 14:52:17 -0400
From: Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
To: Peter Zijlstra <peterz@...radead.org>
Cc: linux-kernel@...r.kernel.org, Ingo Molnar <mingo@...hat.com>,
Valentin Schneider <vschneid@...hat.com>,
Steven Rostedt <rostedt@...dmis.org>,
Ben Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>,
Daniel Bristot de Oliveira <bristot@...hat.com>,
Vincent Guittot <vincent.guittot@...aro.org>,
Juri Lelli <juri.lelli@...hat.com>,
Swapnil Sapkal <Swapnil.Sapkal@....com>,
Aaron Lu <aaron.lu@...el.com>,
Julien Desfossez <jdesfossez@...italocean.com>, x86@...nel.org
Subject: Re: [RFC PATCH v3 2/3] sched: Introduce cpus_share_l2c

On 8/23/23 11:26, Mathieu Desnoyers wrote:
> On 8/22/23 07:31, Mathieu Desnoyers wrote:
>> Introduce cpus_share_l2c to allow querying whether two logical CPUs
>> share a common L2 cache.
>>
>> Considering a system like the AMD EPYC 9654 96-Core Processor, the L1
>> cache has a latency of 4-5 cycles, the L2 cache has a latency of at
>> least 14ns, whereas the L3 cache has a latency of 50ns [1]. Compared to
>> this, I measured a RAM access latency of around 120ns on my
>> system [2]. So L3 really is only 2.4x faster than RAM accesses.
>> Therefore, with this relatively slow access speed compared to L2, the
>> scheduler will benefit from only considering CPUs sharing an L2 cache
>> for the purpose of using remote runqueue locking rather than queued
>> wakeups.
>
> So I did some more benchmarking to figure out whether the reason for
> this speedup is the latency delta between L2 and L3, or is due to the
> number of hw threads contending on the rq locks.
>
> I tried to force grouping of those "skip ttwu queue" groups by a subset
> of the LLC id, basically by taking the LLC id and adding the cpu number
> modulo N, where N is chosen based on my machine topology.
>
> The end result is that I have similar numbers for groups of 1, 2, 4 HW
> threads (which use rq locks and skip queued ttwu within the group).
> Starting with groups of size 8, the performance starts to degrade.
>
> So I wonder: do machines with more than 4 HW threads per L2 cache exist?
> If that is the case, we should think about grouping not only by L2
> cache, but also sub-divide this group so the number of hw threads per
> group is at most 4.
>
> Here are my results with the hackbench test-case:
>
> - group cpus by 16 hw threads:
>
> Time: 49s
>
> - group cpus by 8 hw threads: (llc_id + cpu modulo 2)
>
> Time: 39s
>
> - group cpus by 4 hw threads: (llc_id + cpu modulo 4)
>
> Time: 34s
>
> - group cpus by 2 hw threads: (llc_id + cpu modulo 8)
> (expect same as L2 grouping on this machine)
>
> Time: 34s
>
> - group cpus by 1 hw thread: (cpu)
>
> Time: 33s

One more interesting data point: I tried changing the grouping to
explicitly group hw threads which sit in different L3s, and for some
groups even on different NUMA nodes (group id = cpu_id % 192). This is
expected to generate really _bad_ cache locality for the runqueue locks
within a group.
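
To illustrate why a strided group id scatters a group across caches,
here is a minimal Python sketch. It is not the kernel change I tested:
the function names are hypothetical, the topology is a toy one (48 CPUs,
16 per LLC), and it assumes CPUs sharing an LLC are numbered
contiguously, which real machines need not do.

```python
from collections import defaultdict


def strided_groups(nr_cpus, stride):
    """Map group id -> list of CPUs, using group id = cpu % stride."""
    groups = defaultdict(list)
    for cpu in range(nr_cpus):
        groups[cpu % stride].append(cpu)
    return dict(groups)


def llcs_spanned(cpus, cpus_per_llc):
    """Count the distinct LLCs a group touches.

    Assumption: CPUs in the same LLC are numbered contiguously.
    """
    return len({cpu // cpus_per_llc for cpu in cpus})


# Toy topology (hypothetical, not the EPYC system from this thread).
groups = strided_groups(nr_cpus=48, stride=16)
print(groups[0])                    # [0, 16, 32]
print(llcs_spanned(groups[0], 16))  # 3
```

In this toy setup every group has 3 members and each member sits in a
different LLC, so the rq locks of a group never share a last-level cache.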

The result for these groups of 3 HW threads is about 33s with the
hackbench benchmark. This seems to confirm that the speedup comes from
making the groups smaller, which reduces the likelihood of contention
on the rq locks, rather than from improving cache locality from L3 to
L2.

So grouping by shared L2 merely happens to yield a suitable group size,
but this benchmark does not significantly benefit from having all
runqueue locks on the same L2.
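
For reference, the group sizes in the quoted results follow directly
from the "llc_id + cpu modulo N" arithmetic. The Python sketch below
reproduces them under stated assumptions: llc_id is taken to be the
first CPU number of the LLC, CPUs within an LLC are numbered
contiguously, and each LLC holds 16 hw threads. The function names are
hypothetical and the real CPU numbering on a given machine may differ.

```python
from collections import Counter


def llc_subgroup_id(cpu, n, cpus_per_llc=16):
    """Group id = llc_id + cpu % n.

    Assumption: llc_id is the first CPU number of the LLC containing cpu.
    """
    llc_id = (cpu // cpus_per_llc) * cpus_per_llc
    return llc_id + cpu % n


def group_sizes(nr_cpus, n, cpus_per_llc=16):
    """Return the set of group sizes produced for a given modulo n."""
    counts = Counter(
        llc_subgroup_id(cpu, n, cpus_per_llc) for cpu in range(nr_cpus)
    )
    return set(counts.values())


# Two toy LLCs of 16 hw threads each (hypothetical topology).
for n in (1, 2, 4, 8, 16):
    print(n, group_sizes(32, n))
```

With these assumptions, N = 1, 2, 4, 8 and 16 give groups of 16, 8, 4, 2
and 1 hw threads respectively, matching the group sizes listed in the
quoted hackbench results.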

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com