Message-ID: <20200731073618.GA28399@in.ibm.com>
Date: Fri, 31 Jul 2020 13:06:18 +0530
From: Gautham R Shenoy <ego@...ux.vnet.ibm.com>
To: Srikar Dronamraju <srikar@...ux.vnet.ibm.com>
Cc: Valentin Schneider <valentin.schneider@....com>,
Michael Ellerman <mpe@...erman.id.au>,
linuxppc-dev <linuxppc-dev@...ts.ozlabs.org>,
LKML <linux-kernel@...r.kernel.org>,
Nicholas Piggin <npiggin@...il.com>,
Anton Blanchard <anton@...abs.org>,
"Oliver O'Halloran" <oohall@...il.com>,
Nathan Lynch <nathanl@...ux.ibm.com>,
Michael Neuling <mikey@...ling.org>,
Gautham R Shenoy <ego@...ux.vnet.ibm.com>,
Ingo Molnar <mingo@...nel.org>,
Peter Zijlstra <peterz@...radead.org>,
Jordan Niethe <jniethe5@...il.com>
Subject: Re: [PATCH v4 09/10] Powerpc/smp: Create coregroup domain
Hi Srikar, Valentin,
On Wed, Jul 29, 2020 at 11:43:55AM +0530, Srikar Dronamraju wrote:
> * Valentin Schneider <valentin.schneider@....com> [2020-07-28 16:03:11]:
>
[..snip..]
> At this time the current topology would be good enough, i.e. BIGCORE would
> always be equal to MC. However, in the future we could have chips with a
> smaller/larger number of CPUs in the LLC than in a BIGCORE, or we could have
> granular or split L3 caches within a DIE. In such a case BIGCORE != MC.
>
> Also, on the current P9 itself, two neighbouring core-pairs form a quad.
> Cache latency within a quad is better than the latency to a distant core-pair,
> and cache latency within a core-pair is way better than the latency within a quad.
> So if we have only 4 threads running on a DIE, all accessing the same
> cache-lines, then we could probably benefit if all the tasks were to run
> within the quad, aka MC/Coregroup.
>
> I have found some latency-sensitive benchmarks that benefit from grouping
> at the quad level (using kernel hacks and not backed by
> firmware changes). Gautham also found similar results in his experiments,
> but he only used binding within the stock kernel.
>
> I am not setting SD_SHARE_PKG_RESOURCES in the MC/Coregroup sd_flags, as the
> MC domain need not be the LLC domain on Power.
I am observing that SD_SHARE_PKG_RESOURCES at L2 provides the best
results for POWER9 in terms of cache benefits during wakeup. On a
POWER9 Boston machine, I ran a producer-consumer test case
(https://github.com/gautshen/misc/blob/master/producer_consumer/producer_consumer.c).

The test case creates two threads, a Producer and a Consumer. Both
work on a fairly large shared array of size 64M. In each iteration,
the Producer performs stores to 1024 random locations and wakes up
the Consumer. In its iteration, the Consumer loads from those exact
1024 locations.
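
For reference, here is a minimal sketch of that kind of test. It is
NOT the exact code at the URL above; the 64M array, the 1024 random
locations per iteration and the example CPU numbers come from the
description, while the handshake, the helper names and ITERATIONS are
assumptions made only for illustration:

/* Rough sketch only. Build: gcc -O2 -pthread producer_consumer_sketch.c */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define ARRAY_SIZE	(64UL * 1024 * 1024)	/* 64M shared array */
#define NR_ACCESSES	1024			/* random locations per iteration */
#define ITERATIONS	5000			/* made up for the sketch */

static char *shared_array;
static size_t idx[NR_ACCESSES];
static volatile char sink;

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
static int turn;	/* 0: Producer's turn, 1: Consumer's turn */

/* Pin the calling thread to one CPU (hypothetical helper). */
static void pin_to_cpu(int cpu)
{
	cpu_set_t set;

	if (cpu < 0)
		return;
	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
	pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

static void *producer(void *arg)
{
	pin_to_cpu(1);			/* e.g. "Producer affined to CPU 1" */
	for (int it = 0; it < ITERATIONS; it++) {
		pthread_mutex_lock(&lock);
		while (turn != 0)
			pthread_cond_wait(&cond, &lock);
		/* Store to 1024 random locations of the shared array. */
		for (int i = 0; i < NR_ACCESSES; i++) {
			idx[i] = (size_t)rand() % ARRAY_SIZE;
			shared_array[idx[i]] = (char)i;
		}
		turn = 1;		/* wake up the Consumer */
		pthread_cond_signal(&cond);
		pthread_mutex_unlock(&lock);
	}
	return NULL;
}

static void *consumer(void *arg)
{
	struct timespec t0, t1;
	long long total_ns = 0;

	pin_to_cpu(3);			/* e.g. "Consumer affined to CPU 3" */
	for (int it = 0; it < ITERATIONS; it++) {
		pthread_mutex_lock(&lock);
		while (turn != 1)
			pthread_cond_wait(&cond, &lock);
		clock_gettime(CLOCK_MONOTONIC, &t0);
		/* Load from the exact locations the Producer just wrote. */
		for (int i = 0; i < NR_ACCESSES; i++)
			sink = shared_array[idx[i]];
		clock_gettime(CLOCK_MONOTONIC, &t1);
		total_ns += (t1.tv_sec - t0.tv_sec) * 1000000000LL +
			    (t1.tv_nsec - t0.tv_nsec);
		turn = 0;		/* hand the array back to the Producer */
		pthread_cond_signal(&cond);
		pthread_mutex_unlock(&lock);
	}
	printf("%d iterations, avg time: %lld ns\n",
	       ITERATIONS, total_ns / ITERATIONS);
	return NULL;
}

int main(void)
{
	pthread_t prod, cons;

	shared_array = malloc(ARRAY_SIZE);
	if (!shared_array)
		return 1;
	pthread_create(&prod, NULL, producer, NULL);
	pthread_create(&cons, NULL, consumer, NULL);
	pthread_join(prod, NULL);
	pthread_join(cons, NULL);
	free(shared_array);
	return 0;
}

Note that the sketch only times the Consumer's loads; the actual test
at the URL may time the iteration differently.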
We measure the number of Consumer iterations per second and the
average time for each Consumer iteration. The smaller the time, the
better it is.
The following results are from pinning the Producer and Consumer to
different combinations of CPUs, covering the same small core, the same
big-core, a neighbouring big-core, a far-off big-core within the same
chip, and different chips. There is also a case where they are not
affined anywhere, and we let the scheduler decide where to wake them up.
We find the best results when the Producer and Consumer are within the
same L2 domain. These numbers are also close to the ones we get when we
let the scheduler wake them up (where the LLC is L2).
## Same Small core (4 threads: Shares L1, L2, L3, Frequency Domain)
Consumer affined to CPU 3
Producer affined to CPU 1
4698 iterations, avg time: 20034 ns
4951 iterations, avg time: 20012 ns
4957 iterations, avg time: 19971 ns
4968 iterations, avg time: 19985 ns
4970 iterations, avg time: 19977 ns
## Same Big Core (8 threads: Shares L2, L3, Frequency Domain)
Consumer affined to CPU 7
Producer affined to CPU 1
4580 iterations, avg time: 19403 ns
4851 iterations, avg time: 19373 ns
4849 iterations, avg time: 19394 ns
4856 iterations, avg time: 19394 ns
4867 iterations, avg time: 19353 ns
## Neighbouring Big-core (Faster data-snooping from L2. Shares L3, Frequency Domain)
Producer affined to CPU 1
Consumer affined to CPU 11
4270 iterations, avg time: 24158 ns
4491 iterations, avg time: 24157 ns
4500 iterations, avg time: 24148 ns
4516 iterations, avg time: 24164 ns
4518 iterations, avg time: 24165 ns
## Any other Big-core from Same Chip (Shares L3)
Producer affined to CPU 1
Consumer affined to CPU 87
4176 iterations, avg time: 27953 ns
4417 iterations, avg time: 27925 ns
4415 iterations, avg time: 27934 ns
4417 iterations, avg time: 27983 ns
4430 iterations, avg time: 27958 ns
## Different Chips (No cache-sharing)
Consumer affined to CPU 175
Producer affined to CPU 1
3277 iterations, avg time: 50786 ns
3063 iterations, avg time: 50732 ns
2831 iterations, avg time: 50737 ns
2859 iterations, avg time: 50688 ns
2849 iterations, avg time: 50722 ns
## Without affining them (Let Scheduler wake-them up appropriately)
Consumer affined to CPU 0-175
Producer affined to CPU 0-175
4821 iterations, avg time: 19412 ns
4863 iterations, avg time: 19435 ns
4855 iterations, avg time: 19381 ns
4811 iterations, avg time: 19458 ns
4892 iterations, avg time: 19429 ns
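
To make the above concrete, the configuration being discussed
(SD_SHARE_PKG_RESOURCES at the big-core/L2 CACHE level, and not at the
coregroup/MC level) would look roughly like the fragment below in the
powerpc topology table. This is only a sketch against the
sched_domain_topology_level array in arch/powerpc/kernel/smp.c, not a
tested patch; the mask helper names (shared_cache_mask, cpu_mc_mask)
are assumptions based on this series:

static int powerpc_shared_cache_flags(void)
{
	/* L2 is the LLC on POWER9, so keep wakeups within the big-core. */
	return SD_SHARE_PKG_RESOURCES;
}

static struct sched_domain_topology_level powerpc_topology[] = {
#ifdef CONFIG_SCHED_SMT
	{ cpu_smt_mask, powerpc_smt_flags, SD_INIT_NAME(SMT) },
#endif
	/* Big-core / L2 level carries SD_SHARE_PKG_RESOURCES ... */
	{ shared_cache_mask, powerpc_shared_cache_flags, SD_INIT_NAME(CACHE) },
	/* ... while the coregroup / MC level does not. */
	{ cpu_mc_mask, SD_INIT_NAME(MC) },
	{ cpu_cpu_mask, SD_INIT_NAME(DIE) },
	{ NULL, },
};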
--
Thanks and Regards
gautham.