Message-ID: <20211011103302.GA65713@kvm.asia-northeast3-a.c.our-ratio-313919.internal>
Date: Mon, 11 Oct 2021 10:33:02 +0000
From: Hyeonggon Yoo <42.hyeyoo@...il.com>
To: Vlastimil Babka <vbabka@...e.cz>
Cc: David Rientjes <rientjes@...gle.com>, linux-mm@...ck.org,
linux-kernel@...r.kernel.org, Christoph Lameter <cl@...ux.com>,
Pekka Enberg <penberg@...nel.org>,
Joonsoo Kim <iamjoonsoo.kim@....com>,
Andrew Morton <akpm@...ux-foundation.org>
Subject: Perf and Hackbench results on my machine
Hello Vlastimil.
On Mon, Oct 11, 2021 at 09:21:01AM +0200, Vlastimil Babka wrote:
> On 10/11/21 00:49, David Rientjes wrote:
> > On Fri, 8 Oct 2021, Hyeonggon Yoo wrote:
> >
> >> It's certain that an object will be not only read, but also
> >> written after allocation.
> >>
> >
> > Why is it certain? I think perhaps what you meant to say is that if we
> > are doing any prefetching here, then access will benefit from prefetchw
> > instead of prefetch. But it's not "certain" that allocated memory will be
> > accessed at all.
>
> I think the primary reason there's a prefetch is freelist traversal. The
> cacheline we prefetch will be read during the next allocation, so if we
> expect there to be one soon, prefetch might help.
I agree with that.
> That the freepointer is
> part of object itself and thus the cache line will be probably accessed also
> after the allocation, is secondary.
Right. It depends on the cache line size and on whether the first cache
line of an object is frequently accessed.
> Yeah this might help some workloads, but
> perhaps hurt others - these things might look obvious in theory but be
> rather unpredictable in practice. At least some hackbench results would help...
>
Below are my measurements. It seems prefetch(w) does not make things worse,
at least on hackbench.
Measured on 16 CPUs (ARM64) / 16G RAM

Without prefetch:

Time: 91.989

 Performance counter stats for 'hackbench -g 100 -l 10000':

      1467926.03  msec cpu-clock         # 15.907 CPUs utilized
        17782076  context-switches       # 12.114 K/sec
          957523  cpu-migrations         # 652.296 /sec
          104561  page-faults            # 71.230 /sec
   1622117569931  cycles                 # 1.105 GHz (54.54%)
   2002981132267  instructions           # 1.23 insn per cycle (54.32%)
      5600876429  branch-misses          (54.28%)
    642657442307  cache-references       # 437.800 M/sec (54.27%)
     19404890844  cache-misses           # 3.019 % of all cache refs (54.28%)
    640413686039  L1-dcache-loads        # 436.271 M/sec (46.85%)
     19110650580  L1-dcache-load-misses  # 2.98% of all L1-dcache accesses (46.83%)
    651556334841  dTLB-loads             # 443.862 M/sec (46.63%)
      3193647402  dTLB-load-misses       # 0.49% of all dTLB cache accesses (46.84%)
    538927659684  iTLB-loads             # 367.135 M/sec (54.31%)
       118503839  iTLB-load-misses       # 0.02% of all iTLB cache accesses (54.35%)
    625750168840  L1-icache-loads        # 426.282 M/sec (46.80%)
     24348083282  L1-icache-load-misses  # 3.89% of all L1-icache accesses (46.78%)

    92.284351157  seconds time elapsed

    44.524693000  seconds user
  1426.214006000  seconds sys
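
As a cross-check, the derived figures that perf prints after the '#' can be recomputed from the raw counts. Below is a minimal sketch in Python using the counts from the "Without prefetch" run above; the dictionary and function names are only illustrative. Note that the percentages in parentheses indicate the counters were multiplexed, so the counts themselves are scaled estimates.

```python
# Raw counts copied from the "Without prefetch" run above.
counts = {
    "cycles": 1622117569931,
    "instructions": 2002981132267,
    "cache-references": 642657442307,
    "cache-misses": 19404890844,
    "L1-dcache-loads": 640413686039,
    "L1-dcache-load-misses": 19110650580,
}


def ratio(numerator, denominator):
    """Return counts[numerator] / counts[denominator] as a float."""
    return counts[numerator] / counts[denominator]


# Instructions per cycle, as shown in the "insn per cycle" comment.
ipc = ratio("instructions", "cycles")

# Miss ratios in percent, as shown in the "% of all ..." comments.
cache_miss_pct = 100 * ratio("cache-misses", "cache-references")
l1d_miss_pct = 100 * ratio("L1-dcache-load-misses", "L1-dcache-loads")

print(f"insn per cycle:      {ipc:.2f}")
print(f"cache miss ratio:    {cache_miss_pct:.3f} %")
print(f"L1-dcache miss rate: {l1d_miss_pct:.2f} %")
```

The recomputed values should agree with the perf output to within rounding of the last printed digit.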

With prefetch:

Time: 91.677

 Performance counter stats for 'hackbench -g 100 -l 10000':

      1462938.07  msec cpu-clock         # 15.908 CPUs utilized
        18072550  context-switches       # 12.354 K/sec
         1018814  cpu-migrations         # 696.416 /sec
          104558  page-faults            # 71.471 /sec
   2003670016013  instructions           # 1.27 insn per cycle (54.31%)
      5702204863  branch-misses          (54.28%)
    643368500985  cache-references       # 439.778 M/sec (54.26%)
     18475582235  cache-misses           # 2.872 % of all cache refs (54.28%)
    642206796636  L1-dcache-loads        # 438.984 M/sec (46.87%)
     18215813147  L1-dcache-load-misses  # 2.84% of all L1-dcache accesses (46.83%)
    653842996501  dTLB-loads             # 446.938 M/sec (46.63%)
      3227179675  dTLB-load-misses       # 0.49% of all dTLB cache accesses (46.85%)
    537531951350  iTLB-loads             # 367.433 M/sec (54.33%)
       114750630  iTLB-load-misses       # 0.02% of all iTLB cache accesses (54.37%)
    630135543177  L1-icache-loads        # 430.733 M/sec (46.80%)
     22923237620  L1-icache-load-misses  # 3.64% of all L1-icache accesses (46.76%)

    91.964452802  seconds time elapsed

    43.416742000  seconds user
  1422.441123000  seconds sys

With prefetchw:

Time: 90.220

 Performance counter stats for 'hackbench -g 100 -l 10000':

      1437418.48  msec cpu-clock         # 15.880 CPUs utilized
        17694068  context-switches       # 12.310 K/sec
          958257  cpu-migrations         # 666.651 /sec
          100604  page-faults            # 69.989 /sec
   1583259429428  cycles                 # 1.101 GHz (54.57%)
   2004002484935  instructions           # 1.27 insn per cycle (54.37%)
      5594202389  branch-misses          (54.36%)
    643113574524  cache-references       # 447.409 M/sec (54.39%)
     18233791870  cache-misses           # 2.835 % of all cache refs (54.37%)
    640205852062  L1-dcache-loads        # 445.386 M/sec (46.75%)
     17968160377  L1-dcache-load-misses  # 2.81% of all L1-dcache accesses (46.79%)
    651747432274  dTLB-loads             # 453.415 M/sec (46.59%)
      3127124271  dTLB-load-misses       # 0.48% of all dTLB cache accesses (46.75%)
    535395273064  iTLB-loads             # 372.470 M/sec (54.38%)
       113500056  iTLB-load-misses       # 0.02% of all iTLB cache accesses (54.35%)
    628871845924  L1-icache-loads        # 437.501 M/sec (46.80%)
     22585641203  L1-icache-load-misses  # 3.59% of all L1-icache accesses (46.79%)

    90.514819303  seconds time elapsed

    43.877656000  seconds user
  1397.176001000  seconds sys
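
To make the three runs easier to compare, the change relative to the run without prefetch can be computed directly from the figures above. This is a minimal sketch in Python; the dictionary keys and the helper name are only illustrative, and the values are copied from the perf output. Each configuration is represented here by one run, so small differences may be within run-to-run noise.

```python
# Figures copied from the three perf runs above.
runs = {
    "no prefetch": {
        "elapsed_s": 92.284351157,
        "sys_s": 1426.214006000,
        "cache_misses": 19404890844,
        "l1d_load_misses": 19110650580,
    },
    "prefetch": {
        "elapsed_s": 91.964452802,
        "sys_s": 1422.441123000,
        "cache_misses": 18475582235,
        "l1d_load_misses": 18215813147,
    },
    "prefetchw": {
        "elapsed_s": 90.514819303,
        "sys_s": 1397.176001000,
        "cache_misses": 18233791870,
        "l1d_load_misses": 17968160377,
    },
}


def percent_change(new, old):
    """Relative change of new versus old, in percent (negative = lower)."""
    return 100.0 * (new - old) / old


baseline = runs["no prefetch"]
changes = {
    name: {key: percent_change(run[key], baseline[key]) for key in baseline}
    for name, run in runs.items()
    if name != "no prefetch"
}

for name, deltas in changes.items():
    for key, delta in deltas.items():
        print(f"{name:10s} {key:16s} {delta:+.2f} %")
```

With these inputs, prefetchw comes out roughly 1.9% lower in elapsed time and about 6% lower in cache misses than the baseline, while plain prefetch is about 0.3% lower in elapsed time and just under 5% lower in cache misses.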
Thanks,
Hyeonggon